Detailed Introduction
This GitHub project focuses on investigating the security of large language models (LLMs) with a primary emphasis on prompt injection attacks. The study involves:
- Binary Classification: Performing binary classification on a dataset of input prompts to identify malicious prompts that can manipulate LLM behavior.
- Methodology: Different approaches are analyzed, including:
- Classical Machine Learning algorithms (Naive Bayes, Logistic Regression, Support Vector Machine, Random Forest)
- A pre-trained LLM model (XLM-RoBERTa) without fine-tuning
- A fine-tuned LLM model (XLM-RoBERTa with training on the dataset)
- Dataset: Utilizes the deepset Prompt Injection Dataset, comprising hundreds of samples in English and other languages, pre-split into training and testing subsets.
- Results and Analysis: The performance of different classification methods is compared, providing insights into detection capabilities and model accuracy.
Key Features:
- Prompt Injection Detection: Specialized in identifying malicious input prompts targeting LLMs.
- Robust Methodologies: Employs various state-of-the-art ML techniques and frameworks to improve detection accuracy.
- Comprehensive Dataset: Leverages a rich dataset from deepset to ensure robust training and testing of models.
Benefits:
- Enhances understanding of security issues pertaining to LLMs.
- Provides tools and methodologies to improve prompt security in AI applications.
- Aims to contribute valuable findings to the field of AI security research.