llm-security-prompt-injection

Binary Classification: Performing binary classification on a dataset of input prompts to identify malicious prompts that can manipulate LLM behavior.
Methodology: Different approaches are analyzed, including: Classical Machine Learning algorithms (Naive Bayes, Logistic Regression, Support Vector Machine, Random Forest) A pre-trained LLM model (XLM-RoBERTa) without fine-tuning A fine-tuned LLM model (XLM-RoBERTa with training on the dataset)
Dataset: Utilizes the deepset Prompt Injection Dataset, comprising hundreds of samples in English and other languages, pre-split into training and testing subsets.
Results and Analysis: The performance of different classification methods is compared, providing insights into detection capabilities and model accuracy.

This project investigates the security of large language models by classifying prompts to discover malicious injections.