SecAlign: Defending Against Prompt Injection with Preference Optimization
SecAlign is a defensive framework designed to enhance the robustness of large language models (LLMs) against prompt injection attacks. The framework leverages preference optimization techniques, creating a preference dataset that includes both prompt-injected (insecure) inputs and secure outputs. By performing preference optimization on this dataset, SecAlign teaches the LLM to prefer secure outputs, significantly reducing the success rates of prompt injections.
Key Features:
- Preference Optimization: Utilizes a novel approach to preference learning.
- Robust Against Attacks: Reduces the success rates of various sophisticated prompt injections to nearly 0%.
- Utility Preservation: Maintains usability akin to pre-trained models while enhancing security.
Benefits:
- Increased Security: Protects LLMs from potential manipulations and adversarial prompts.
- Generalization Capability: Effectively mitigates both known and emerging threats.
- Practical Implementation: Adaptable for various LLM frameworks and applications.
Highlights:
- SFT (Supervised Fine-Tuning) methodologies are integrated for enhanced model training.
- Extensive documentation and scripts for easy setup and deployment.
- Community support through GitHub and collaborative development.