LogoAISecKit
icon of SecAlign

SecAlign

Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization"

Introduction

SecAlign: Defending Against Prompt Injection with Preference Optimization

SecAlign is a defensive framework designed to enhance the robustness of large language models (LLMs) against prompt injection attacks. The framework leverages preference optimization techniques, creating a preference dataset that includes both prompt-injected (insecure) inputs and secure outputs. By performing preference optimization on this dataset, SecAlign teaches the LLM to prefer secure outputs, significantly reducing the success rates of prompt injections.

Key Features:
  • Preference Optimization: Utilizes a novel approach to preference learning.
  • Robust Against Attacks: Reduces the success rates of various sophisticated prompt injections to nearly 0%.
  • Utility Preservation: Maintains usability akin to pre-trained models while enhancing security.
Benefits:
  • Increased Security: Protects LLMs from potential manipulations and adversarial prompts.
  • Generalization Capability: Effectively mitigates both known and emerging threats.
  • Practical Implementation: Adaptable for various LLM frameworks and applications.
Highlights:
  • SFT (Supervised Fine-Tuning) methodologies are integrated for enhanced model training.
  • Extensive documentation and scripts for easy setup and deployment.
  • Community support through GitHub and collaborative development.

Newsletter

Join the Community

Subscribe to our newsletter for the latest news and updates