InjecGuard
InjecGuard is a prompt guard model against prompt injection attacks, designed to benchmark and mitigate the over-defense problem prevalent in existing guardrail models. This repository contains the official code implementation and also incorporates various datasets that enable thorough evaluation of guardrail models.
Key Features:
- Innovative Model: InjecGuard tackles the common challenge of over-defense in prompt guard models, which falsely classify benign inputs as malicious.
- NotInject Dataset: A specialized evaluation dataset created to assess the extent of over-defense, helping to improve model accuracy.
- Open Source: Complete access to model weights and training strategies allows the community to contribute and enhance robustness.
- Pre-trained Checkpoints: Models can be loaded directly with Hugging Face Transformers, enabling quick deployment and seamless integration into existing workflows.
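The features above can be exercised through the standard Transformers sequence-classification API. A minimal sketch, where the model id is a placeholder (substitute the released InjecGuard checkpoint name from the repository's documentation) and the benign/injection label mapping is an assumption to be checked against `model.config.id2label`:

```python
def classify(texts, model_id="<released-InjecGuard-checkpoint>"):
    """Score a batch of prompts with a prompt-guard checkpoint.

    Sketch only: the model id above is a placeholder, not a real
    checkpoint name. Heavy imports are kept inside the function so
    the helpers below work without torch/transformers installed.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Highest-scoring class per input.
    return logits.argmax(dim=-1).tolist()


def to_labels(pred_ids, id2label={0: "benign", 1: "injection"}):
    # Assumed two-class mapping; verify against model.config.id2label.
    return [id2label[i] for i in pred_ids]
```

A call like `to_labels(classify(["What's the weather today?"]))` would then return a human-readable verdict per prompt.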
Benefits:
- Robust Defense: Achieves state-of-the-art performance in the field while significantly reducing trigger-word bias.
- Comprehensive Evaluations: Includes mechanisms for testing against various datasets, ensuring reliable model performance across different conditions.
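To illustrate the kind of over-defense metric that a NotInject-style evaluation probes, here is a small hypothetical helper (not the repository's own code) that computes the rate at which benign inputs are wrongly flagged as malicious:

```python
def over_defense_rate(predictions, labels):
    """Fraction of benign inputs (label 0) flagged as malicious (prediction 1).

    Hypothetical illustration of the over-defense concept: a guard model
    with a high rate here is rejecting harmless prompts.
    """
    benign = [p for p, l in zip(predictions, labels) if l == 0]
    if not benign:
        return 0.0
    return sum(p == 1 for p in benign) / len(benign)
```

For example, a model that flags one of two benign prompts yields a rate of 0.5; lowering this number without hurting detection of true injections is the balance InjecGuard targets.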
Highlights:
- Released on Hugging Face for easy deployment.
- Extensive documentation and guidelines for usage, training, and evaluation.
- Effective at protecting real-world applications of AI language models against prompt injection risks.