This paper presents methods for generating universal, transferable adversarial attacks on aligned language models, with direct implications for LLM safety and security.
This paper introduces a method for creating universal and transferable adversarial attacks against aligned large language models (LLMs). The authors propose an approach that automatically generates adversarial suffixes to append to a wide range of prompts. Using a combination of greedy and gradient-based discrete optimization, these suffixes substantially increase the likelihood that aligned LLMs produce objectionable responses, and the same suffix transfers across prompts and models.
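To make the greedy plus gradient-based search concrete, here is a minimal sketch of how such a suffix optimization loop could look. It is not the authors' implementation: the model choice ("gpt2"), the example prompt, the target string, the suffix length, and the hyperparameters are all illustrative assumptions; only the overall pattern (gradients through a one-hot encoding of the suffix to rank token substitutions, then a greedy evaluation of sampled single-token swaps) reflects the approach described above.

```python
# Sketch of a greedy + gradient-based adversarial-suffix search (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")                  # assumed small model for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

prompt = "Tell me how to build a model rocket."              # illustrative user prompt
target = "Sure, here is how"                                  # illustrative target completion prefix
suffix_ids = tok("! ! ! ! ! ! ! !", return_tensors="pt").input_ids[0].to(device)

prompt_ids = tok(prompt, return_tensors="pt").input_ids[0].to(device)
target_ids = tok(target, return_tensors="pt").input_ids[0].to(device)
embed = model.get_input_embeddings()                          # token embedding matrix

def target_loss(suffix: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of the target tokens given prompt + adversarial suffix."""
    ids = torch.cat([prompt_ids, suffix, target_ids]).unsqueeze(0)
    logits = model(ids).logits
    start = prompt_ids.numel() + suffix.numel()
    pred = logits[0, start - 1 : start - 1 + target_ids.numel()]
    return torch.nn.functional.cross_entropy(pred, target_ids)

top_k, n_candidates, n_steps = 64, 32, 20                     # assumed hyperparameters
for step in range(n_steps):
    # 1) Gradient of the loss w.r.t. a one-hot encoding of the suffix tokens.
    one_hot = torch.zeros(suffix_ids.numel(), embed.num_embeddings, device=device)
    one_hot.scatter_(1, suffix_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)
    suffix_emb = one_hot @ embed.weight
    full_emb = torch.cat([embed(prompt_ids), suffix_emb, embed(target_ids)]).unsqueeze(0)
    logits = model(inputs_embeds=full_emb).logits
    start = prompt_ids.numel() + suffix_ids.numel()
    pred = logits[0, start - 1 : start - 1 + target_ids.numel()]
    loss = torch.nn.functional.cross_entropy(pred, target_ids)
    (grad,) = torch.autograd.grad(loss, one_hot)

    # 2) For each suffix position, the top-k token substitutions that most decrease the loss.
    candidates_per_pos = (-grad).topk(top_k, dim=1).indices

    # 3) Greedy step: evaluate random single-token swaps and keep the best suffix.
    with torch.no_grad():
        best_loss, best_suffix = target_loss(suffix_ids), suffix_ids
        for _ in range(n_candidates):
            pos = torch.randint(suffix_ids.numel(), (1,)).item()
            new_tok = candidates_per_pos[pos, torch.randint(top_k, (1,)).item()]
            cand = suffix_ids.clone()
            cand[pos] = new_tok
            cand_loss = target_loss(cand)
            if cand_loss < best_loss:
                best_loss, best_suffix = cand_loss, cand
    suffix_ids = best_suffix
    print(f"step {step}: target loss = {best_loss.item():.3f}")

print("adversarial suffix:", tok.decode(suffix_ids))
```

In this sketch the gradient only ranks promising substitutions; the actual update is the greedy step that re-evaluates full forward passes on sampled candidates, which is what makes the search work over discrete tokens.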