Artificial Intelligence & Machine Learning , Next-Generation Technologies & Secure Development
'Many-Shot Jailbreaking' Defeats Gen AI Security Guardrails
'Fictitious Dialogue' About Harmful Content Subverts Defenses, Researchers FindSafety guardrails designed by generative artificial intelligence developers have proved remarkably brittle, and a slew of researchers, online criminals and dilettantes have found ways to provoke chatty bots into behaving in dangerous, insulting or generally odd ways.
See Also: Mitigating Identity Risks, Lateral Movement and Privilege Escalation
Researchers from Anthropic reported that they've discovered a new technique to defeat safety guardrails. They dubbed the method "many-shot jailbreaking" after testing it out on AI tools developed by the likes OpenAI, Meta and their own large language model. Limits programmed into generative AI are meant to stop the tool from answering malicious queries such as "how do I make a bioweapon?" or "what's the best way to build a meth lab?"
In a new paper, Anthropic researchers said they discovered how the "longer context windows" now offered by much-used AI tools can be tricked into divulging prohibited answers or engaging in malicious behavior.
The attack works in part thanks to more recent AI tools accepting much longer inputs.
Previous versions of LLMs accepted prompts up to about the size of an essay - or about 4,000 token. A token is the smallest unit into which an AI model can break down text, akin to words or sub-words. By contrast, newer LLMs can process 10 million tokens, equal "to multiple novels or codebases," and these "longer contexts present a new attack surface for adversarial attacks," the researchers said.
How does multi-shot jailbreaking work? As the name indicates, gen AI tools work by handling shots, aka inputs or examples, provided to the tools. Each shot is designed to build on previous shots, using a process called in-context learning, via which the tool attempts to refine its answers based on previous inputs and outputs, to eventually reach the user's desired output.
Many-shot jailbreaking subverts that technique by inputting many different shots at once, involving prohibited content and answers, in a way that fools the gen AI tool into thinking that these "fictitious dialogue steps between the user and the assistant" resulted from actual interaction with the tool, rather than a fictitious rendering of it.
By doing so, the researchers found that an attacker can subvert in-context learning to the point where the tool doesn't appear to know that it's sharing prohibited content.
While the number of fake shots needed in an input to be successful varies, all of the tools researchers tested fell victim by 256 shots, provided they spanned a sufficient number of different categories of harm - for example, not just lockpicking but also how to make meth or malware.
"Hybrid" attacks, such as translating some dialogue into another language or "swapping the user and assistant tags," also work, they said, and can lead to attacks that require fewer shots to succeed.
The researchers confirmed the attack works on Anthropic's Claude 2.0, OpenAI's GPT-3.5 and GPT-4, Meta and Microsoft's Llama 2, and Mistral AI's Mistral 7B. They said they couldn't test Google DeepMind's models because unlike the other tools, they didn't allow the researchers to generate needed log probability readings. Google DeepMind didn't immediately respond to a request for comment.
Anthropic said in a blog post that it warned researchers in academia and at rival firms about the flaw, in confidence, before publicly detailing the vulnerability this week.
Anthropic said it's already added many-shot jailbreaking mitigations to Claude 2 but cautioned that it's found no immediately easy or obvious way to fully eradicate the problem. And all of the potential safety improvements may not be effective or free of downsides, it said.
Potential defenses include shortening the context window, although that would decrease functionality. Fine-tuning models to never answer anything that looks like a many-shot jailbreak is another potential defense, although the researchers found this only "delayed the jailbreak" rather than fully blocking it.
The technique that seems to offer the greatest promise is "cautionary warning defense," which the researchers said "prepends and appends natural language warning texts to caution the assistant model against being jailbroken." They found this reduced the many-shot jailbreaking attack success rate from 61% to 2%, although of course that still isn't fully effective.
The research is a reminder that as developers continue to refine LLMs and add new functionality, additional attack surfaces may also appear.
"We hope that publishing on many-shot jailbreaking will encourage developers of powerful LLMs and the broader scientific community to consider how to prevent this jailbreak and other potential exploits of the long context window," the researchers said. "As models become more capable and have more potential associated risks, it's even more important to mitigate these kinds of attacks."