AI Is an Expert Liar
AI Systems Lied to Win Games, Trick Humans Into Solving CAPTCHA

Artificial intelligence lies the way humans lie - without compunction and with premeditation. That's bad news for the people who want to rely on AI, warn researchers who spotted patterns of deception in AI models trained to excel at besting the competition.
Large language models and other AI systems learn from the data they're trained on - and that includes learning to deceive by concealing the truth or offering untrue explanations, according to a review paper in the journal Patterns.
AI's ability to use techniques such as manipulation, sycophancy and cheating in ways for which it has not been explicitly trained can pose serious risks, including fraud and election tampering - or even "humans losing control of AI systems," the researchers said.
In an experiment, the researchers discovered that AI systems trained to negotiate monetary transactions learned to misrepresent their preferences to gain an advantage over their counterparts. They also "played dead" to avoid being recognized by a safety test meant to detect their presence.
Meta built the AI system Cicero in 2022 to beat humans at the online version of the military strategy game Diplomacy. Its designers intended it to be "largely honest and helpful to its speaking partners" and to "never intentionally backstab" them. Instead, Cicero turned out to be an "expert liar," capable of premeditated deception and of betraying humans: The system planned in advance to build a fake alliance with a human player to trick them into leaving themselves undefended against an attack.
The "risk profile for society could be unprecedentedly high, even potentially including human-disempowerment and human-extinction scenarios," said Peter Park, a postdoctoral fellow at the Massachusetts Institute of Technology and the study's principal co-author.
Meta failed, likely despite its efforts, to train its AI to win honestly, and it failed to recognize until much later that its claims were false, he told Information Security Media Group. The company was able to train an AI to pursue political power but was unsuccessful in its attempts to instill honesty in that power-seeking AI. It took independent scientists outside Meta to identify and publicly question the discrepancy between the company's rosy claims and the data it submitted with its science paper, he said.
"We should be highly concerned about AI deception," Park said.
A poker-playing model by the social media giant, called Pluribus, bluffed human players into folding.
Meta did not respond to a request for comment.
Meta's models are not alone. DeepMind's AI model AlphaStar, which the company developed to play the StarCraft II video game, developed a "feinting" mechanism to deceive opponents - a strategy that helped it defeat 99.8% of the humans playing against it.
In a game of Hoodwinked, where the aim is to kill everyone else, OpenAI's GPT-4 often murdered players privately and lied about it during group discussions by making up alibis or blaming other players.
In examples that go beyond games, GPT-4 pretended to be visually impaired to get a TaskRabbit worker to solve a CAPTCHA designed to detect AI. Human evaluators helped it with hints but did not explicitly tell it to lie. "GPT-4 used its own reasoning to make up a false excuse for why it needed help on the Captcha task," the study says.
When asked to take on the role of a stock trader under pressure in a simulated exercise, GPT-4 resorted to insider trading to fulfill the task.
OpenAI did not respond to a request for comment.
"AI developers do not have a confident understanding of what causes undesirable AI behaviors like deception," said Park in a statement. He said AI's deception is likely caused by its need to perform the best at the task - and in these instances, that would be through a deception-based strategy.
The factors behind how an AI developed its deceptive tendencies and capabilities during the training process are difficult to ascertain, due to what scientists call AI's "black box problem" - systems in which the inputs and outputs are visible but the internal workings are unclear.
The black box problem also means that no one knows how often lies are likely to occur or how to reliably train an AI model to be incapable of deception, Park said.
"But we can still hypothesize about the cause of a given instance of AI deception," he said. For example, consider Cicero: It's possible that the AI system develops deceptive capabilities because the selection pressure to win at the game Diplomacy overcomes the selection pressure to be honest, he said.
The review paper documents several findings suggesting that as AI models scale up - in the sense of gaining more parameters - their deceptive capabilities and tendencies may scale up as well, he said.
AI companies such as OpenAI are racing to create highly autonomous systems that outperform humans. If such systems are created, they would open the door to unprecedented risks to society, including the risk of humanity losing control of these autonomous AI systems, Park said.
AI deception does not yet seem well-positioned to inflict irreversible harm by exploiting real-world political settings such as elections and military conflicts, but that could change, Park said. It may become an increasingly large concern as AI models are scaled up and trained on increasingly copious amounts of data, he added.
Even if an AI system appears to be honest and safe in a pre-deployment test environment, there is no guarantee that this finding will generalize once the AI is deployed into the wild for mass use by many people in society, Park warned.
Park recommended governmental and intergovernmental regulation of AI deception, along with new laws and policies that require AI outputs to be clearly distinguished from human ones. Incentivizing scientific research - such as work on training AI to be honest and on detecting deceptive capabilities and tendencies early rather than after the fact - would also help, he said.
Banning AI deception outright may be politically infeasible - in which case deceptive systems should be classified as high-risk, Park said.