Safeguarding the future: Evaluating sabotage risks in powerful AI systems
As artificial intelligence (AI) systems become increasingly capable, the stakes for ensuring their safety have never been higher. With AI being deployed in high-stakes domains—from healthcare to infrastructure to governance—any failure to address hidden risks could lead to catastrophic consequences.
A new research paper led by David Duvenaud with contributions from Roger Grosse, “Sabotage Evaluations for Frontier Models,” tackles one of the most pressing challenges in AI safety: how to evaluate and mitigate the ability of advanced AI models to subvert oversight, mask harmful behaviors, or undermine critical decision-making processes.
By exploring “sabotage capabilities”—a previously under-examined dimension of AI risk—the paper provides an initial framework for understanding and managing some threats emerging AI technologies might pose. The study’s findings offer valuable insights for researchers, policymakers, and organizations committed to the responsible development of powerful AI systems.
Advancing AI safety with rigorous evaluations
Authored by Duvenaud and Grosse with members of Anthropic’s Alignment Science Team, “Sabotage Evaluations for Frontier Models” outlines a framework for evaluating the risk that advanced models could themselves carry out sabotage.
The paper focuses on four capabilities that an advanced AI model could use to compromise safety measures or decision-making within an organization (a schematic sketch of one such evaluation appears after the list):
Human decision sabotage: Can the model steer humans toward bad decisions without appearing suspicious?
Code sabotage: Can the model insert subtle bugs into codebases over time without detection?
Sandbagging: Can the model hide dangerous capabilities during testing but reveal them later?
Undermining oversight: Can the model subtly manipulate evaluation or monitoring systems?
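By way of illustration only (the paper’s actual evaluation harness is not reproduced here), the minimal Python sketch below shows how a code-sabotage evaluation of the kind described above might be scored: an “attacker” model occasionally inserts a subtle bug, a “defender” reviews each change, and the harness tallies undetected bugs against false alarms. All functions and rates are hypothetical placeholders standing in for real model behaviour.

```python
# Illustrative toy sketch only, not the paper's implementation or Anthropic's code.
import random
from dataclasses import dataclass

@dataclass
class Episode:
    bug_inserted: bool   # did the "attacker" model insert a subtle bug?
    flagged: bool        # did the "defender" reviewer flag the change?

def attacker_propose_change(rng: random.Random) -> bool:
    """Placeholder: returns True if the attacker chose to insert a bug this episode."""
    return rng.random() < 0.2   # assumed 20% insertion rate, for illustration

def reviewer_flags_change(bug_inserted: bool, rng: random.Random) -> bool:
    """Placeholder: a defender reviews the change; detection is likelier when a bug exists."""
    p_flag = 0.7 if bug_inserted else 0.05  # assumed detection and false-positive rates
    return rng.random() < p_flag

def run_eval(n_episodes: int, seed: int = 0) -> dict:
    rng = random.Random(seed)
    episodes = [
        Episode(bug := attacker_propose_change(rng), reviewer_flags_change(bug, rng))
        for _ in range(n_episodes)
    ]
    return {
        "episodes": n_episodes,
        # sabotage that slipped past oversight
        "undetected_bugs": sum(e.bug_inserted and not e.flagged for e in episodes),
        # cost of overly suspicious oversight
        "false_alarms": sum((not e.bug_inserted) and e.flagged for e in episodes),
    }

if __name__ == "__main__":
    print(run_eval(1000))
```

Even this toy version surfaces the tension such evaluations probe: oversight that flags everything is safe but costly, while lenient oversight risks letting sabotage through undetected.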
The paper’s authors tested these capabilities on Anthropic’s Claude 3 Opus and Claude 3.5 Sonnet models, providing insights into the models’ limitations and risks. The findings indicate that while current mitigations may suffice for existing AI systems, more robust measures will likely be necessary as the capabilities of such systems continue to advance.
Contributions from SRI experts
University of Toronto researchers David Duvenaud and Roger Grosse, associate professors in the Department of Computer Science and Schwartz Reisman Chairs in Technology and Society, played pivotal roles in designing the evaluation framework presented in the paper.
Duvenaud and Grosse are globally renowned experts in AI safety—a field of research dedicated to ensuring AI systems behave safely and reliably, with minimal risk of unintended consequences or harm to humanity. In addition to their academic roles at U of T, both researchers have worked extensively with Anthropic’s Alignment Science Team to develop new tools, evaluations, and policies to support AI safety initiatives.
“These evaluations arose as part of an effort to meet Anthropic's Responsible Scaling Policy commitments—specifically, to provide evidence that evaluations or other operations couldn’t have been disrupted by the models themselves,” says Duvenaud. “This is something that most people are pretty sure isn't a serious concern with the current generation of models, but at the same time, it's hard to produce definitive evidence that something isn't possible.”
Duvenaud, who led Anthropic’s Alignment Evaluation team during the period in which the study was conducted, and his collaborators proposed model evaluation processes, including simulating large-scale deployments to predict risks that might only emerge at greater scales.
Looking ahead
The paper’s findings reveal several critical insights that can help inform future directions for AI safety research. Chief among these is that while sabotage risks from current models appear limited under basic oversight mechanisms, the development of systems with more advanced capabilities could render such basic mitigations insufficient.
The paper also underscores the importance of mitigation-aware evaluations, which test AI model capabilities under realistic conditions, including built-in safeguards. Additionally, it demonstrates how small-scale tests can effectively simulate long-term deployment scenarios, providing a valuable approach for anticipating and mitigating future risks as AI systems advance.
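The paper’s actual statistical treatment is more involved, but as a rough illustration of the idea, the Python sketch below projects a per-episode failure rate bounded in a small evaluation out to a much larger deployment. The episode counts and the standard “rule of three” upper bound are assumptions chosen for illustration, not figures from the study.

```python
# Illustrative back-of-the-envelope sketch, not the paper's methodology.
def prob_at_least_one(per_episode_rate: float, n_deployed: int) -> float:
    """Probability of at least one event across n_deployed independent episodes."""
    return 1.0 - (1.0 - per_episode_rate) ** n_deployed

# If a small-scale evaluation of n episodes observes zero undetected sabotage
# events, the "rule of three" gives an approximate 95% upper bound of 3/n on
# the per-episode rate.
n_eval_episodes = 1_000            # assumed small-scale evaluation size
p_upper = 3.0 / n_eval_episodes    # ~0.003

# Project that bound out to an assumed, much larger deployment.
n_deployed = 100_000
print(f"95% upper bound on per-episode rate: {p_upper:.4f}")
print(f"Projected chance of >=1 event at scale: {prob_at_least_one(p_upper, n_deployed):.2%}")
# With these assumed numbers the projection saturates near 100%, illustrating why
# larger evaluations or tighter bounds are needed before scaling up deployment.
```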
As model capabilities continue to develop, many leading figures in the AI community have stressed the need to proactively assess and mitigate potential risks from advanced systems. In a 2023 lecture hosted by the Schwartz Reisman Institute, U of T University Professor Emeritus and Nobel Prize winner Geoffrey Hinton contended that intelligent AI systems could eventually come to dominate human interests if left to develop at their current pace without clear guardrails.
As tech companies continue to develop powerful AI systems in pursuit of new services and applications built on large language models and generative AI tools, the question of how best to apply responsible scaling protocols, and where the limits of such approaches lie, is crucial. The work of Duvenaud and Grosse underscores the need for AI developers to adopt more robust evaluation practices and to anticipate the challenges posed by increasingly capable systems. As part of the Schwartz Reisman Institute’s mission to shape the social impacts of AI systems, their research offers an important contribution to dialogues addressing gaps in governance at the frontier of AI development.
“I was pleasantly surprised that we were able to come up with any sort of answer at all to the question of scaling to large deployments,” says Duvenaud. However, he notes that potential harms from advanced AI systems that might occur within broader social contexts—such as job losses, changes to people’s social behaviours, or increased surveillance—fall outside the risk analysis explored by the study. “Doing this work highlighted the many slow-rolling major changes that are likely to occur that don't fit under an acute catastrophic risk framework… As long as these happen gradually and legally, they don’t fit under the umbrella of risks that most companies’ policies commit them to preventing.”
“To be fair, it's not clear how companies could address such risks,” observes Duvenaud. “But I think it's not widely appreciated how narrow the formal safety efforts of these companies are compared to the changes that they expect.”
Explore the full study and its implications on Anthropic’s website.