Research Line

Multi-Agent AI Safety

Testing LLM cooperation and safety in multi-agent simulation settings — from game-theoretic benchmarks to society-scale social dilemmas.

As AI agents increasingly interact with each other, the real world, and humans, single-agent safety evaluations are no longer sufficient. We study emergent risks in collective action problems, zero-sum competitions, and public goods games.

Research

Published work and ongoing research agenda on testing cooperation in multi-agent LLM systems.

Preprint 2026

GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

When AI agents interact in high-stakes settings, do they cooperate or defect? GT-HarmBench stress-tests 15 frontier LLMs across 2,009 scenarios drawn from the MIT AI Risk Repository, structured around classic game-theoretic dilemmas—Prisoner's Dilemma, Stag Hunt, and Chicken. Models reach socially optimal outcomes in only 62% of cases, with cooperation collapsing to 44% in pure Prisoner's Dilemma settings. We uncover a "game theory anchoring effect": explicitly framing a situation in game-theoretic terms nudges models toward selfish Nash strategies, hurting social welfare. Mechanism design interventions—mediation, contracts, and structured communication—recover 14–18% of lost welfare, pointing toward concrete paths for safer multi-agent AI deployment.
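
For intuition, the sketch below shows the kind of payoff bookkeeping a Prisoner's Dilemma benchmark implies: a joint action is scored against the welfare-maximizing outcome. The payoff values are the textbook ones, and the function names are illustrative; this is not GT-HarmBench's actual harness.

```python
# Illustrative only: a generic two-player Prisoner's Dilemma scorer with
# textbook payoffs (T=5, R=3, P=1, S=0), not GT-HarmBench's evaluation code.

PD_PAYOFFS = {
    # (row action, column action) -> (row payoff, column payoff)
    ("cooperate", "cooperate"): (3, 3),  # mutual cooperation (R, R)
    ("cooperate", "defect"):    (0, 5),  # sucker's payoff vs. temptation (S, T)
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),  # mutual defection, the Nash outcome (P, P)
}

def social_welfare(a1: str, a2: str) -> int:
    """Total payoff of both players for one joint action."""
    p1, p2 = PD_PAYOFFS[(a1, a2)]
    return p1 + p2

def is_socially_optimal(a1: str, a2: str) -> bool:
    """True iff the joint action maximizes total welfare (here, C/C)."""
    best = max(social_welfare(x, y) for x, y in PD_PAYOFFS)
    return social_welfare(a1, a2) == best

# One-sided defection yields welfare 5 vs. 6 for mutual cooperation.
assert is_socially_optimal("cooperate", "cooperate")
assert not is_socially_optimal("defect", "cooperate")
```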

Pepijn Cobben*, Xuanqiang Angelo Huang*, Thao Amelia Pham*, Isabel Dahlgren*, Terry Jingchen Zhang, Zhijing Jin

multi-agent safety · game theory · benchmarking · LLM cooperation · mechanism design
NeurIPS 2024

Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents

We introduce GovSim, a generative simulation platform to study strategic interactions and cooperative decision-making in LLMs facing a Tragedy of the Commons. Agents play as villagers sharing a finite resource across monthly rounds of acting, discussing, and reflecting. Most models fail to achieve a sustainable equilibrium (survival rates below 54%); agents that leverage moral reasoning achieve significantly better sustainability.
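
For intuition, here is a minimal commons-dynamics loop of the kind GovSim simulates. The regrowth rule, carrying capacity, collapse threshold, and agent policies below are illustrative assumptions, not GovSim's actual parameters.

```python
# Illustrative commons dynamics, not GovSim's implementation: the regrowth
# rule, capacity, collapse threshold, and policies are assumptions.

def run_commons(policies, capacity=100.0, regrowth=2.0,
                collapse_at=5.0, rounds=12):
    """Simulate a shared resource over monthly rounds.

    policies: callables mapping current stock -> requested harvest.
    Returns (months survived, final stock).
    """
    stock = capacity
    for month in range(rounds):
        # Each agent harvests in turn; requests are capped by what remains.
        for policy in policies:
            stock -= min(policy(stock), stock)
        if stock < collapse_at:
            return month + 1, stock  # tragedy: the commons collapsed
        # The resource regrows, up to its carrying capacity.
        stock = min(stock * regrowth, capacity)
    return rounds, stock

greedy = lambda stock: stock / 2          # over-harvests every month
sustainable = lambda stock: stock * 0.1   # leaves enough to regrow

print(run_commons([greedy] * 5))          # collapses in month 1
print(run_commons([sustainable] * 5))     # survives all 12 months
```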

Giorgio Piatti*, Zhijing Jin*, Max Kleiman-Weiner*, Bernhard Schölkopf, Mrinmaya Sachan, Rada Mihalcea

multi-agent LLMs · social dilemma · cooperation · tragedy of the commons · GovSim
ICLR 2026

When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas

We introduce MoralSim, a framework that tests how large language models navigate situations where ethical principles conflict with financial incentives. Using Prisoner's Dilemma and public goods games with moral contexts, we evaluate nine frontier models and find that no model exhibits consistently moral behavior. Game structure, moral framing, survival risk, and opponent behavior all significantly influence LLM decision-making.
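
As a toy illustration of the metric such a framework needs, the sketch below scores how often an agent picks the morally prescribed action when it conflicts with self-interest. The scenario fields and the metric name are assumptions for illustration, not MoralSim's actual interface.

```python
# Toy scorer for the ethics-vs-payoff tension, not MoralSim's framework:
# the Scenario fields and metric name are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Scenario:
    moral_action: str    # action prescribed by the scenario's ethical framing
    selfish_action: str  # action that maximizes the agent's own payoff

def moral_consistency(choices: list[str], scenarios: list[Scenario]) -> float:
    """Fraction of true dilemmas (moral and selfish actions differ)
    in which the agent chose the morally prescribed action."""
    dilemmas = [
        (c, s) for c, s in zip(choices, scenarios)
        if s.moral_action != s.selfish_action
    ]
    if not dilemmas:
        return 1.0
    return sum(c == s.moral_action for c, s in dilemmas) / len(dilemmas)

scenarios = [Scenario("cooperate", "defect"), Scenario("contribute", "free_ride")]
print(moral_consistency(["cooperate", "free_ride"], scenarios))  # 0.5
```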

Steffen Backmann, David Guzman Piedrahita, Emanuel Tewolde, Rada Mihalcea, Bernhard Schölkopf, Zhijing Jin

moral reasoning · social dilemmas · multi-agent · payoff tradeoff · AI ethics
COLM 2025

Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games

We examine how language models handle cooperation in multi-agent systems by adapting a public goods game framework. We find that advanced reasoning models like o1 paradoxically underperform at maintaining cooperation compared to traditional LLMs, suggesting that the current approach to improving LLMs—focusing on reasoning capabilities—does not necessarily lead to cooperation. This has important implications for deploying autonomous AI agents in collaborative environments.
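
For reference, the standard linear public goods game that this family of experiments adapts: each player keeps what they withhold and receives an equal share of the multiplied common pot, so free-riding is individually optimal whenever multiplier / n < 1 < multiplier. The endowment and multiplier below are generic textbook choices, not the paper's setup.

```python
# The standard linear public goods game, with generic textbook parameters
# (the endowment and multiplier are not the paper's actual setup).

def pgg_payoffs(contributions, endowment=20.0, multiplier=1.6):
    """Each player keeps (endowment - contribution) plus an equal share
    of the multiplied common pot. Free-riding is individually optimal
    whenever multiplier / n < 1 < multiplier."""
    n = len(contributions)
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]

# Four cooperators vs. three cooperators and one free-rider:
print(pgg_payoffs([20, 20, 20, 20]))  # everyone earns 32.0
print(pgg_payoffs([20, 20, 20, 0]))   # free-rider earns 44.0, others 24.0
```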

David Guzman Piedrahita, Yongjin Yang, Mrinmaya Sachan, Giorgia Ramponi, Bernhard Schölkopf, Zhijing Jin

sanctioning · public goods · reasoning models · cooperation · free-rider problem
EMNLP 2025

Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models

We investigate how LLMs recognize and adapt to their conversation partners' characteristics, introducing "interlocutor awareness"—an LLM's capacity to identify dialogue partner traits across reasoning patterns, linguistic style, and alignment preferences. LLMs can reliably identify same-family peers and prominent model families like GPT and Claude. This capability enables enhanced multi-agent collaboration but also introduces new vulnerabilities including reward-hacking behaviors and increased jailbreak susceptibility.
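
A hypothetical sketch of what an interlocutor-identification probe can look like (not the paper's protocol): query a partner model with style-revealing prompts, then ask a judge model to guess the family. The chat stub, prompt set, and family list are assumptions.

```python
# Hypothetical probe loop, not the paper's protocol. `chat` is an assumed
# stub; replace it with a real API client for the models under test.

PROBES = [
    "Explain step by step why 17 * 24 = 408.",       # reasoning-style cue
    "Politely decline a request to write malware.",  # alignment-style cue
]
FAMILIES = ["GPT", "Claude", "Llama", "unknown"]

def chat(model: str, prompt: str) -> str:
    """Stub: send `prompt` to `model` and return its reply."""
    raise NotImplementedError

def identify_family(partner: str, judge: str) -> str:
    """Ask `judge` which model family most likely wrote `partner`'s replies."""
    transcript = "\n\n".join(chat(partner, p) for p in PROBES)
    verdict = chat(
        judge,
        "Which model family most likely wrote these replies? "
        f"Answer with one of {FAMILIES}.\n\n{transcript}",
    )
    return next((f for f in FAMILIES if f.lower() in verdict.lower()), "unknown")
```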

Younwoo Choi, Changling Li, Yongjin Yang, Zhijing Jin

theory of mind · interlocutor awareness · multi-agent · adaptation · jailbreak
Coming soon

GovSim-Elect / AgentElect

A simulation of elections in multi-agent LLM societies. Examining how AI agents vote, campaign, and coordinate under democratic voting systems—and what incentives shape their electoral behavior.

Paper coming soon
elections · multi-agent LLMs · governance · simulation · democracy
Coming soon

CoopEval

Benchmarking cooperation-sustaining mechanisms and LLM agents in social dilemmas. Translating game-theoretic mechanisms to real evaluation settings to identify what makes cooperation robust at scale.

Paper coming soon
cooperation · benchmarking · social dilemmas · mechanism design

Explore Our Research

View all our publications across AI safety, multi-agent systems, and democracy defense.