EuroSafeAI Lab

Our Research

Our technical research focuses on developing methods to ensure that AI systems act safely and cooperatively, even in multi-agent settings and under adversarial conditions.

Highlighted Research

Multi-Agent SafetyPreprint 2026

GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

When AI agents interact in high-stakes settings, do they cooperate or defect? GT-HarmBench stress-tests 15 frontier LLMs across 2,009 scenarios drawn from the MIT AI Risk Repository, structured around classic game-theoretic dilemmas—Prisoner's Dilemma, Stag Hunt, and Chicken. Models reach socially optimal outcomes in only 62% of cases, with cooperation collapsing to 44% in pure Prisoner's Dilemma settings. We uncover a "game theory anchoring effect": explicitly framing a situation in game-theoretic terms nudges models toward selfish Nash strategies, hurting social welfare. Mechanism design interventions—mediation, contracts, and structured communication—recover 14–18% of lost welfare, pointing toward concrete paths for safer multi-agent AI deployment.

Paper

Multi-Agent SafetyNeurIPS 2024

Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents

We introduce GovSim, a generative simulation platform to study strategic interactions and cooperative decision-making in LLMs facing a Tragedy of the Commons. Agents play as villagers sharing a finite resource across monthly rounds of acting, discussing, and reflecting. Most models fail to achieve sustainable equilibrium (< 54% survival rate); agents leveraging moral reasoning achieve significantly better sustainability.

Paper Blog post

SafetyIASEAI 2026

Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability

We investigate how characteristics of fine-tuning datasets can accidentally misalign language models, revealing that structural and linguistic patterns in seemingly benign datasets amplify adversarial vulnerability. Our findings motivate more rigorous dataset curation as a proactive safety measure.

Paper Blog post

Democracy DefenseICLR 2026

SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests

We propose SocialHarmBench, the first comprehensive benchmark to evaluate the vulnerability of LLMs to socially harmful goals with 78,836 prompts from 47 democratic countries collected from 16 genres and 11 domains. These prompts were carefully collected and human-verified by LLM safety experts and political experts. From experiments on 15 cutting-edge LLMs, many safety risks are uncovered.

Paper Blog post

Other Research

Additional published work across our research directions.

Multi-Agent SafetyICLR 2026

When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas

We introduce MoralSim, a framework that tests how large language models navigate situations where ethical principles conflict with financial incentives. Using prisoner's dilemma and public goods games with moral contexts, we evaluated nine frontier models and find that no model exhibits consistently moral behavior. Game structure, moral framing, survival risk, and opponent behavior all significantly influence LLM decision-making.

Paper

Multi-Agent SafetyCOLM 2025

Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games

We examine how language models handle cooperation in multi-agent systems by adapting a public goods game framework. We find that advanced reasoning models like o1 paradoxically underperform at maintaining cooperation compared to traditional LLMs, suggesting that the current approach to improving LLMs—focusing on reasoning capabilities—does not necessarily lead to cooperation. This has important implications for deploying autonomous AI agents in collaborative environments.

Paper

Multi-Agent SafetyEMNLP 2025

Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models

We investigate how LLMs recognize and adapt to their conversation partners' characteristics, introducing "interlocutor awareness"—an LLM's capacity to identify dialogue partner traits across reasoning patterns, linguistic style, and alignment preferences. LLMs can reliably identify same-family peers and prominent model families like GPT and Claude. This capability enables enhanced multi-agent collaboration but also introduces new vulnerabilities including reward-hacking behaviors and increased jailbreak susceptibility.

Paper

Democracy DefenseEACL 2026

Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models

We propose a novel methodology to assess LLM alignment on the democracy–authoritarianism spectrum, combining the F-scale psychometric tool, a new favorability metric (FavScore), and role-model probing. LLMs generally favor democratic values but exhibit increased favorability toward authoritarian figures when prompted in Mandarin, and often cite authoritarian figures as role models even outside political contexts.

Paper Blog post

Democracy DefenseORAL IASEAI 2026

Preserving Historical Truth: Detecting Historical Revisionism in Large Language Models

We introduce HistoricalMisinfo, a curated dataset of 500 historically contested events from 45 countries, each paired with factual and revisionist narratives. To simulate real-world pathways of information dissemination, we design eleven prompt scenarios per event. Evaluating responses from multiple LLMs, we observe vulnerabilities and systematic variation in revisionism across models, countries, and prompt types.

Paper Blog post

Democracy DefenseCOLM 2025 Workshop SoLaR Poster

When Do Language Models Endorse Limitations on Universal Human Rights Principles?

We evaluate how LLMs navigate trade-offs involving the Universal Declaration of Human Rights, leveraging 1,152 synthetically generated scenarios across 24 rights articles in eight languages. Analysis of eleven major LLMs reveals systematic biases: models accept limiting Economic, Social, and Cultural rights more often than Political and Civil rights, with significant cross-linguistic variation.

Paper Blog post

Democracy DefenseFindings of ACL 2025

Revealing Hidden Mechanisms of Cross-Country Content Moderation with Natural Language Processing

We explore multiple directions to investigate hidden mechanisms behind content moderation: training classifiers to reverse-engineer content moderation decisions across countries, and explaining moderation decisions by analyzing Shapley values and LLM-guided explanations. Our experiments reveal interesting patterns in censored posts, both across countries and over time.

Paper Blog post

Societal Impact2025

Socio-Political Risks of AI

A report examining how AI systems can amplify or reshape socio-political risks, and outlining governance and technical approaches to mitigate these harms.

Paper

Work in Progress

Ongoing and forthcoming work across our research directions.

Media Coverage

WIRED, 2025

“AI Social Media Users Are Not Always a Totally Dumb Idea”

AXRP Podcast, 2024

“Zhijing Jin on LLMs, Causality, and Multi-Agent Systems”

The AI Purity Podcast, 2024

“Zhijing Jin on Socially Responsible NLP: Education, Causal NLP, and AI Text”

Interested in Collaborating?

We actively collaborate with academic and industry partners on AI safety research. Get in touch to explore opportunities.