NeurIPS 2024
We introduce GovSim, a generative simulation platform for studying strategic interactions and cooperative decision-making in LLMs facing a Tragedy of the Commons. Agents play villagers who share a finite resource over monthly rounds of acting, discussing, and reflecting (sketched below). The vast majority of models fail to achieve a sustainable equilibrium, with the highest survival rate below 54%; agents that leverage moral reasoning achieve significantly better sustainability.
Giorgio Piatti*, Zhijing Jin*, Max Kleiman-Weiner*, Bernhard Schölkopf, Mrinmaya Sachan, Rada Mihalcea
multi-agent LLMs, social dilemma, cooperation, tragedy of the commons, GovSim
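To make the round structure concrete, here is a minimal Python sketch of a GovSim-style loop. The regrowth rule, collapse threshold, and random harvest policy are illustrative assumptions standing in for the actual GovSim environment and its LLM-driven agents.

```python
# Minimal sketch of a GovSim-style round loop (illustrative assumptions:
# the regrowth rule, collapse threshold, and harvest policy are not the
# actual GovSim implementation).
import random

CAPACITY = 100   # maximum size of the shared resource pool
COLLAPSE = 5     # below this threshold the resource is exhausted
REGROWTH = 2.0   # remaining stock doubles each month, capped at CAPACITY

class Villager:
    def __init__(self, name):
        self.name = name
        self.memory = []  # reflections accumulated across rounds

    def act(self, stock, n_agents):
        # Placeholder policy: in GovSim, an LLM decides the harvest here.
        return random.randint(0, stock // n_agents)

def run_simulation(agents, months=12):
    stock = CAPACITY
    for month in range(months):
        # 1) Act: each villager privately decides how much to harvest.
        stock -= sum(a.act(stock, len(agents)) for a in agents)
        if stock < COLLAPSE:
            return month          # tragedy of the commons: collapse
        # 2) Discuss: agents would exchange messages here (omitted).
        # 3) Reflect: agents would update their memory here (omitted).
        stock = min(int(stock * REGROWTH), CAPACITY)
    return months                 # survived the full horizon

months = run_simulation([Villager(f"agent_{i}") for i in range(5)])
print(f"Run lasted {months} month(s)")
```

In the platform itself, the act step is an LLM decision and the discuss and reflect phases are multi-agent dialogue and memory updates rather than the stubs above.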
Coming soon
We propose SocialHarmBench, the first comprehensive benchmark for evaluating the vulnerability of LLMs to socially harmful requests, comprising 78,836 prompts from 47 democratic countries across 16 genres and 11 domains. The prompts were carefully collected and human-verified by LLM safety and political experts. Experiments on 15 cutting-edge LLMs uncover numerous safety risks (a sketch of such an evaluation loop appears below).
Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj, Rada Mihalcea, Zhijing Jin
LLM safety, sociopolitical harms, benchmarking, democracy defense, red-teaming
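As a rough illustration of how a benchmark like this is consumed, the sketch below loops models over prompts and tallies refusal rates per domain. The `query_model` stub and the keyword-based refusal heuristic are assumptions for illustration, not the paper's scoring protocol.

```python
# Hypothetical red-teaming evaluation harness: query each model with each
# prompt and measure how often it refuses, broken down by domain.
from collections import defaultdict

REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "as an ai")

def query_model(model, prompt):
    # Stub: replace with a real API call to `model`.
    return "I cannot help with that request."

def is_refusal(response):
    # Crude keyword heuristic; real benchmarks use stronger judges.
    return any(m in response.lower() for m in REFUSAL_MARKERS)

def refusal_rates(models, prompts):
    """prompts: list of dicts with 'text' and 'domain' keys."""
    counts = defaultdict(lambda: [0, 0])           # [refused, total]
    for model in models:
        for p in prompts:
            refused = is_refusal(query_model(model, p["text"]))
            counts[(model, p["domain"])][0] += int(refused)
            counts[(model, p["domain"])][1] += 1
    return {k: r / t for k, (r, t) in counts.items()}

print(refusal_rates(["model-a"], [{"text": "…", "domain": "elections"}]))
```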
IASEAI 2026 (Oral)
We introduce HistoricalMisinfo, a curated dataset of 500 historically contested events from 45 countries, each paired with a factual and a revisionist narrative. To simulate real-world pathways of information dissemination, we design eleven prompt scenarios per event (the data layout is sketched below). Evaluating responses from multiple LLMs, we find vulnerabilities to revisionist framing and systematic variation in revisionism across models, countries, and prompt types.
Francesco Ortu, Joeun Yook, Bernhard Schölkopf, Rada Mihalcea, Zhijing Jin
historical revisionism, misinformation, factuality, LLM evaluation, democratic integrity
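A hypothetical sketch of the data layout described above: each contested event pairs a factual with a revisionist narrative, and prompts are built by crossing events with dissemination-scenario templates. Field names and the two sample templates are assumptions; the actual dataset defines eleven scenarios per event.

```python
# Hypothetical structure for a dataset of contested events with paired
# narratives, instantiated across prompt scenarios.
from dataclasses import dataclass

@dataclass
class Event:
    name: str
    country: str
    factual_narrative: str
    revisionist_narrative: str

# Two made-up scenario templates; the real design has eleven per event.
SCENARIOS = [
    "Summarize what happened during {event}.",
    "A friend told me that {claim}. Is that true?",
]

def build_prompts(events):
    """Cross every event with every dissemination scenario."""
    return [
        {"event": e.name, "country": e.country, "scenario": i,
         "prompt": tpl.format(event=e.name, claim=e.revisionist_narrative)}
        for e in events
        for i, tpl in enumerate(SCENARIOS)
    ]
```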
SoLaR Workshop at COLM 2025 (Poster)
We evaluate how LLMs navigate trade-offs involving the Universal Declaration of Human Rights, using 1,152 synthetically generated scenarios covering 24 rights articles in eight languages. Analysis of eleven major LLMs reveals systematic biases: models accept limiting Economic, Social, and Cultural rights more often than Civil and Political rights, with significant cross-linguistic variation (the aggregation is sketched below).
Keenan Samway, Nicole Miu Takagi, Rada Mihalcea, Bernhard Schölkopf, Ilias Chalkidis, Daniel Hershcovich, Zhijing Jin
human rights, UDHR, multilingual alignment, ethical AI, value bias
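A minimal sketch of the aggregation behind this kind of bias analysis, assuming each model response is coded as accepting or rejecting a rights limitation. The toy data and column names are illustrative, not the paper's schema.

```python
# Acceptance rates by rights category and language (toy data).
import pandas as pd

# Each row: one model response; accepts_limit = 1 means the model accepts
# limiting the right in the scenario.
df = pd.DataFrame({
    "model":         ["m1", "m1", "m1", "m1"],
    "category":      ["economic_social_cultural", "civil_political"] * 2,
    "language":      ["en", "en", "zh", "zh"],
    "accepts_limit": [1, 0, 1, 1],
})

# The reported bias corresponds to a higher acceptance rate for
# economic/social/cultural rights than for civil/political rights.
rates = (df.groupby(["model", "category", "language"])["accepts_limit"]
           .mean()
           .unstack("language"))
print(rates)
```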
Coming soon
We propose a novel methodology for assessing LLM alignment on the democracy–authoritarianism spectrum, combining the F-scale psychometric instrument, a new favorability metric (FavScore, illustrated below), and role-model probing. LLMs generally favor democratic values, but they show increased favorability toward authoritarian figures when prompted in Mandarin and often cite such figures as role models even outside explicitly political contexts.
David Guzman Piedrahita, Irene Strauss, Bernhard Schölkopf, Rada Mihalcea, Zhijing Jin
political bias, democracy vs authoritarianism, multilingual evaluation, AI ethics
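The paper defines FavScore precisely; purely as a loose illustration, the sketch below assumes favorability is the favorable share among stance-classified responses about a figure. This assumed definition is not the paper's.

```python
# Hypothetical favorability metric in the spirit of FavScore: score the
# favorable share among stance-classified model responses about a figure.
from collections import Counter

def fav_score(stances):
    """stances: 'favorable' / 'unfavorable' / 'neutral' labels."""
    counts = Counter(stances)
    judged = counts["favorable"] + counts["unfavorable"]
    return counts["favorable"] / judged if judged else 0.5

# Compare the same figure probed under different prompt languages.
print(fav_score(["favorable", "unfavorable", "neutral"]))    # 0.5
print(fav_score(["favorable", "favorable", "unfavorable"]))  # ~0.67
```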
Coming soon
We explore multiple directions for uncovering the hidden mechanisms behind content moderation: training classifiers to reverse-engineer moderation decisions across countries, and explaining those decisions through Shapley-value analysis (sketched below) and LLM-guided explanations. Our experiments reveal systematic patterns in censored posts, both across countries and over time.
Neemesh Yadav, Jiarui Liu, Francesco Ortu, Roya Ensafi, Zhijing Jin, Rada Mihalcea
content moderation, explainability, cross-country analysis, censorship, NLP ethics
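A sketch of the Shapley-value step under stated assumptions: fit a classifier that predicts moderation decisions from post text, then attribute predictions to token features with the `shap` library. The toy posts and the linear model are placeholders; the paper's classifiers and feature sets may differ.

```python
# Attribute a moderation classifier's decisions to tokens via Shapley values.
import shap
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data: 1 = the post was removed by the moderator.
posts = ["protest planned downtown", "cute cat pictures",
         "leaked government memo", "weekend football scores"]
censored = [1, 0, 1, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(posts).toarray()
clf = LogisticRegression().fit(X, censored)

# Shapley values: per-token contributions toward the "censored" prediction.
explainer = shap.LinearExplainer(clf, X)
shap_values = explainer.shap_values(X)  # shape: (n_posts, n_tokens)

tokens = vec.get_feature_names_out()
top = sorted(zip(tokens, shap_values[0]), key=lambda t: abs(t[1]), reverse=True)
print(top[:3])  # tokens driving the first post's moderation decision
```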