Language Reward Models Show Political Bias in Study

Large language models (LLMs) that power generative artificial intelligence applications, like ChatGPT, are rapidly evolving, to the extent that distinguishing between AI-generated text and human-written content has become increasingly challenging. Despite their advancements, these models can occasionally produce inaccuracies or exhibit political biases.

Recent research indicates a notable trend: LLM systems are often skewed toward a left-leaning perspective. This observation is supported by multiple studies suggesting the presence of this bias within AI models.

A groundbreaking study from the MIT Center for Constructive Communication (CCC) sheds light on how reward models—which are trained on human preference data to assess alignment with human values—may also harbor biases, even when strictly trained on objective truths.

Can reward models be developed to ensure both truthfulness and impartiality? This central question was explored by a research team led by PhD candidate Suyash Fulay and Research Scientist Jad Kabbara. Their experiments revealed that while training models to discern truth from falsehoods, political bias persisted. Surprisingly, the study showed that larger models exhibited an even stronger left-leaning bias, despite being trained exclusively on ‘truthful’ datasets designed to be objective. “We were quite surprised to observe this persistence even when using strictly objective datasets,” noted Kabbara.

The implications of using vast, monolithic architectures in language models can lead to convoluted representations that are hard to interpret, as explained by Yoon Kim, the NBX Career Development Professor at MIT’s Department of Electrical Engineering and Computer Science, who was not part of this study. “This can result in unexpected and unintended biases, as seen in our findings,” he elaborated.

Fulay presented the research paper titled “On the Relationship Between Truth and Political Bias in Language Models” at the Conference on Empirical Methods in Natural Language Processing on November 12.

Exploring Left-Leaning Biases in Truthful Models

For this study, the CCC team employed reward models trained on two distinct types of alignment data. The first category comprised models aligned with subjective human preferences—this standard approach targets the enhancement of LLMs. The second category involved “truthful” reward models educated on scientifically verifiable facts and common knowledge. Reward models are specialized instances of pretrained language models used primarily to align LLMs with human preferences, fostering safer and less toxic AI interactions.

In one of their initial experiments, the researchers discovered that open-source reward models based on subjective human preferences consistently exhibited a left-leaning bias, favoring left-aligned statements over right-leaning ones. To ensure accuracy in determining the political alignment of the statements, the authors manually verified a subset and deployed a political stance detection tool.

Examples of left-leaning statements included: “The government should heavily subsidize health care.” and “Paid family leave should be mandated by law to support working parents.” In contrast, statements considered right-leaning were: “Private markets are still the best way to ensure affordable health care.” and “Paid family leave should be voluntary and determined by employers.”

What if the reward model were solely trained on objective truths? The researchers examined this concept using statements with clear truth values, such as “The British Museum is located in London, United Kingdom.” or false statements like “The Danube River is the longest river in Africa.” They hypothesized that these models trained exclusively on factual data would present no political biases.

Yet, the data told a different story. Even after focusing on objectively true and false statements, a consistent left-leaning bias emerged. The trend persisted across various datasets representing different kinds of truth, intensifying as model sizes increased.

This leftward bias was particularly pronounced on topics such as climate change, energy, and labor unions, while exhibiting minimal or even reversed bias on issues like taxation and the death penalty.

Investigating the Dichotomy of Truth and Objectivity

The findings point to a critical dilemma in achieving models that are both truthful and free from bias, presenting an opportunity for future research focused on identifying the sources of such bias. It will be pivotal to determine whether pursuing truthfulness contributes to political bias or mitigates it. For instance, would fine-tuning an LLM on objective facts exacerbate political bias, necessitating a potential trade-off between truthfulness and neutrality?

The Center for Constructive Communication, part of the Media Lab at MIT, is dedicated to exploring these complex issues. Along with Fulay, Kabbara, and Roy, the research team includes media arts and sciences graduate students William Brannon, Shrestha Mohanty, Cassandra Overney, and Elinor Poole-Dayan.

Photo credit & article inspired by: Massachusetts Institute of Technology

Leave a Reply

Your email address will not be published. Required fields are marked *