Large Language Models (LLMs) are remarkably versatile, handling tasks that range from helping a student compose an email to aiding a medical professional in diagnosing cancer. That same breadth, however, makes these models hard to evaluate systematically: crafting a benchmark dataset to assess every possible query is practically infeasible.
In a recent research paper, MIT scientists propose an innovative approach to evaluation. They contend that understanding how humans perceive the capabilities of LLMs is critical since it’s users who determine when to utilize these models.
For instance, a graduate student must judge whether an LLM will be helpful in drafting a particular email, while a clinician must decide in which cases the model is worth consulting. Building on this premise, the researchers introduced a framework for evaluating LLMs based on how well they align with human expectations about task performance.
They developed a Human Generalization Function—a theoretical model illustrating how people adjust their beliefs about an LLM’s capabilities after interacting with it. The researchers then assessed how aligned LLMs are with this Human Generalization Function.
The findings revealed that when an LLM is misaligned with the Human Generalization Function, users can become either overconfident or overly cautious in deploying it, which can lead to unexpected failures. Moreover, in critical scenarios, more capable models can actually perform worse than smaller ones because of this misalignment.
“These tools are promising because they are adaptable; however, their general applicability means we must consider the human factor in their utilization,” stated co-author Ashesh Rambachan, an assistant professor of economics and principal investigator at the Laboratory for Information and Decision Systems (LIDS).
Joining Rambachan on the study are lead author Keyon Vafa, a postdoctoral researcher at Harvard University, and Sendhil Mullainathan, an MIT professor in Electrical Engineering, Computer Science, and Economics and a member of LIDS. The research will be presented at the International Conference on Machine Learning.
The Concept of Human Generalization
We naturally form beliefs about other people based on our interactions with them. If a friend is known for meticulously correcting grammar, for example, you might infer that they would also be good at constructing sentences, even though you have never asked them about sentence composition.
“LLMs often exhibit a human-like quality. Our aim was to demonstrate that the phenomenon of human generalization similarly influences how individuals form perceptions about language models,” explained Rambachan.
The team gave the Human Generalization Function a formal definition: ask questions, observe how a person or an LLM responds, and then infer how that person or model would answer related questions.
For instance, if a person observes an LLM answering questions on matrix inversion correctly, they may wrongly assume that it will also handle basic arithmetic flawlessly. A model misaligned with this function—one that fails to deliver on expectations—could lead to performance issues upon deployment.
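To make this concrete, here is a minimal sketch, in Python, of one way such misalignment could be quantified. Everything in it, including the task names, the probabilities, and the averaging of gaps, is an illustrative assumption rather than the paper's actual formulation or data.

    # Hypothetical sketch: comparing human generalization against an LLM's actual
    # behavior. Task names and numbers are illustrative assumptions, not the
    # paper's formulation or data.

    human_generalization = {
        # (observed task, related task): predicted probability of a correct answer
        # after seeing the model answer the observed task correctly
        ("matrix inversion", "basic arithmetic"): 0.95,
        ("matrix inversion", "calculus word problems"): 0.80,
    }

    llm_accuracy = {
        # related task: the model's measured accuracy (made-up values)
        "basic arithmetic": 0.70,
        "calculus word problems": 0.85,
    }

    def misalignment(beliefs, accuracy):
        """Average absolute gap between human predictions and actual accuracy."""
        gaps = [abs(p - accuracy[related]) for (_, related), p in beliefs.items()]
        return sum(gaps) / len(gaps)

    print(f"Misalignment score: {misalignment(human_generalization, llm_accuracy):.2f}")

A low score under a measure like this would mean the model succeeds and fails roughly where people expect it to; a high score would signal the kind of mismatch the researchers warn about.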
With this groundwork in place, the researchers crafted a survey to quantify how individuals generalize based on their experiences with LLMs and other people.
The survey showed participants questions that a person or an LLM had answered correctly or incorrectly and then asked whether they thought that person or model would answer a related question correctly. The result was a dataset of roughly 19,000 examples of how humans generalize about LLM performance across 79 diverse tasks.
Assessing Misalignment
Results showed that while participants effectively predicted human performance based on previous correct answers, they struggled to apply the same reasoning to LLMs.
“Humans apply their generalization instincts to language models, but this often fails because LLMs do not display expertise patterns akin to human interactions,” remarked Rambachan.
Participants were also more inclined to alter their beliefs about an LLM following incorrect responses as opposed to correct answers. Furthermore, there was a prevalent belief that performance on simpler tasks had little bearing on outcomes for more complex ones.
In situations where people placed more weight on incorrect responses, simpler models ended up outperforming highly capable ones such as GPT-4, as the rough illustration below suggests.
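The toy scoring function below sketches why this can happen. The penalty value and usage patterns are assumptions made for exposition, not the paper's metric or results: they simply show that if wrong answers a user confidently relied on cost more than correct answers earn, a more accurate model that invites over-reliance can come out behind.

    # Toy scoring function (not the paper's metric): wrong answers the user
    # relied on are penalized more heavily than correct answers are rewarded.

    def deployment_score(outcomes, wrong_answer_penalty=3.0):
        """outcomes: list of (answered_correctly, user_relied_on_model) pairs."""
        score = 0.0
        for correct, relied in outcomes:
            if not relied:
                continue  # the user never deployed the model on this query
            score += 1.0 if correct else -wrong_answer_penalty
        return score

    # Hypothetical usage patterns: the capable model's strong showing leads users
    # to rely on it everywhere, including where it fails; the simpler model is
    # trusted on fewer queries but rarely disappoints when it is used.
    capable_model = [(True, True)] * 8 + [(False, True)] * 2
    simple_model = [(True, True)] * 6 + [(False, False)] * 4

    print("capable:", deployment_score(capable_model))  # 8 - 2 * 3 = 2.0
    print("simple :", deployment_score(simple_model))   # 6 - 0 = 6.0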
“Highly capable language models may inadvertently lead users to overestimate performance on related tasks, despite failing in practice,” he noted.
This challenge may stem from our relative unfamiliarity with LLMs—people have far less experience interacting with them compared to other individuals.
“In time, as we engage with language models more frequently, we may improve our generalization capabilities,” he posited.
To further this understanding, the researchers aim to conduct additional studies examining how user perceptions of LLMs evolve through interactions. They also seek to integrate their findings on Human Generalization into LLM development practices.
“When developing these algorithms, incorporating human generalization into performance assessment is crucial,” he asserted.
Meanwhile, the researchers hope their dataset can serve as a benchmark for comparing LLM performance against the Human Generalization Function, which could help these models perform better when deployed in real-world applications.
“This paper makes two significant contributions. First, it identifies a key challenge for general consumer use of LLMs: if users misunderstand when these models will be accurate, they may become frustrated and less likely to utilize them. This underscores the importance of aligning models with user expectations about their generalization capabilities,” explained Alex Imas, a professor of behavioral science and economics at the University of Chicago’s Booth School of Business, who was not involved in the study. “Second, understanding the limitations of generalization helps clarify LLM functionality when they correctly solve problems, providing a clearer picture of their operational mechanics.”
This research received support from the Harvard Data Science Initiative and the Center for Applied AI at the University of Chicago Booth School of Business.
Photo credit & article inspired by: Massachusetts Institute of Technology