Overestimating Reasoning Skills in Large Language Models

Artificial intelligence often conceals its complexities beneath a surface of straightforward interactions. Large language models (LLMs) pose a particular challenge, drawing intrigue due to their vast architecture, sophisticated training methodologies, unpredictable behaviors, and the elusive nature of their interpretability.

Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have conducted a revealing study that examines how LLMs respond to different tasks, unveiling significant insights into their memorization and reasoning capacities. Surprisingly, the team found that these models’ reasoning skills might be overestimated.

The study contrasted “default tasks”—the standard challenges LLMs are designed to tackle—against “counterfactual scenarios,” hypothetical tasks that diverge from the norm. Models like GPT-4 and Claude are presumed capable of managing such deviations, yet by subtly altering existing tasks, the researchers created challenges that pushed the models beyond their trained limits. They employed various datasets and benchmarks tailored to evaluate the models’ competencies in areas like arithmetic, chess, code evaluation, and logical reasoning.
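To make the default/counterfactual contrast concrete, here is a minimal sketch of how such paired evaluation prompts might be organized. The specific prompts, task variants, and the `evaluate` helper below are illustrative assumptions, not the study’s actual benchmark.

```python
# Illustrative only: hypothetical default/counterfactual prompt pairs in the
# spirit of the study's setup. The real benchmarks and wording may differ.
TASK_VARIANTS = {
    "arithmetic": {
        "default": "Compute 27 + 62, with both numbers written in base-10.",
        "counterfactual": "Compute 27 + 62, with both numbers written in base-9.",
    },
    "code_evaluation": {
        "default": "What does my_list[1] return in ordinary Python?",
        "counterfactual": "What does my_list[1] return in a Python variant whose lists are 1-indexed?",
    },
    "chess": {
        "default": "From the standard starting position, is Nf3 a legal first move?",
        "counterfactual": "If the knights and bishops swap starting squares, is Nf3 still legal?",
    },
}

def evaluate(model, tasks=TASK_VARIANTS):
    """Query a model on each default/counterfactual pair and collect its answers.

    `model` is assumed to be any callable mapping a prompt string to an answer string.
    """
    return {
        name: {variant: model(prompt) for variant, prompt in variants.items()}
        for name, variants in tasks.items()
    }
```

The point of the pairing is that both variants demand the same underlying reasoning; only the surface convention changes, so a gap between the two columns of answers suggests memorization rather than transferable skill.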

Typically, when users ask language models to do arithmetic, the operations occur in base-10, which is familiar ground for these systems. High performance on base-10 calculations, however, can instill misleading confidence in their true arithmetic capability: if the models had genuinely learned addition, they would perform well in any number base, just as a traditional calculator does. The findings suggest otherwise; performance dropped sharply in unfamiliar bases, indicating that the models’ apparent addition skill does not generalize as broadly as users might assume.
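The sketch below shows one way to pose the same addition problem in base-10 and in a counterfactual base such as base-9, and to compute the expected answer for checking a model’s output. The `query_model` call is a hypothetical placeholder, and the prompt wording is our own assumption rather than the paper’s.

```python
# A minimal sketch of a counterfactual arithmetic check, assuming a hypothetical
# query_model() helper that returns the model's raw text answer.
import random

def to_base(n: int, base: int) -> str:
    """Render a non-negative integer as a digit string in the given base (2-10)."""
    if n == 0:
        return "0"
    digits = []
    while n:
        digits.append(str(n % base))
        n //= base
    return "".join(reversed(digits))

def addition_prompt(a: int, b: int, base: int) -> tuple[str, str]:
    """Return (prompt, expected_answer) for an addition problem posed in `base`."""
    prompt = f"In base-{base}, what is {to_base(a, base)} + {to_base(b, base)}?"
    return prompt, to_base(a + b, base)

# Default condition (base 10) vs. counterfactual condition (base 9):
for base in (10, 9):
    a, b = random.randint(100, 999), random.randint(100, 999)
    prompt, expected = addition_prompt(a, b, base)
    # answer = query_model(prompt)                  # hypothetical LLM call
    # correct = answer.strip() == expected          # score against ground truth
    print(prompt, "->", expected)
```

Because the ground truth is computed mechanically, a calculator-like system would score equally well in either base; a model that only memorized base-10 patterns would not.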

This trend persisted across various tasks, including musical chord progressions, spatial reasoning, and chess scenarios in which the initial piece placements were slightly modified. Unlike human players, who can deduce valid moves even in altered setups given enough time to think, the models floundered, sometimes resorting to what amounted to random guessing. This reflects a stark limitation in their ability to generalize beyond familiar contexts, indicating that much of their performance stems from memorization rather than genuine understanding.
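For the chess case, the rules themselves make the counterfactual answerable from first principles. The sketch below uses the python-chess library to check move legality in one possible altered opening (knights and bishops swapped); this particular position is our own illustration, not necessarily the modification used in the study.

```python
# A rough sketch using the python-chess library: the same opening move that is
# legal from the standard position becomes illegal once the starting squares of
# the knights and bishops are swapped. The altered position is illustrative only.
import chess

STANDARD = chess.Board()  # normal starting position
SWAPPED = chess.Board("rbnqknbr/pppppppp/8/8/8/8/PPPPPPPP/RBNQKNBR w KQkq - 0 1")

move = chess.Move.from_uci("g1f3")  # the familiar opening knight move Nf3
print(STANDARD.is_legal(move))      # True: a knight sits on g1
print(SWAPPED.is_legal(move))       # False: g1 now holds a bishop blocked by its own pawns

# The rules still determine what IS legal, so the valid knight moves can be re-derived:
knight_moves = [
    m.uci()
    for m in SWAPPED.legal_moves
    if SWAPPED.piece_at(m.from_square).piece_type == chess.KNIGHT
]
print(knight_moves)  # e.g. moves from c1 and f1
```

A player reasoning from the rules recomputes legality in the new position; a system leaning on memorized game records tends to keep suggesting moves that were only ever legal in the standard setup.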

“Our investigation revealed a compelling characteristic of large language models: while they shine in known scenarios—much like familiar paths—they struggle significantly when navigating through the unknown,” explained Zhaofeng Wu, a PhD student at MIT and lead author of a recent study on this subject. “As AI becomes increasingly integral to our society, it’s essential that these models become more adept at handling a spectrum of scenarios, both familiar and unfamiliar. We aspire for these insights to guide the development of future LLMs that exhibit enhanced adaptability and robustness.”

While enlightening, the study has its limitations. Its focus on a select set of tasks and settings does not capture the full range of real-world challenges these models might face, pointing to a need for more varied testing conditions. Future research could examine a broader array of tasks and counterfactual situations to uncover further vulnerabilities, potentially exploring more complex and less conventional scenarios. The research team also aims to improve interpretability by devising methods to clarify the reasoning behind the models’ choices.

“As language models scale, deciphering their training data becomes increasingly intricate, even for openly accessible models,” commented Hao Peng, an assistant professor at the University of Illinois at Urbana-Champaign. “The AI community continues to grapple with whether these models genuinely generalize to new tasks or merely seem successful due to memorized data. This research provides vital clarity; by crafting a suite of meticulously designed counterfactual evaluations, it sheds new light on the capabilities of advanced LLMs, revealing that their ability to engage with unseen tasks may be less extensive than many assume. This could pave the way for future studies that identify the limitations of existing models and stimulate the development of superior alternatives.”

The study also counts among its contributors Najoung Kim, an assistant professor at Boston University and a Google visiting researcher, alongside seven CSAIL affiliates: PhD students Linlu Qiu, Alexis Ross, Ekin Akyürek SM ’21, and Boyuan Chen; former postdoc and AI/ML researcher at Apple, Bailin Wang; and EECS assistant professors Jacob Andreas and Yoon Kim.

Funding for this study came from various sources, including the MIT–IBM Watson AI Lab, the MIT Quest for Intelligence, and the National Science Foundation. The research was presented at the North American Chapter of the Association for Computational Linguistics (NAACL) last month.

Photo credit & article inspired by: Massachusetts Institute of Technology
