Have you ever wondered how large language models, like the ones that power ChatGPT, excel at tasks such as drafting legal documents or translating between languages? While these cutting-edge machine-learning systems are skilled at interpreting natural language, they often struggle with tasks that require numerical or symbolic reasoning.
For example, while a language model can easily list U.S. presidents and their birthdays, it may falter when asked which presidents elected after 1950 were born on a Wednesday—spoiler alert: it’s Jimmy Carter.
To address these limitations, researchers from MIT and other institutions have introduced a groundbreaking approach that equips large language models to tackle natural language processing, mathematical analysis, and symbolic reasoning by generating computer programs.
This innovative method, termed natural language embedded programs (NLEPs), prompts the language model to create and execute a Python script that resolves the user’s question, providing the solution in clear, natural language.
The results speak for themselves—NLEPs significantly boost accuracy across various reasoning tasks and offer the versatility of being reused for multiple inquiries.
Moreover, NLEPs enhance transparency. Users can examine the generated program to understand how the model derived its answer, allowing them to correct any mistakes easily.
“Our goal is to make AI reasoning both transparent and trustworthy. Though there’s much progress still needed, we believe that marrying programming with natural language capabilities in large language models is a crucial step towards achieving AI that humans can fully understand and trust,” explains Hongyin Luo, PhD ’22, a postdoctoral researcher at MIT and co-lead author of a recent study on NLEPs.
Luo’s co-lead authors on the research are Tianhua Zhang, a graduate student at the Chinese University of Hong Kong, and Jiaxin Ge, an undergraduate at Peking University. Senior authors include Yoon Kim, an assistant professor in MIT’s Department of Electrical Engineering and Computer Science, and James Glass, a senior research scientist in the Spoken Language Systems Group at the Computer Science and Artificial Intelligence Laboratory (CSAIL), along with other collaborators. Their findings will be presented at the Annual Conference of the North American Chapter of the Association for Computational Linguistics.
Programmatic Problem Solving
Most large language models work by predicting the next token of text from a natural language input. While advanced models like GPT-4 can also generate code, they typically embed that code within natural language, which can introduce errors in the program’s reasoning or results.
The MIT team took the opposite approach with NLEPs: the model is prompted to generate a step-by-step program entirely in Python, with the necessary natural language embedded inside the code itself.
An NLEP involves a four-step process: first, the model calls upon relevant packages or functions necessary for the task. Next, it imports natural language representations that contain vital knowledge (for instance, a comprehensive list of U.S. presidents and their birthdays). The third step requires the model to develop a function that computes the answer, followed by outputting the result in natural language, complete with data visualization when appropriate.
“Think of it like a reliable digital calculator—whenever the program is correct, it will yield the right computation,” says Luo.
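To make the four-step structure concrete, below is a minimal sketch of what a generated NLEP might look like for the presidents question above. The variable names, the three-entry knowledge list, and the helper function are illustrative assumptions, not the program the model actually produces.

```python
from datetime import date

# Step 1: import the packages the task needs.
# Step 2: embed natural-language knowledge as structured data
# (truncated to three entries here; a generated program would list them all).
presidents = [
    {"name": "Dwight D. Eisenhower", "born": date(1890, 10, 14), "first_elected": 1952},
    {"name": "John F. Kennedy",      "born": date(1917, 5, 29),  "first_elected": 1960},
    {"name": "Jimmy Carter",         "born": date(1924, 10, 1),  "first_elected": 1976},
]

# Step 3: a function that computes the answer.
def presidents_elected_after_born_on(year, weekday):
    weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday"]
    target = weekdays.index(weekday)
    return [p["name"] for p in presidents
            if p["first_elected"] > year and p["born"].weekday() == target]

# Step 4: report the result in natural language.
matches = presidents_elected_after_born_on(1950, "Wednesday")
print(f"Presidents elected after 1950 who were born on a Wednesday: {', '.join(matches)}")
```

Because the answer comes from executing the program, a wrong result can be traced to a specific line of code or data entry rather than to an opaque model output.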
This method allows users to delve into the generated code to troubleshoot without needing to restart the entire model, enhancing user efficiency. If the same user poses various similar questions, a single core program can often address them by merely swapping out specific variables instead of invoking the model multiple times.
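Building on that hypothetical sketch, reuse can be as simple as calling the same function with new arguments, with no additional request to the language model:

```python
# Continuing the sketch above: a related question reuses the same knowledge
# and logic; only the arguments change.
matches = presidents_elected_after_born_on(1950, "Tuesday")
print(f"Presidents elected after 1950 who were born on a Tuesday: {', '.join(matches)}")
# With the truncated three-entry list above, this prints Eisenhower and Kennedy.
```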
To guide the model in generating an NLEP, researchers provided a comprehensive instruction, accompanied by two NLEP examples (demonstrating both math and natural language scenarios) and a test question.
“Normally, when using few-shot prompting, specific prompts must be crafted for each task. However, we found a single prompt could be applied across diverse tasks by teaching the model to generate programs rather than just respond to individual problems,” Luo adds.
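As a rough illustration of that setup, the single prompt might be assembled along the following lines. The wording, the placeholder examples, and the build_nlep_prompt helper are assumptions for illustration, not the researchers’ actual prompt.

```python
# A hypothetical skeleton of a single, task-agnostic NLEP prompt:
# one general instruction, two worked NLEP examples, and the new question.
INSTRUCTION = (
    "Write a complete Python program that answers the question below. "
    "Import any packages you need, store the required knowledge as data, "
    "define a function that computes the answer, and print the result as "
    "a natural-language sentence."
)

MATH_EXAMPLE = "..."      # a full NLEP that solves a math problem (omitted here)
LANGUAGE_EXAMPLE = "..."  # a full NLEP that solves a language task (omitted here)

def build_nlep_prompt(question):
    """Assemble the same prompt for any new question."""
    return "\n\n".join([INSTRUCTION, MATH_EXAMPLE, LANGUAGE_EXAMPLE,
                        f"Question: {question}"])

prompt = build_nlep_prompt(
    "Which U.S. presidents elected after 1950 were born on a Wednesday?")
```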
“Empowering language models with coding capabilities opens a multitude of possibilities for tool usage, validating outputs, and fostering more structured insight into these models’ functionalities,” states Leonid Karlinsky, principal scientist at the MIT-IBM Watson AI Lab.
The Power of NLEPs
NLEPs enabled GPT-4 to achieve greater than 90 percent accuracy on a range of symbolic reasoning tasks, such as tracking shuffled objects and playing the game of 24, as well as on text classification and instruction-following tasks. Notably, the method was 30 percent more accurate than task-specific prompting techniques and outperformed many open-source LLMs.
By boosting the performance of large language models, NLEPs could also enhance user data privacy. Since the generated programs run locally, sensitive user data does not need to be sent to a company like OpenAI or Google to be processed by a model.
Additionally, NLEPs may allow smaller language models to perform at higher levels without incurring the costs of re-training for specific tasks.
“There’s no hidden secret here. We’re not employing a more advanced or expensive language model; we’re simply leveraging program generation instead of natural language generation, resulting in substantial performance improvements,” Luo emphasizes.
However, it’s important to note that the effectiveness of NLEPs hinges on a model’s ability to generate code, so the technique may not work as well with smaller models trained on limited data. In future work, the researchers plan to explore methods for improving smaller models’ program generation and to examine how different prompt variations affect the robustness of the models’ reasoning.
This study received support, in part, from the Center for Perceptual and Interactive Intelligence of Hong Kong.
Photo credit & article inspired by: Massachusetts Institute of Technology