Large language models (LLMs) demonstrate impressive capabilities, yet they are not infallible. One significant issue is their tendency to “hallucinate,” meaning they can produce inaccurate or unsupported information in response to user queries.
This hallucination phenomenon poses considerable challenges, particularly in high-stakes fields such as healthcare and finance, where LLM outputs must be carefully validated by human fact-checkers. Today, verification often means reviewing the lengthy documents an LLM cites, a process cumbersome and error-prone enough to deter potential users from leveraging generative AI at all.
To address this challenge, researchers at MIT have developed a novel tool called SymGen, which simplifies and expedites the verification of LLM-generated responses. With SymGen, the model generates responses accompanied by precise citations that direct users to the relevant sections of source documents, such as specific cells in a database.
Users can hover over highlighted sections of the generated text to see the underlying data that informed that specific word or phrase. Meanwhile, unhighlighted segments indicate areas that may require further scrutiny to ensure validity.
“SymGen empowers users to concentrate on the parts of the text that warrant more careful examination. Ultimately, it enhances users’ confidence in a model’s outputs by streamlining the verification process,” explains Shannon Shen, a co-lead author and graduate student in electrical engineering and computer science.
According to a user study, utilizing SymGen improved verification speeds by approximately 20% compared to traditional methods. This enhancement not only facilitates quicker validation but may also aid in identifying errors in various real-world applications, from crafting clinical notes to summarizing complex financial reports.
Shen’s co-authors on the study include fellow EECS graduate students Lucas Torroba Hennigen and Aniruddha “Ani” Nrusimha; Bernhard Gapp, president of the Good Data Initiative; and senior authors David Sontag, a professor of EECS and a member of the MIT Jameel Clinic, and Yoon Kim, an assistant professor of EECS and a member of CSAIL. The findings were presented recently at the Conference on Language Modeling.
Enhancing Verification with Symbolic References
Many existing LLMs are engineered to produce citations that link to external documents. However, these validation systems are typically designed without accounting for the time and effort users must spend sifting through numerous citations, Shen notes.
“Generative AI aims to make tasks more efficient, but if it results in extensive document reviews just to verify the accuracy of the information, its utility diminishes,” he remarks.
The MIT researchers took a human-centered approach to solving the validation dilemma. A user first provides the LLM with reference data, such as a table of basketball statistics. Instead of directly completing a task like summarizing a game, the model is prompted to generate its response in a symbolic format during an intermediate step.
In this symbolic generation, the model must explicitly cite the exact cell from the data table that corresponds to the information being referenced. For example, when mentioning “Portland Trailblazers,” the model replaces it with the cell name in the data table containing that term.
“This intermediate step allows for precise, fine-grained references, connecting every fragment of the output exactly to its source,” Torroba Hennigen explains.
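To make the idea concrete, here is a minimal sketch in Python of what that intermediate, symbolic step might look like. The table layout, the {{column}} placeholder syntax, and the variable names are illustrative assumptions, not SymGen’s actual format.

```python
# A single row from a hypothetical table of basketball statistics;
# the column names and values here are illustrative assumptions.
game_stats = {
    "team_name": "Portland Trailblazers",
    "points": 118,
    "rebounds": 47,
}

# Instead of writing the summary directly, the model is prompted to emit
# placeholders naming the exact cell each fact comes from. The {{column}}
# syntax is an assumed convention, not necessarily SymGen's real one.
symbolic_response = (
    "The {{team_name}} won the game, scoring {{points}} points "
    "and grabbing {{rebounds}} rebounds."
)
```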
SymGen then resolves these references with a rule-based tool that copies the cited text verbatim from the source data into the response, guaranteeing that every resolved reference matches its source exactly.
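A resolver along these lines could perform that substitution. This is a simplified sketch of the rule-based step, continuing the example above under the same assumed placeholder syntax, not the tool’s actual implementation.

```python
import re

def resolve_references(symbolic_text: str, data: dict) -> str:
    """Replace each {{cell}} placeholder with the verbatim value from
    the source table, failing loudly on unresolved references."""
    def substitute(match: re.Match) -> str:
        cell = match.group(1)
        if cell not in data:
            raise KeyError(f"unresolved reference: {cell}")
        return str(data[cell])  # copied verbatim from the source data
    return re.sub(r"\{\{(\w+)\}\}", substitute, symbolic_text)

print(resolve_references(symbolic_response, game_stats))
# -> The Portland Trailblazers won the game, scoring 118 points
#    and grabbing 47 rebounds.
```

Because the substitution is purely mechanical, any text that survives resolution matches the source data character for character; only the connective prose around the placeholders still needs human scrutiny.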
Streamlining Validation Processes
One of SymGen’s strengths lies in how it exploits the way LLMs are trained. Models learn from vast amounts of internet data, some of which is written in a placeholder format where symbolic codes stand in for actual values. By prompting the model to produce its response symbolically, SymGen taps into this familiar structure, helping ensure the accuracy of the generated output.
“Our prompt design leverages the LLM’s inherent capabilities,” adds Shen.
Participants in the user study confirmed this benefit, reporting that SymGen made validating LLM-generated content significantly easier and faster than conventional manual review.
However, it’s important to note the system’s limitations: SymGen is only as reliable as its source data, and an LLM may cite an incorrect variable without a human verifier ever noticing the error.
Additionally, users must provide structured data, like tables, for SymGen to function effectively. At present, the system is optimized for tabular data alone.
Looking ahead, the MIT researchers aim to extend SymGen to handle a wider variety of text and data formats. Such improvements could help validate AI-generated summaries of legal documents, for instance, and the team hopes to test the system with healthcare professionals to study how it might surface errors in AI-generated clinical summaries.
This work is funded, in part, by Liberty Mutual and the MIT Quest for Intelligence Initiative.