Large language models (LLMs) are transforming how we perform various tasks, from translating text to detecting financial fraud. However, even with their remarkable capabilities, these models occasionally produce inaccurate outputs.
Compounding this issue, LLMs may exhibit overconfidence in incorrect predictions or be too uncertain about correct ones. This unpredictability makes it challenging for users to determine when to trust a model’s response.
Researchers typically calibrate machine-learning models so that their confidence levels align with their accuracy. A well-calibrated model expresses less certainty about incorrect predictions and more certainty about correct ones. But traditional calibration methods struggle with the many diverse tasks a large language model can be applied to.
To address this challenge, a team from MIT and the MIT-IBM Watson AI Lab has developed a novel calibration method specifically for LLMs, named Thermometer. This innovative technique creates a smaller auxiliary model that operates on top of the large language model to enhance its calibration.
Thermometer stands out due to its efficiency; it requires less computational power while maintaining the model’s accuracy and ensuring better-calibrated responses even for tasks it hasn’t encountered before.
By improving the calibration of LLMs across various tasks, Thermometer helps users identify instances where a model may be overly confident in its incorrect predictions, thus preventing potential failures in critical applications.
“With Thermometer, we aim to give users a clear indication of a model’s reliability, providing insight into whether a response is accurate or not, based on the model’s uncertainty,” explains Maohao Shen, an electrical engineering and computer science (EECS) graduate student and lead author of a study on Thermometer.
Shen wrote the paper with Gregory Wornell, the Sumitomo Professor of Engineering and director of the Signals, Information, and Algorithms Laboratory at MIT; senior author Soumya Ghosh, a research staff member at the MIT-IBM Watson AI Lab; and other contributors from MIT and the MIT-IBM Watson AI Lab. The findings were recently presented at the International Conference on Machine Learning.
Universal Calibration for LLMs
Traditional machine-learning models are generally designed for a single task, so calibrating them typically involves task-specific methods. LLMs, by contrast, are versatile enough to handle many tasks, and applying a conventional calibration method tuned for one task can hurt the model’s performance on another.
Calibrating an LLM often necessitates multiple predictions from the model, which are then aggregated for improved confidence measures. However, with billions of parameters at work, the computational costs of this approach can become prohibitive.
The team developed Thermometer as a flexible solution that employs a classical calibration technique known as temperature scaling to efficiently adjust an LLM’s calibration for new tasks.
In this context, “temperature” is a scaling parameter that adjusts the model’s confidence to match its prediction accuracy. Traditionally, the right temperature is found using a labeled validation dataset of task-specific examples.
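Classic temperature scaling can be sketched in a few lines. The grid search below is a simple stand-in for the optimizer one would actually use, and all names are illustrative rather than drawn from the paper:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, scaled by a temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.25, 4.0, 100)):
    """Pick the temperature that minimizes negative log-likelihood
    on a labeled validation set (simple grid search)."""
    best_t, best_nll = 1.0, np.inf
    for t in grid:
        probs = softmax(logits, t)
        # NLL of the true labels under the temperature-scaled probabilities
        nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t
```

For an overconfident model, the fitted temperature comes out greater than 1, which softens the probabilities. Because dividing all logits by the same positive number never changes which class scores highest, temperature scaling leaves the model’s predictions, and hence its accuracy, untouched.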
However, acquiring labeled datasets can be challenging, especially for novel applications. For instance, a business might want to use an LLM to handle customer inquiries about a new product but lack a dataset containing those specific questions and answers.
The researchers circumvent this hurdle by training an auxiliary model that sits above the LLM, which automatically predicts the temperature required for proper calibration for new tasks.
This auxiliary model is trained using labeled datasets from a limited selection of representative tasks, allowing it to generalize to new tasks in a related category without requiring additional labeled data.
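The auxiliary model’s role can be sketched as follows. This is a hypothetical illustration only: it assumes a Thermometer-style network consumes pooled hidden features from the LLM and must output a strictly positive temperature. The actual architecture, feature choice, and, crucially, the training on labeled datasets from representative tasks are not shown; the random weights below merely stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN_DIM = 16  # assumed feature size, for illustration only
# Random stand-ins for weights that would be learned on
# labeled datasets from a selection of representative tasks.
W = rng.normal(scale=0.1, size=(HIDDEN_DIM, 1))
b = np.zeros(1)

def softplus(x):
    """Smooth map to positive values, so the temperature is always > 0."""
    return np.log1p(np.exp(x))

def predict_temperature(features):
    """Map per-example LLM features to a single task-level temperature.

    Each example's features yield one positive value; averaging over
    the task's examples gives one temperature for the whole task."""
    per_example = softplus(features @ W + b)  # shape (n_examples, 1)
    return float(per_example.mean())
```

At deployment time, such a predictor needs no labels for the new task: it reads features from unlabeled examples and emits a temperature directly.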
For example, a Thermometer model trained on datasets containing multiple-choice questions from various subjects, like algebra and medicine, could effectively calibrate an LLM intended to answer questions about geometry or biology.
“Our ultimate goal is for it to be effective across any task, although we haven’t quite reached that level yet,” notes Ghosh.
The Thermometer model needs access to only a small portion of the LLM’s internal representations to predict the temperature that calibrates its predictions for a specific task.
Efficiency at Its Core
The technique does not demand multiple training iterations and only minimally impacts the LLM’s speed. Additionally, since temperature scaling does not change the model’s predictions, Thermometer preserves its accuracy.
Comparative studies show that Thermometer consistently yields better-calibrated uncertainty measures across multiple tasks while requiring significantly less computational resources.
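Calibration quality is commonly summarized with the expected calibration error (ECE): predictions are binned by confidence, and the gap between average confidence and average accuracy within each bin is averaged, weighted by bin size. The metric below is the standard formulation; the paper’s exact evaluation protocol may differ:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the
    |confidence - accuracy| gap per bin, weighted by bin size.
    Lower is better-calibrated."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece
```

A model that says “80% confident” and is right 80% of the time scores an ECE near zero; one that says “99% confident” but is right only half the time scores close to 0.5.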
“Provided we train a Thermometer model on a sufficiently diverse set of tasks, it should effectively generalize to any new task, like a large language model that is also a universal model,” adds Shen.
The team also discovered that training a Thermometer model for a smaller LLM allows it to be utilized directly for calibrating larger LLMs within the same family.
Looking ahead, the researchers aspire to adapt Thermometer for more complex text-generating tasks and explore its application to even bigger LLMs. They also aim to quantify the diversity and number of labeled datasets necessary for training a Thermometer model capable of generalizing to new tasks.
This groundbreaking research received funding from the MIT-IBM Watson AI Lab.
Photo credit & article inspired by: Euronews