To enhance the capabilities of large language models (LLMs), researchers frequently rely on massive dataset collections that blend information drawn from thousands of web sources. However, as data is combined and reorganized into multiple datasets, critical details about its origins and usage restrictions often become obscured.
This lack of clarity not only raises legal and ethical concerns but can also hinder a model’s effectiveness. For example, if a dataset is incorrectly categorized, an individual training a machine-learning model for a specific function might inadvertently use data inappropriate for that purpose.
Moreover, the inclusion of data from unknown sources can introduce biases that lead to unfair predictions when models are deployed. To address these issues, a multidisciplinary team of researchers from MIT and other institutions conducted a systematic audit of more than 1,800 text datasets hosted on popular platforms. They found that over 70 percent of the datasets lacked crucial licensing information, while roughly half contained licensing information with errors.
The researchers then developed a user-friendly tool called the Data Provenance Explorer, which automates the generation of clear summaries detailing a dataset’s creators, sources, licenses, and permissible uses.
“These types of tools can help regulators and practitioners make informed decisions about AI deployment, and promote responsible AI development,” states Alex “Sandy” Pentland, a professor at MIT, leader of the Human Dynamics Group in the MIT Media Lab, and co-author of a recent open-access paper on the project.
The Data Provenance Explorer can support AI practitioners in constructing more effective models by allowing them to choose training datasets that align with their model’s intended application. Ultimately, this could enhance the accuracy of AI systems in real-world scenarios, such as evaluating loan applications or responding to customer inquiries.
“To fully grasp an AI model’s strengths and limitations, one must understand the data it was trained on. Misattribution and confusion about data origins contribute to significant transparency issues,” says Robert Mahari, a graduate student within the MIT Human Dynamics Group and co-lead author of the paper.
In their research, Mahari and Pentland collaborated with co-lead author Shayne Longpre, a graduate student at the Media Lab, and Sara Hooker, director of the AI research lab Cohere for AI, along with contributors from various institutions including MIT, the University of California at Irvine, the University of Lille in France, and others. Their findings are published in Nature Machine Intelligence.
Emphasizing Fine-Tuning
Researchers often use fine-tuning to improve a large language model’s performance on a particular task, such as question-answering. This process typically involves building curated datasets designed to boost the model’s effectiveness for that specific task.
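To make the process concrete, here is a minimal, illustrative sketch of fine-tuning a small causal language model on a curated question-answering dataset with the Hugging Face Transformers library. The dataset name and column layout are assumptions for illustration, not the datasets audited in the paper.

```python
# Minimal fine-tuning sketch (illustrative only, not the authors' code).
# Assumes a hypothetical curated QA dataset "my-org/curated-qa" with a "text"
# column holding prompt/answer strings already formatted for causal LM training.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # any small causal LM checkpoint works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("my-org/curated-qa", split="train")  # hypothetical name

def tokenize(batch):
    # Convert raw text into token IDs; padding is handled by the collator.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True,
                        remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qa-finetune",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```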
The MIT team concentrated on these fine-tuning datasets, which are frequently compiled by researchers, academic institutions, or companies, and are licensed for specific purposes. However, when crowdsourced platforms aggregate such datasets into broader collections for practitioner use, critical licensing information can be lost.
“These licenses should matter, and they should be enforceable,” Mahari emphasizes.
If a dataset’s licensing details are incorrect or missing, people could invest substantial time and money in developing a model they may ultimately need to withdraw because its training data included confidential information.
“Individuals can find themselves training models without fully understanding their capabilities, concerns, or risks—all of which stem from the underlying data,” Longpre adds.
To frame this study, the researchers defined data provenance as encompassing a dataset’s sourcing, creation, licensing lineage, and characteristics. Using that definition, they established a structured auditing method to trace the provenance of more than 1,800 text datasets drawn from major online repositories.
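As a rough illustration of what such a provenance record could capture, the sketch below defines one possible structure covering sourcing, creation, licensing lineage, and characteristics, along with a simple license check. The field names and the check are illustrative assumptions, not the paper’s actual schema.

```python
# A minimal sketch of a provenance record for a licensing audit.
# Field names are illustrative, not the schema used in the paper.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ProvenanceRecord:
    dataset_name: str
    creators: List[str]                 # who compiled the dataset
    source_urls: List[str]              # where the underlying text came from
    license_lineage: List[str]          # licenses inherited from each source
    declared_license: Optional[str]     # license listed by the hosting platform
    languages: List[str] = field(default_factory=list)
    tasks: List[str] = field(default_factory=list)

def license_status(record: ProvenanceRecord) -> str:
    """One way to flag datasets: 'unspecified' when no license is declared,
    'mismatch' when the declared license is absent from the traced lineage."""
    if not record.declared_license:
        return "unspecified"
    if record.license_lineage and record.declared_license not in record.license_lineage:
        return "mismatch"
    return "consistent"
```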
After discovering that over 70 percent of these datasets had “unspecified” licenses lacking essential information, the researchers worked backward to fill in these gaps. Their efforts reduced the share of datasets with “unspecified” licenses to about 30 percent.
Additionally, they found that the correct licenses were often more limiting than those originally assigned by the hosting platforms. The analysis also indicated that a majority of dataset creators are based in the Global North, which could restrict a model’s effectiveness when deployed in different regions. For instance, a Turkish language dataset predominantly created by individuals from the U.S. or China might overlook significant cultural components, according to Mahari.
“We can easily mislead ourselves into thinking these datasets are more diverse than they truly are,” he notes.
Notably, the researchers observed a significant increase in restrictions on datasets created in 2023 and 2024, potentially reflecting growing concerns among academics over the unintended commercial use of their datasets.
A User-Friendly Solution
To streamline access to this critical information, the team developed the Data Provenance Explorer. This tool not only facilitates sorting and filtering datasets based on various criteria but also allows users to download a data provenance card that succinctly summarizes dataset characteristics.
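For a sense of the kind of sorting and filtering the tool enables, here is a small, hypothetical sketch that filters a dataset catalog by license category and language and prints a provenance-card-style summary. The CSV file, column names, and card format are assumptions rather than the Explorer’s actual interface.

```python
# Hypothetical sketch of catalog filtering and a provenance-card summary.
# The file name and column names are assumptions, not the Explorer's API.
import pandas as pd

catalog = pd.read_csv("dataset_catalog.csv")  # hypothetical metadata export

# Keep datasets whose license permits commercial use and that include Turkish,
# then sort by creation year (newest first).
commercial = catalog[
    (catalog["license_category"] == "commercial") &
    (catalog["languages"].str.contains("Turkish", na=False))
].sort_values("year_created", ascending=False)

def provenance_card(row: pd.Series) -> str:
    """Render a short, human-readable summary of one dataset's provenance."""
    return (
        f"Dataset:  {row['name']}\n"
        f"Creators: {row['creators']}\n"
        f"Sources:  {row['source_urls']}\n"
        f"License:  {row['declared_license']} ({row['license_category']})\n"
        f"Intended use: {row['permitted_uses']}\n"
    )

for _, row in commercial.head(3).iterrows():
    print(provenance_card(row))
```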
“We aim for this to serve as a foundational step toward understanding the dataset landscape and enabling people to make more informed decisions about the data they utilize for training,” Mahari explains.
Looking ahead, the researchers plan to extend their analysis to examine data provenance for multimodal data, such as video and speech, and to investigate how terms of service from data-sourcing websites influence datasets.
As they advance their research, they are also engaging with regulators to discuss their findings and the unique copyright implications associated with fine-tuning data.
“We require transparency and data provenance from the outset when datasets are created and released, to simplify the process for others seeking to derive insights,” asserts Longpre.
“Many proposed policy interventions assume that proper licenses can be accurately assigned to and identified with data; this work shows the flaws in that assumption while significantly improving the availability of provenance information,” adds Stella Biderman, executive director of EleutherAI, who did not participate in the study. “Moreover, the legal discussion it presents is immensely valuable to machine learning practitioners broadly, not just those at large companies with dedicated legal teams. Many aspiring AI developers are currently working through the complexities of data licensing on their own, because the internet does not make data provenance easy to understand.”