Generative AI is rapidly gaining recognition for its capability to produce text and images, yet these forms of content represent only a small portion of the vast data landscape we encounter daily. Data is constantly being generated, whether from a patient's journey through the healthcare system, flight disruptions caused by severe weather, or user interactions with software applications.
By leveraging generative AI to produce realistic synthetic data tailored to these scenarios, organizations can enhance patient treatment, optimize flight operations, and refine software platforms, particularly in situations where access to real-world data is limited or sensitive.
For the past three years, DataCebo, a spinout from MIT, has been at the forefront of this innovation with its generative software system, the Synthetic Data Vault (SDV). This tool empowers organizations to create synthetic data for testing software applications and training machine learning models.
The SDV has been downloaded more than 1 million times, with over 10,000 data scientists using the open-source library to generate synthetic tabular data. Founders Kalyan Veeramachaneni, a principal research scientist at MIT, and Neha Patki, an MIT alumna, attribute the tool's widespread adoption to its game-changing impact on software testing.
SDV Takes Off
In 2016, Veeramachaneni’s team in the Data to AI Lab launched a suite of open-source generative AI tools aimed at creating synthetic data that replicates the statistical properties of real-world data.
Companies can now utilize synthetic data instead of sensitive real data in their applications while maintaining the underlying statistical relationships, allowing new software to be tested through simulations before public release.
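For readers who want a concrete picture of that workflow, here is a minimal sketch using the SDV library's documented single-table API; the tiny transactions table, its column names, and the choice of synthesizer are illustrative assumptions rather than details from any DataCebo deployment.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# A tiny, made-up stand-in for a sensitive real-world table.
real_data = pd.DataFrame({
    "account_id": ["A1", "A2", "A3", "A4"],
    "balance": [1200.50, 85.00, 0.00, 430.25],
    "num_transactions": [14, 3, 1, 7],
})

# Describe the table so the synthesizer knows each column's type.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Fit a generative model to the real data, then sample synthetic rows
# that preserve its statistical relationships.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=1000)
print(synthetic_data.head())
```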
The need for such a tool became clear when Veeramachaneni's group collaborated with organizations that were eager to share their data for research purposes but could not do so because of its sensitivity.
“At MIT, you gain insights from various use cases,” Patki notes. “You collaborate with finance and healthcare sectors, leading to solutions applicable across multiple industries.”
In 2020, the team established DataCebo to expand the SDV's features for larger organizations, a move that has led to a diverse array of notable applications.
For example, with DataCebo’s new flight simulation tool, airlines can prepare for rare weather scenarios that are unlikely to be adequately represented by historical data. Additionally, SDV has been utilized to synthesize medical records, predicting health outcomes for cystic fibrosis patients. A recent initiative in Norway used SDV to generate synthetic student data to assess the fairness of various admissions policies.
In 2021, a competition hosted on the data science platform Kaggle invited data scientists to create synthetic data sets with SDV, attracting around 30,000 participants eager to devise solutions and forecast outcomes based on realistic datasets.
DataCebo has maintained its strong ties to MIT, with all of its current employees being MIT alumni.
Enhancing Software Testing
While the open-source tools serve numerous functions, DataCebo is particularly focused on amplifying its impact in software testing.
“Data is crucial for testing software applications,” Veeramachaneni highlights. “Traditionally, developers manually scripted synthetic data. With our generative models derived from SDV, one can learn from a sample and generate extensive volumes of synthetic data with the same properties as real-world data, or create specific scenarios to rigorously test software.”
For instance, if a bank aims to verify a software solution that prevents overdraft transfers, it requires simulations of numerous accounts executing transactions simultaneously. Generating such data manually would be a tedious process. DataCebo’s generative models enable clients to create any edge cases necessary for thorough testing.
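One way such edge cases might be expressed is through SDV's conditional sampling; the sketch below assumes a synthesizer like the one fit earlier, and the zero-balance condition is a hypothetical stand-in for the bank scenario rather than DataCebo's actual test setup.

```python
from sdv.sampling import Condition

# Request rows matching a rare condition: accounts with a zero balance,
# useful for stress-testing overdraft handling.
zero_balance = Condition(
    num_rows=500,
    column_values={"balance": 0.00},
)

# Reuses the synthesizer fit in the earlier sketch.
edge_cases = synthesizer.sample_from_conditions(conditions=[zero_balance])
print(edge_cases["balance"].describe())
```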
“In many fields, data sensitivity is a concern,” Patki points out. “Companies often face regulations surrounding data access; even absent legal constraints, it’s prudent for companies to manage access diligently. Synthetic data presents a more privacy-conscious alternative.”
Expanding Synthetic Data
Veeramachaneni asserts that DataCebo is pioneering advances in what they term synthetic enterprise data—data generated from user interactions with large-scale software applications.
“This form of enterprise data is intricate and lacks universal availability, unlike language data,” he explains. “When users engage with our open-source software, their feedback on its performance reveals unique patterns, allowing us to enhance our algorithms. Essentially, we are developing a collection of complex patterns that are not as readily accessible in language or imagery.”
DataCebo has also introduced notable enhancements to SDV, including the SDMetrics library for assessing how realistic generated data is, and SDGym, a framework for benchmarking and comparing the performance of different synthesizer models.
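As a rough illustration, and assuming the real and synthetic tables from the earlier sketch, SDV's evaluation helper (which is backed by SDMetrics) can score how closely the synthetic data tracks the original:

```python
from sdv.evaluation.single_table import evaluate_quality

# Compare column shapes and pairwise column correlations between the
# real table and the synthetic sample generated earlier.
quality_report = evaluate_quality(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata,
)
print(quality_report.get_score())  # overall quality score between 0 and 1
print(quality_report.get_details(property_name="Column Shapes"))
```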
“Trust in the new data is crucial for organizations,” Veeramachaneni emphasizes. “[Our tools provide] programmable synthetic data, permitting enterprises to inject their insights and intuitions into model-building for greater transparency.”
As businesses across various sectors race to integrate AI and data science technologies, DataCebo is paving the way for responsible and transparent adoption.
“In the coming years, generative models producing synthetic data will fundamentally reshape all data-related work,” Veeramachaneni predicts. “We believe that up to 90 percent of enterprise operations can be conducted using synthetic data.”
Photo credit & article inspired by: Massachusetts Institute of Technology