Artificial intelligence (AI) is increasingly used for medical diagnosis, particularly in analyzing imaging data such as X-rays. Despite the promise of these models, recent studies indicate that they do not perform uniformly well across demographic groups, often showing lower accuracy on images of women and people of color.
In a 2022 study, researchers at MIT found that AI systems can accurately predict a patient’s self-reported race from chest X-rays, something even the most skilled radiologists cannot do. Building on that finding, the team has now uncovered a troubling correlation: the models that are best at predicting demographics also show the largest “fairness gaps,” that is, discrepancies in diagnostic accuracy across races and genders. This suggests the models may rely on “demographic shortcuts” when making their diagnoses, leading to potential misdiagnoses for women, Black patients, and other marginalized groups.
“High-capacity machine-learning models are known to be strong predictors of demographics such as self-reported race, sex, or age. This paper confirms that capacity and links it to performance disparities across groups, which is a novel connection,” says Marzyeh Ghassemi, an associate professor at MIT and the study’s senior author.
The research team also identified ways to make these models fairer. Retraining the models improved fairness, but only when they were tested on patients similar to those they were trained on; when the models were applied to patients from different hospitals, the fairness gaps re-emerged.
“It’s crucial to evaluate any external AI models against your own patient data. Fairness guarantees from model developers may not apply to your population. Furthermore, when feasible, train models on data representative of your patient population,” advises Haoran Zhang, an MIT graduate student and one of the lead authors. The study, which includes contributions from Yuzhe Yang, appears today in Nature Medicine.
Addressing Bias in AI Models
As of May 2024, the FDA had approved 882 AI-enabled medical devices, 671 of them designed for use in radiology. Since Ghassemi and her colleagues showed in 2022 that these models can predict race, researchers have also demonstrated that they can predict gender and age, even though they were never explicitly trained to do so.
“Many state-of-the-art machine-learning models are superhuman at demographic prediction, picking up on features that even experienced radiologists cannot detect in X-rays,” Ghassemi notes. “While these models are good at predicting disease, they also learn to predict other attributes along the way, which may not be desirable.”
The study set out to examine why these models are less accurate for certain demographic groups. In particular, the team wanted to determine whether the models were taking demographic shortcuts, leaning on demographic information as a proxy when making diagnoses rather than on the disease-relevant features of the X-rays themselves.
To conduct the study, the researchers used publicly available chest X-ray datasets from Beth Israel Deaconess Medical Center in Boston and trained models to detect three conditions: pulmonary edema (fluid buildup in the lungs), pneumothorax (collapsed lung), and cardiomegaly (an enlarged heart). When they tested the models on X-rays held out from training, overall performance was solid, but most models showed significant fairness gaps, that is, differences in diagnostic accuracy between men and women and between white and Black patients.
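To make the idea of a fairness gap concrete, here is a minimal sketch of one way such a gap can be measured: compute a performance metric separately for each demographic group and take the spread between the best- and worst-served groups. The metric (AUROC) and all names below are illustrative choices, not the authors’ actual evaluation code, and the study may also rely on other measures such as per-group error rates.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_group_auroc(y_true, y_score, group):
    """Diagnostic AUROC computed separately for each demographic group."""
    return {g: roc_auc_score(y_true[group == g], y_score[group == g])
            for g in np.unique(group)}

def fairness_gap(y_true, y_score, group):
    """Spread between the best- and worst-served groups on the per-group metric."""
    scores = per_group_auroc(y_true, y_score, group)
    return max(scores.values()) - min(scores.values())

# Illustrative usage with synthetic labels, model scores, and a binary group attribute.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                                # ground-truth diagnosis
y_score = np.clip(0.3 * y_true + rng.normal(0.5, 0.25, 1000), 0, 1)   # model output
group = rng.choice(["group_a", "group_b"], size=1000)                 # demographic attribute
print(per_group_auroc(y_true, y_score, group))
print(f"fairness gap: {fairness_gap(y_true, y_score, group):.3f}")
```

On real data, the same comparison would be drawn between men and women and between white and Black patients, as in the study.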
The models could also predict demographic attributes such as gender, race, and age with considerable accuracy. Notably, the better a model was at predicting demographics, the larger its fairness gap, suggesting that the models were using demographic categorizations as shortcuts when making their disease predictions.
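That link can be pictured as a simple correlation across a set of trained models: pair each model’s demographic-prediction performance with its diagnostic fairness gap and check whether the two rise together. The numbers below are invented purely to show the calculation; they are not results from the paper.

```python
from scipy.stats import pearsonr

# Hypothetical values for five models: AUROC when predicting a demographic
# attribute (e.g., race) and each model's diagnostic fairness gap. Not real data.
demographic_auroc = [0.65, 0.72, 0.80, 0.88, 0.93]
fairness_gaps = [0.02, 0.03, 0.05, 0.07, 0.09]

r, p = pearsonr(demographic_auroc, fairness_gaps)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")  # a positive r mirrors the trend described above
```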
The researchers then tried two strategies to shrink the fairness gaps. One optimized for “subgroup robustness,” rewarding models for better performance on their worst-performing subgroup and penalizing them when one group’s error rate exceeded the others’. The other used “group adversarial” methods, which try to remove demographic information from the models’ internal representations. Both strategies substantially narrowed the fairness gaps.
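As a rough illustration of the first strategy, a subgroup-robust objective can be approximated by optimizing the loss of the worst-performing group in each batch, in the spirit of group distributionally robust optimization; the PyTorch sketch below is an assumption about how such an objective could look, not the authors’ implementation. The group-adversarial approach, by contrast, typically adds an auxiliary head that tries to predict demographics from the model’s features and penalizes the main model when it succeeds; it is omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def worst_group_loss(logits, labels, group_ids):
    """Subgroup-robust surrogate: average the per-sample loss within each
    demographic group, then return the largest group loss so that training
    pressure falls on the group the model currently serves worst."""
    per_sample = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    group_losses = [per_sample[group_ids == g].mean() for g in torch.unique(group_ids)]
    return torch.stack(group_losses).max()

# Illustrative usage with random tensors standing in for one batch of predictions.
logits = torch.randn(32)                       # model outputs for, say, cardiomegaly
labels = torch.randint(0, 2, (32,)).float()    # ground-truth diagnosis
group_ids = torch.randint(0, 2, (32,))         # demographic group per patient
print(worst_group_loss(logits, labels, group_ids))
```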
Challenges in Generalization
However, these strategies lost much of their effectiveness when the models were applied to patients from different hospitals. The debiased models maintained high overall accuracy, but their fairness gaps returned when they were tested on data from outside their original training set.
“Debiasing a model within one population doesn’t guarantee the same level of fairness when it is applied to a different population,” Zhang explains. “This is concerning, because hospitals often use models developed on data from other hospitals.”
The study also found that the models that are fairest on their original data are not necessarily the fairest when applied to new populations, which poses a significant challenge for real-world deployment. The team is now developing additional methods to help models make fair predictions across diverse groups and across new datasets.
The study underscores the importance of evaluating AI diagnostic tools on the specific patient populations where they will be used, to ensure accurate results for all demographics. The research was supported in part by a Google Research Scholar Award and the National Institutes of Health.
Photo credit & article inspired by: Massachusetts Institute of Technology