Unequal outcomes: Tackling bias in clinical AI models
A new study by SRI Graduate Affiliate Michael Colacci sheds light on the frequency of biased outcomes against different sociodemographic groups when machine learning algorithms are used in healthcare contexts, advocating for more comprehensive and standardized approaches to evaluating AI systems. (Image: Vimal Saran/Unsplash)
The promise of artificial intelligence (AI) in healthcare is one of precision, efficiency, and innovation, offering game-changing new opportunities for diagnosing diseases more accurately and optimizing delivery of care. However, a troubling reality underpins this optimism: AI tools are not immune to bias, raising concerns about equity.
A newly published scoping review in the Journal of Clinical Epidemiology by SRI Graduate Affiliate Michael Colacci and co-authors at the University of Toronto and St. Michael’s Hospital sheds light on how frequently ML algorithms in healthcare contexts generate biased outcomes. Their findings paint a stark picture: nearly 75% of the studies reviewed reported bias, predominantly against disadvantaged groups.
An internal medicine physician completing a PhD in clinical epidemiology at U of T’s Institute of Health Policy, Management, and Evaluation, Colacci focuses his research on the safe and equitable development and deployment of ML tools in medicine. The study forms part of the research he conducted as a 2023–24 SRI Graduate Fellow.
“As clinical AI tools become increasingly integrated into patient care, it is essential to evaluate not only their overall performance but also their effectiveness across diverse patient populations,” says Colacci.
Charting the prevalence of bias in clinical ML models
Colacci and his co-authors examined 91 studies that evaluated bias in ML models designed for clinical care. The results are sobering but instructive, with nearly three-quarters of the studies identifying evidence of bias, and the majority of these biases disproportionately affecting disadvantaged groups.
Race, gender, and age were the most commonly evaluated factors, with race appearing in nearly two-thirds of studies. Yet the narrow focus of these evaluations leaves significant gaps. Other factors, such as income level, language, or the intersections of multiple identities, were rarely considered. Colacci emphasizes that this limited scope risks overlooking more deeply embedded social inequities.
The study demonstrates that many biases stem from training datasets that fail to capture the diversity of real-world populations. This underrepresentation leads algorithms to perform poorly for the groups left out: an ML tool trained primarily on data from urban, affluent patients, for instance, may struggle to accurately predict outcomes for rural, low-income individuals. Worse still, some algorithms inadvertently amplify existing inequities when the outcomes they are trained to predict reflect biased practices, such as unequal patterns of healthcare resource allocation. This phenomenon, known as data label bias, encodes systemic disparities and reinforces them through algorithmic decisions.
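To make the kind of gap described above concrete, the sketch below (in Python, using synthetic data and hypothetical feature names; it is not code from the study) shows the basic move behind many of the bias evaluations the review surveys: report a model’s performance separately for each subgroup rather than only in aggregate.

```python
# Illustrative sketch only: synthetic data, hypothetical columns ("urban"/"rural").
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "age": rng.normal(60, 12, n),
    "lab_value": rng.normal(1.0, 0.3, n),
})
group = pd.Series(rng.choice(["urban", "rural"], size=n, p=[0.9, 0.1]))
y = pd.Series((X["lab_value"] + rng.normal(0, 0.3, n) > 1.1).astype(int))

model = LogisticRegression().fit(X, y)

# Per-subgroup AUROC: a large gap between groups is one signal of the kind
# of bias the review documents.
for name, idx in group.groupby(group).groups.items():
    proba = model.predict_proba(X.loc[idx])[:, 1]
    print(f"{name:>6}: n={len(idx):4d}  AUROC={roc_auc_score(y.loc[idx], proba):.3f}")
```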
In addition to issues with training data, the intrinsic design of ML models can contribute to bias. Some algorithms prioritize variables or features that inadvertently disadvantage certain populations. For example, socioeconomic factors may be weighted in ways that penalize those already marginalized. Colacci’s review makes it clear that these issues span a wide range of clinical applications, from disease identification and imaging analysis to resource allocation and prognostic models.
SRI Graduate Affiliate Michael Colacci researches the safe and equitable development and deployment of clinical ML tools.
Building new strategies for equitable healthcare
Colacci’s research is a call to action, offering a roadmap for building fair and equitable AI in healthcare. His study examines strategies to mitigate bias, such as reweighting data to improve the representation of disadvantaged groups and recalibrating algorithms to account for variability in performance across populations.
Adjusting the type of model used, for example, has proven particularly effective in addressing bias in certain contexts. While no single solution is universally applicable, Colacci views this as an opportunity for innovation. The key, he argues, lies in tailoring mitigation strategies to specific challenges and continuing to refine these approaches through research and practice.
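As a rough illustration of what the reweighting strategy mentioned above can look like in code (a minimal sketch on synthetic data, not the procedure of any study in the review), the snippet below gives each patient a weight inversely proportional to the size of their subgroup, so a small minority group is not drowned out during training.

```python
# Minimal sketch of reweighting on synthetic data; variable names are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))                      # synthetic clinical features
group = rng.choice(["majority", "minority"], size=n, p=[0.95, 0.05])
y = (X[:, 0] + rng.normal(0, 0.5, n) > 0).astype(int)

# Weight each patient by the inverse of their subgroup's share of the data,
# so the minority group contributes proportionally more to the training loss.
values, counts = np.unique(group, return_counts=True)
share = dict(zip(values, counts / n))
weights = np.array([1.0 / share[g] for g in group])

model = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights)
```

Reweighting changes how a model is trained; the recalibration strategy the review also describes instead adjusts a trained model’s outputs, for example by refitting its predicted probabilities within each subgroup.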
Colacci’s optimism is grounded in both his medical practice and his research. As a physician, he sees the transformative potential of AI to enhance patient care. As a researcher, he is committed to ensuring that these technologies work for everyone, not just the majority.
“AI has the potential to democratize healthcare,” he says, “but only if we design it to address the needs of all populations, especially those who have historically been underserved.”
The study also underscores the importance of transparency and collaboration. Colacci believes that healthcare providers, AI developers, and policymakers must work together to ensure that ML tools are developed and deployed responsibly. This includes creating diverse and representative datasets, implementing robust fairness evaluations, and engaging with communities to understand the real-world implications of these technologies. Such efforts, he argues, are essential not only for improving algorithmic performance but also for building trust among patients and providers.
Looking ahead, Colacci envisions a future where clinical ML tools don’t just diagnose diseases or predict outcomes—they actively reduce disparities and promote equity in healthcare. He sees progress being made, from the increasing inclusion of fairness evaluations in AI research to the development of guidelines that standardize how bias is assessed and mitigated.
“We’re starting to see a shift,” he notes, “but there’s still a lot of work to be done.”
Insights and directions for future research
The implications of the study’s findings are significant for both healthcare providers and AI developers. Clinical ML models are increasingly being deployed in settings where their decisions directly impact patient care.
To address these challenges, Colacci emphasizes the need for more comprehensive and standardized approaches to evaluating bias in clinical AI. Researchers must broaden their focus beyond the most commonly assessed factors of race, gender, and age to include a wider range of sociodemographic characteristics. Intersectional analyses, which examine how overlapping identities such as race and gender interact to produce unique forms of disadvantage, are particularly crucial. Additionally, there is a need for clear guidelines on what constitutes bias and how it should be measured. Such standards would enable more consistent evaluations and facilitate the development of effective mitigation strategies.
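As a sketch of what such an intersectional check might look like in practice (synthetic data and hypothetical attributes, offered only to illustrate the idea), the snippet below computes a metric for every combination of two attributes rather than for each attribute on its own, so a gap that appears only at an intersection is not averaged away.

```python
# Illustrative sketch only: random placeholder labels and predictions.
import numpy as np
import pandas as pd
from sklearn.metrics import recall_score

rng = np.random.default_rng(1)
n = 5000
df = pd.DataFrame({
    "sex": rng.choice(["female", "male"], n),
    "age_band": rng.choice(["<65", ">=65"], n),
    "y_true": rng.integers(0, 2, n),     # placeholder outcome labels
    "y_pred": rng.integers(0, 2, n),     # placeholder model predictions
})

# Sensitivity (true-positive rate) for every sex x age_band combination.
report = (
    df.groupby(["sex", "age_band"])
      .apply(lambda g: recall_score(g["y_true"], g["y_pred"]))
      .rename("sensitivity")
)
print(report)
```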
Equally important is the need to embed equity considerations into the design and deployment of ML models from the outset. This requires diverse and representative datasets, transparent model development processes, and ongoing monitoring of model performance across different populations. Work in this direction is already underway: SRI Faculty Affiliate Marzyeh Ghassemi of MIT was recently awarded a Schmidt Sciences AI2050 Early Career Fellowship for her work on improving the data used to train AI systems and developing tools to monitor their performance.
As the use of AI in healthcare continues to expand, Colacci’s research serves as a vital guide for ensuring these technologies fulfill their promise. His work is a reminder that the benefits of AI must be equitably distributed, and that achieving this goal requires vigilance, innovation, and a commitment to justice.
“The challenges are real,” he acknowledges, “but so are the opportunities. If we get this right, AI can be one of the most powerful tools we have to create a more equitable healthcare system.”