Daily Briefing

'Really, really concerning': Experts sound alarm on AI medical biases


Artificial intelligence (AI) has advanced tremendously in recent months, with some research finding that it can create clinical notes on par with those written by medical residents. However, researchers say that healthcare leaders should remain cautious about using AI for medical care since it can still produce problematic and biased results.  

AI may produce biased results in medical tasks

In a new preprint study, researchers inputted several case studies from the New England Journal of Medicine Healer tool into the generative AI model GPT-4 and asked it to provide a list of potential diagnoses and treatment recommendations for each scenario.

The case studies included a range of patient symptoms including chest pain, difficulty breathing, sore throat, and more. Each time, the researchers would change the patient's gender and race to see how GPT-4 would adjust its output.

Overall, GPT-4's answers did not differ significantly between groups, but the model did rank possible diagnoses differently depending on a potential patient's gender or race.

For example, when GPT-4 was told that a female patient had shortness of breath, it ranked panic and anxiety disorder higher on its list of potential diagnoses, which the researchers say reflect known biases in the clinical literature used to train the model.

In addition, when a patient with a sore throat was presented to GPT-4, it made the correct diagnosis (mono) 100% when the patient was white, but only 86% of the time for Black men, 73% for Hispanic men, and 74% for Asian men.

Treatment suggestions also varied by race and gender. For all 10 ED cases presented to the model, it was significantly less likely to recommend a CT scan for a Black patient and less likely to rate cardiovascular stress tests and angiography as being of high importance for women compared to men.

In addition, while some variation is expected — and even wanted — in a list of potential diagnoses, the researchers found that the AI often overestimated the real-world prevalence of certain diseases, which could amplify certain trends when applied to training or clinical practice.

For example, when the researchers asked GPT-4 to generate clinical vignettes of a sarcoidosis patient, the model described a Black woman 98% of the time.

"Sarcoidosis is more prevalent both in African Americans and in women," said Emily Alsentzer, a postdoctoral fellow at Brigham and Women's Hospital and Harvard Medical School and one of the study's authors, "but it's certainly not 98% of all patients."

Commentary

Adam Rodman, co-director of the iMED Initiative at Beth Israel Deaconess Medical Center, said because GPT-4 was trained off human communication, it "shows the same — or maybe even more exaggerated — racial and sex biases as humans."

"Despite years of training these things to be less terrible, they still reflect many of these more subtle biases," he added. "It still reflects the biases of its training data, which is concerning given what people are using GPT for right now."

And if these subtle biases are not checked by physicians using GPT-4, "it's hard to know whether there might be systemic biases in the response that you give to one patient or another," Alsentzer said — as well as whether the model could exacerbate existing health disparities.

Although these types of biases are not surprising to AI researchers, Rodman said "it's really, really concerning" to him. "Things are moving quickly, and doctors need to get on top of this," he added.

"Medical students are using GPT-4 to learn right now," Rodman said, which means they could easily reflect or exaggerate existing biases shown to them by the model. "How are they going to second-guess an LLM [large language model] if they use an LLM to train their own brains?"

In general, researchers say that GPT-4 and similar AI models will need to be improved significantly before they can be applied to patient care management. There will also likely need to be safeguards built into the technology before it's used for clinical decision making.

"No one should be relying on it to make a medical decision at this point," Rodman said. "I hope it hammers home the point that doctors should not be relying on GPT-4 to make management decisions." (Palmer, STAT+ [subscription required], 7/18; Zack et al., medRxiv, 7/17)


Infographic: How to combat AI bias

Learn how to reduce the risk of algorithmic bias in healthcare with this infographic that outlines challenges and steps to take.


SPONSORED BY

INTENDED AUDIENCE

AFTER YOU READ THIS

AUTHORS

TOPICS

Don't miss out on the latest Advisory Board insights

Create your free account to access 1 resource, including the latest research and webinars.

Want access without creating an account?

   

You have 1 free members-only resource remaining this month.

1 free members-only resources remaining

1 free members-only resources remaining

You've reached your limit of free insights

Become a member to access all of Advisory Board's resources, events, and experts

Never miss out on the latest innovative health care content tailored to you.

Benefits include:

Unlimited access to research and resources
Member-only access to events and trainings
Expert-led consultation and facilitation
The latest content delivered to your inbox

You've reached your limit of free insights

Become a member to access all of Advisory Board's resources, events, and experts

Never miss out on the latest innovative health care content tailored to you.

Benefits include:

Unlimited access to research and resources
Member-only access to events and trainings
Expert-led consultation and facilitation
The latest content delivered to your inbox
AB
Thank you! Your updates have been made successfully.
Oh no! There was a problem with your request.
Error in form submission. Please try again.