ChatGPT Health, a new health-focused chatbot from OpenAI, underestimated the severity of medical emergencies more than half the time in a recent study published in Nature Medicine. So which AI chatbots are the most accurate when it comes to medical advice?
According to OpenAI, one in four of its users submits a healthcare-related prompt to its AI chatbot ChatGPT every week, and over 40 million people ask ChatGPT healthcare-related questions every day.
In January, OpenAI announced the launch of ChatGPT Health, which will allow users to upload their medical records and connect data from wellness apps like Apple Health, Function, and MyFitnessPal.
According to OpenAI, ChatGPT Health was developed over a two-year period with input from more than 260 physicians across dozens of medical specialties and 60 countries. The clinicians provided feedback on model outputs more than 600,000 times, which helped shape how ChatGPT Health communicates health information, prioritizes safety, and encourages users to follow up with clinicians.
For the study, researchers fed ChatGPT Health 60 medical scenarios, each with 16 variations that changed things like patients' race and gender. The researchers then compared the chatbot's responses with the responses of three physicians who also reviewed the scenarios and triaged each one based on medical guidelines and clinical expertise.
According to Ashwin Ramaswamy, lead author on the study and an instructor of urology, the variations were designed to "produce the exact same result," meaning that an emergency case involving a man should still be classified as an emergency if the patient were a woman.
The study found that ChatGPT Health "under-triaged" 51.6% of emergency cases, recommending the patient see a doctor within 24 to 48 hours rather than recommending they go to the emergency department (ED).
The emergencies included a patient with a life-threatening diabetes complication called diabetic ketoacidosis and a patient going into respiratory failure, both of which can be fatal if left untreated.
In cases like the impending respiratory failure, ChatGPT Health seemed to be "waiting for the emergency to become undeniable" before recommending the ED, Ramaswamy said.
The chatbot also fell short in scenarios involving suicidal ideation or self-harm, the study found. When a user expresses suicidal intent, ChatGPT is supposed to refer them to 988, the suicide and crisis hotline. According to a spokesperson for OpenAI, ChatGPT Health works the same way.
However, in the study, ChatGPT Health referred users to 988 when it wasn't necessary and failed to refer them when it was.
"We tested ChatGPT Health with a 27-year-old patient who said he'd been thinking about taking a lot of pills," Ramaswamy said. When the patient described his symptoms alone, a banner linking to suicide help services appeared.
"Then we added normal lab results," Ramaswamy said. "Same patient, same words, same severity. The banner vanished. Zero out of 16 attempts. A crisis guardrail that depends on whether you mentioned your labs is not ready, and it's arguably more dangerous than having no guardrail at all, because no one can predict when it will fail."
"If you're experiencing respiratory failure or diabetic ketoacidosis, you have a 50/50 chance of this AI telling you it's not a big deal."
Compared to the doctors in the study, ChatGPT Health over-triaged 64.8% of nonurgent cases, recommending the patient see a doctor when it wasn't necessary. For example, the chatbot told a patient with a three-day sore throat to see a doctor within the next 24 to 48 hours when at-home care was sufficient. In addition, ChatGPT Health was almost 12 times more likely to downplay symptoms when the patient in the scenario said a "friend" had suggested it was nothing serious.
The study did find that textbook emergencies with unmistakable symptoms like stroke were correctly triaged 100% of the time. It also found no significant difference in the results based on demographic changes.
Alex Ruani, a doctoral researcher in health misinformation with University College London, described the results of the study as "unbelievably dangerous."
"If you're experiencing respiratory failure or diabetic ketoacidosis, you have a 50/50 chance of this AI telling you it's not a big deal," he said. "What worries me most is the false sense of security these systems create. If someone is told to wait 48 hours during an asthma attack or diabetic crisis, that reassurance could cost them their life."
John Mafi, an associate professor of medicine and a primary care physician at UCLA Health, said more testing is necessary on chatbots that can make health decisions.
"The message of this study is that before you roll something like this out, to make life-affecting decisions, you need to rigorously test it in a controlled trial, where you're making sure that the benefits outweigh the harms," he said.
A spokesperson for OpenAI said the company welcomes research on the use of AI in healthcare but added that the new study didn't reflect how ChatGPT Health is typically used or how it's designed to function. The spokesperson said the chatbot is designed for users to ask follow-up questions and provide more context in medical situations, rather than to give a single response to a medical scenario.
New research from a team at Stanford University looked at how accurate 31 AI tools were at giving medical advice, ranging from major commercial AI programs to open-source systems to specialized medical AI platforms. The team built a database of 100 real physician-to-specialist consultation cases drawn from Stanford Health Care's electronic consult systems.
For each case, 29 board-certified specialist and subspecialist physicians reviewed possible actions that an AI might recommend. Each action was then rated based on its clinical appropriateness and the potential harm of recommending, or failing to recommend, it.
The top-performing AI tool was AMBOSS LiSA 1.0, a retrieval-augmented AI system built on a medical knowledge base. Its recommendations matched the physician-labeled correct actions 62.3% of the time.
AMBOSS LiSA 1.0 was followed by Gemini 2.5 Pro (59.9%), Glass Health 4.0 (59%), GPT-5 (58.3%), and Gemini 2.5 Flash (58.2%).
Ethan Goh, executive director of ARISE, an AI research network, said that in many cases, AI can provide safe health and medical advice, but it should never be used as a substitute for a physician's advice.
"The reality is chatbots can be helpful for a vast number of things," he said. "It's really more about being thoughtful and being deliberate and understanding that it also has severe limitations."
Ramaswamy said people should never rely on AI in an emergency and that using it in conjunction with a physician is key to preventing harm.
"If these models get better and better, I can see the benefits of a patient-AI-doctor relationship, especially in rural scenarios, or in areas of global health," he said.
(Ozcan, NBC News, 3/3; Davey, The Guardian, 2/26; Pines, Forbes, 3/4)