Google's AI beat humans at diagnosing eye disease—but how? Here's a peek inside the 'black box.'

Read Advisory Board's take on this story.

A new artificial intelligence (AI) system from British researchers and Google DeepMind is as good as doctors at diagnosing certain eye diseases, according to a study published last month in Nature Medicine—and what's more, it can show users how it reaches its conclusions, offering a look into the so-called "black box," Casey Ross reports for STAT News.


The 'black box' dilemma

AI is drawing interest throughout the health care sector. Big names such as Apple, Amazon, IBM, and Microsoft are working on products for health care, Ross reports. Hospitals are eyeing AI as a means to boost care coordination, diagnostics, scheduling, and treatments. And drugmakers are leveraging AI to accelerate product development.

But there's a major obstacle to health care's adoption of AI: The technology lacks the ability to tell humans how it reaches its conclusions. This is often called the "black box" problem, according to Ross.

Study details

For the study, researchers tested Google's DeepMind system on historical optical coherence tomography (OCT) scans from 997 patients at Moorfields Eye Hospital in London. The researchers compared the system's accuracy in diagnosing 50 conditions and triaging patients under Moorfields' referral system with that of eight human experts who also examined the scans.

According to Ross, the DeepMind system uses "a novel architecture" that consists of two neural networks. One translates raw OCT scans into a tissue map; the other analyzes the map to detect signs of eye disease. The system lets users view a video of the sections of a scan it used to reach its conclusions, as well as the confidence levels it assigns to each possible diagnosis.

The system can identify 50 different eye diseases based on imaging data. It looks at 3-D scans, which means it can process more data than prior AI systems that use 2-D scans, according to Ross.
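
To make that two-stage design concrete, here is a minimal sketch in Python/PyTorch of a segmentation-then-classification pipeline. It is illustrative only, not DeepMind's published model: the layer sizes, tissue-class count, and module names are assumptions, and only the 50-diagnosis output mirrors the study's description.

```python
# Illustrative two-stage sketch only -- NOT DeepMind's published model.
# Layer sizes, tissue-class count, and names are assumptions.
import torch
import torch.nn as nn

NUM_TISSUE_TYPES = 15   # hypothetical number of tissue classes in the map
NUM_DIAGNOSES = 50      # the system scores roughly 50 eye conditions


class SegmentationNet(nn.Module):
    """Stage 1: translate a raw 3-D OCT volume into a tissue map."""

    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(16, NUM_TISSUE_TYPES, kernel_size=1),
        )

    def forward(self, volume):           # volume: (batch, 1, depth, H, W)
        return self.body(volume)         # per-voxel tissue-class scores


class ClassificationNet(nn.Module):
    """Stage 2: read the tissue map and score each possible diagnosis."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(NUM_TISSUE_TYPES, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(32, NUM_DIAGNOSES)

    def forward(self, tissue_map):
        x = self.features(tissue_map).flatten(1)
        return self.head(x).softmax(dim=1)   # confidence level per diagnosis


# Chain the two stages on a dummy scan volume.
scan = torch.randn(1, 1, 16, 64, 64)             # (batch, channel, depth, H, W)
tissue_map = SegmentationNet()(scan)             # human-inspectable intermediate
confidences = ClassificationNet()(tissue_map)    # one probability per condition
```

Splitting the work this way is what makes an intermediate tissue map available for humans to inspect, rather than jumping straight from raw pixels to a diagnosis.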

Study findings

The researchers found that when it came to the most urgent referrals, the AI system performed as well as the two best retinal specialists and outperformed two other specialists and four optometrists.

Overall, the AI system's error rate was 5.5%, lower than that of either of the top two human specialists, whose error rates were 6.7% and 6.8%, respectively.

The researchers found several specialists performed almost as well as the AI system when they could review patient notes and other supplemental materials.

What about adoption?

Ross reports that while the system's performance is "promising," it won't be adopted "any time soon in hospitals or eliminate the need for human specialists to review scans." The study authors said the system needs to be refined and tested, including via randomized controlled trials, before it reaches the clinical setting. Even then, Ross reports, the system would need some human oversight.

Pearse Keane, an ophthalmologist at Moorfields Eye Hospital and a co-author of the study, said, "The key thing is that we do prospective studies in multiple different locations before we actually let this loose on patients." He added, "We all think this technology could be transformative, but we also acknowledge that it's not magic and we have to apply the same level of rigor to it that we would apply to any intervention."

John Miller, an ophthalmologist at Massachusetts Eye and Ear Hospital who was not involved in the study, said the biggest question about the DeepMind system is who would use it: primary care doctors, specialists, or pharmacists?

Miller said the system could afford benefits in all of those settings, explaining that at the primary care level, it could streamline referrals. "If we can be confident in a system that can identify retinal-specific disease at an early stage, that can prompt an earlier appointment for the patient and potentially save sight."

Miller added that the system might also save time, money, and patient anxiety over a misdiagnosis. "I think it's going to help us see more of the right types of patients, instead of screening some patients without the suggested disease .... I view it as augmenting my practice, not threatening it" (Ross, STAT News, 8/13).

Advisory Board's take

Greg Kuhnen, Senior Director

Artificial intelligence (AI) and machine learning have made remarkable progress on a broad range of health care decision-making tasks, from better readmissions predictions to financial forecasting to acute clinical diagnoses. However, one of the biggest debates around deploying these algorithms on the front lines is their “black box” nature—in many cases, no one can fully explain or predict the behavior of a system developed using machine learning.

Is this a problem? On one hand, medicine is full of examples of drugs, diagnostics, and heuristics we use widely but don't truly understand. We don't know the exact pathways that make Tylenol or lithium work, but we understand their efficacy and risks enough to set guidelines about when they should and shouldn't be used. Therefore, some argue we should simply accept the black box nature of AI, and measure it using the same type of randomized controlled trials that are the gold standard in pharmaceutical research.


The concern with AI is its blind spots. Our progress on AI has mostly been on creating narrowly focused, task-based algorithms—such as reading images and comparing them to a library of conditions—that often don't have any broader concept of who the patient is or the environment they receive care in. These algorithms work well when presented with neatly packaged cases similar to what they were trained on, but can fail—sometimes in spectacular ways—when shown something outside of their experience.

Google DeepMind and others doing similar work in "explainability" aim to split the difference. While the models they produce may not be able to explain themselves fully, they can provide strong clues about which parts of an image or attributes of a patient's chart were the most important factors in drawing their conclusions. These clues help human supervisors and governance committees build confidence that the system's decisions are sensible, and help identify when an AI system is going off the rails.
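
As one illustration of the kind of clue such systems can surface, the sketch below computes a simple gradient-based saliency map for a toy classifier. This is a generic attribution technique, not the specific method used in the DeepMind study; the model, input size, and 50-class output are assumptions for demonstration.

```python
# Generic gradient-saliency sketch -- one common "explainability" technique,
# not the specific attribution method used in the DeepMind study.
# The toy model, input size, and 50-class output are assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(                      # stand-in image classifier
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 50),                       # hypothetical 50 diagnoses
)

image = torch.randn(1, 1, 64, 64, requires_grad=True)   # dummy scan slice
scores = model(image)
top_class = scores.argmax().item()          # the model's leading diagnosis

# Gradient of the top-scoring diagnosis with respect to the input pixels:
scores[0, top_class].backward()
saliency = image.grad.abs().squeeze()       # large values mark the pixels that
                                            # most influenced the conclusion
```

Regions where the gradient magnitude is large are the pixels the model leaned on most heavily, which is the sort of evidence a human supervisor or governance committee could review alongside the scan itself.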

