Large language models like ChatGPT could help physicians handle patient communication and questions about AMD and related retinal topics, but experts caution that these responses would still need to be edited for accuracy, which also takes time. Photo: NEI.

The internet has long provided answers of varying accuracy to patients' many health-related queries, and now artificial intelligence models like ChatGPT are in the mix too. How good is this information, though? New research published in Ophthalmology Science suggests AI-generated answers have potential. Researchers assessed the quality, safety and empathy of responses to common retina patient questions written by human experts, generated by AI or generated by AI and then edited by human experts. They concluded that clinical settings might make good use of AI responses.

In the masked, multicenter study, researchers randomly assigned 21 common retina patient questions among 13 retina specialists. A few examples include the following:

  • What causes age-related macular degeneration?
  • How long do I need to keep getting anti-VEGF injections?
  • Can I pass AMD to my children?
  • How long can I go between eye injections?
  • Is there a good treatment for floaters?

Each expert created a response to each assigned question and then edited a response generated by the large language model (LLM) ChatGPT-4, timing themselves for both tasks. Five LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing and Bard) also generated responses to each of the 21 questions. Other experts not involved in this initial process evaluated the responses, grading them for quality and empathy (very poor, poor, acceptable, good or very good) and for safety (incorrect information, likelihood to cause harm, extent of harm and missing content).

The researchers collected 4,008 grades (2,608 for quality and empathy and 1,400 for safety metrics). They reported significant differences in quality and empathy among the three groups: LLM alone, expert alone and expert+AI. The expert+AI responses performed best overall in terms of quality, with ChatGPT-3.5 as the top-performing LLM. ChatGPT-3.5 had the highest mean empathy score, followed by expert+AI. Expert responses placed fourth out of seven for quality and sixth out of seven for mean empathy score, according to the study. Expert+AI responses significantly exceeded expert responses for both quality and empathy.

The researchers reported time savings for expert+AI responses vs. expert-created responses. They also found that ChatGPT-4 performed similarly to experts for inappropriate content, missing content, extent of possible harm and likelihood of possible harm. Of note, however, the authors suggested that the length of a response may have influenced the likelihood of its containing inappropriate material or material with the potential to lead to harm. “Busy physicians will need to take the time to proofread longer LLM responses to mitigate possible harm and limit incorrect content,” the researchers pointed out in their paper.

“Overall, these data indicate that expert-edited LLM can perform better in both quality and empathy of responses compared to answers generated by human experts alone while providing valuable time savings, thereby improving patient education and communication,” the researchers concluded in their Ophthalmology Science paper. “A natural next step would be testing an editable LLM-generated draft to patient messages, thus reaping the benefits of improved quality, empathy and practice efficiency.”

Tailor PD, Dalvin LA, Chen JJ, et al. A comparative study of responses to retina questions from either experts, expert-edited large language models (LLMs) or LLMs alone. Ophthalmology Science 2024. [Epub ahead of print].