
Comparative analysis of large language models as decision support tools in oral pathology

  • Valentina Ignacia Alvarez-Silberberg
  • Camila Paz Alvarez-Silberberg
  • Cosimo Galletti
  • Javier Flores-Fraile
  • Valeria Ramirez
  • Cristian Bravo Palma
  • Victor Gil-Manich
  • Luca Fiorillo
  • Vini Mehta*
  • Maria Teresa Fernández-Figueras
  • Maria Cuevas-Nunez
  • *Corresponding author for this work

Research output: Contribution to journal, Article, peer-reviewed

1 Scopus citation

Abstract

This study evaluated the performance of four large language model (LLM)-based chatbots (ChatGPT-4.0, ChatGPT o1-preview, Gemini, and Meta AI) as decision-support systems for interpreting histopathologic descriptions of oral lesions, assessing agreement between their outputs and the reference diagnoses. Each chatbot generated a suggested primary interpretation and three differential diagnoses. Outputs were categorized as Different, Similar, or Correct relative to the consensus reference diagnosis established by two board-certified pathologists. Statistical analyses included the Friedman test to compare model performance, Wilcoxon signed-rank tests for pairwise comparisons, Cohen’s κ to assess agreement, and regression analyses to evaluate the influence of age and sex. Differential diagnosis performance was also analyzed. ChatGPT o1-preview demonstrated the highest proportion of outputs concordant with the reference diagnosis (68.6%), followed by Meta AI (65.7%), ChatGPT-4.0 (59.8%), and Gemini (27.5%). In terms of agreement with oral pathologists, ChatGPT o1-preview (κ = 0.66) and Meta AI (κ = 0.63) showed substantial agreement, ChatGPT-4.0 demonstrated moderate agreement (κ = 0.57), and Gemini showed poor agreement (κ = 0.24). Increasing patient age was associated with a mild but statistically significant reduction in model performance for ChatGPT-4.0, Meta AI, and Gemini, while no significant age effect was observed for ChatGPT o1-preview; patient sex had no significant impact. Among the evaluated chatbots, ChatGPT o1-preview showed the highest alignment with oral pathologists’ reference diagnoses. These findings support the potential role of LLMs as complementary decision-support tools for interpreting oral histopathology descriptions, while highlighting substantial inter-model variability and the need for cautious implementation with continued human oversight.
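
For readers unfamiliar with Cohen’s κ, the agreement statistic reported in the abstract, the following is a minimal sketch of how it is computed for two sets of categorical labels. It is not the authors' analysis code: the function name `cohen_kappa` and the six toy labels are hypothetical and purely illustrative, and the sketch assumes κ is computed between a chatbot's primary interpretation and the reference diagnosis for each case.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length sequences of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of labels")
    n = len(rater_a)

    # Observed agreement: fraction of cases where both labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: sum over categories of the product of the
    # two raters' marginal proportions for that category.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label for every case.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy, made-up labels (NOT data from the study).
reference = ["fibroma", "fibroma", "lichen planus",
             "lichen planus", "mucocele", "mucocele"]
chatbot = ["fibroma", "fibroma", "lichen planus",
           "mucocele", "mucocele", "fibroma"]

kappa = cohen_kappa(reference, chatbot)
print(round(kappa, 3))  # prints 0.5
```

In this toy case the raw agreement is 4/6 and the chance agreement is 12/36, so κ = (0.667 − 0.333) / (1 − 0.333) = 0.5. This shows why κ is lower than the simple proportion of matching outputs.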

Original language: English
Article number: 11272
Journal: Scientific Reports
Volume: 16
Issue number: 1
State: Published - 27 Feb 2026

Bibliographical note

© 2026. The Author(s).

Keywords

  • Artificial intelligence
  • ChatGPT
  • Chatbot
  • Gemini
  • Large language models
  • Meta AI
  • Oral and maxillofacial pathology
  • Diagnosis, Differential
  • Humans
  • Middle Aged
  • Decision Support Techniques
  • Male
  • Decision Support Systems, Clinical
  • Young Adult
  • Language
  • Adolescent
  • Female
  • Adult
  • Aged
  • Pathology, Oral/methods
  • Large Language Models
