
Comparative analysis of large language models as decision support tools in oral pathology

  • Valentina Ignacia Alvarez-Silberberg
  • Camila Paz Alvarez-Silberberg
  • Cosimo Galletti
  • Javier Flores-Fraile
  • Valeria Ramirez
  • Cristian Bravo Palma
  • Victor Gil-Manich
  • Luca Fiorillo
  • Vini Mehta*
  • Maria Teresa Fernández-Figueras
  • Maria Cuevas-Nunez
  • *Corresponding author for this work

Research output: Contribution to journal › Article › Peer-reviewed

1 Citation (Scopus)

Abstract

This study evaluated the performance of four chatbots based on large language models (LLMs) (ChatGPT-4.0, ChatGPT o1-preview, Gemini, and Meta AI) as decision-support systems for interpreting histopathologic descriptions of oral lesions, assessing agreement between their outputs and reference diagnoses. For each case, the models generated a suggested primary interpretation and three differential diagnoses. Outputs were categorized as Different, Similar, or Correct relative to the consensus reference diagnosis established by two board-certified pathologists. Statistical analyses included the Friedman test to compare model performance, Wilcoxon signed-rank tests for pairwise comparisons, Cohen’s κ to assess agreement, and regression analyses to evaluate the influence of patient age and sex. Differential diagnosis performance was also analyzed. ChatGPT o1-preview produced the highest proportion of outputs concordant with the reference diagnosis (68.6%), followed by Meta AI (65.7%), ChatGPT-4.0 (59.8%), and Gemini (27.5%). In terms of agreement with oral pathologists, ChatGPT o1-preview (κ = 0.66) and Meta AI (κ = 0.63) showed substantial agreement, ChatGPT-4.0 showed moderate agreement (κ = 0.57), and Gemini showed poor agreement (κ = 0.24). Increasing patient age was associated with a mild but statistically significant reduction in performance for ChatGPT-4.0, Meta AI, and Gemini, whereas no significant age effect was observed for ChatGPT o1-preview; patient sex had no significant effect. Among the evaluated chatbots, ChatGPT o1-preview showed the highest alignment with the oral pathologists’ reference diagnoses. These findings support a potential role for LLMs as complementary decision-support tools for interpreting descriptions of oral histopathology, while highlighting substantial inter-model variability and the need for cautious implementation with continued human oversight.
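Cohen’s κ, used above to grade agreement with the pathologists, corrects the observed agreement p_o for the agreement p_e expected by chance from each rater’s label frequencies: κ = (p_o − p_e) / (1 − p_e). The abstract does not state exactly which label sets were compared, so the following is only a minimal sketch, assuming two raters (for example, a hypothetical reference list and a hypothetical model list) assigning one nominal label to each of the same cases. The function name `cohens_kappa` and the example diagnoses are illustrative assumptions, not the authors’ code or data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items on which both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies, summed over labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label for every item; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical example: a reference diagnosis list versus one model's primary interpretations.
reference = ["SCC", "leukoplakia", "fibroma", "SCC", "mucocele", "fibroma"]
model = ["SCC", "leukoplakia", "SCC", "SCC", "mucocele", "papilloma"]
kappa = cohens_kappa(reference, model)  # p_o = 4/6, p_e = 8/36, so kappa = 4/7
```

In this toy example the raters agree on 4 of 6 cases, but because "SCC" is frequent in both lists part of that agreement is expected by chance, so κ (about 0.57) is lower than the raw 67% agreement. For real analyses, established implementations such as scikit-learn’s `cohen_kappa_score` are preferable to a hand-rolled helper.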

Original language: English
Article number: 11272
Journal: Scientific Reports
Volume: 16
Issue: 1
DOI
Status: Published - 27 Feb 2026

Bibliographic note

© 2026. The Author(s).
