TY - JOUR
T1 - Comparative analysis of large language models as decision support tools in oral pathology
AU - Alvarez-Silberberg, Valentina Ignacia
AU - Alvarez-Silberberg, Camila Paz
AU - Galletti, Cosimo
AU - Flores-Fraile, Javier
AU - Ramirez, Valeria
AU - Palma, Cristian Bravo
AU - Gil-Manich, Victor
AU - Fiorillo, Luca
AU - Mehta, Vini
AU - Fernández-Figueras, Maria Teresa
AU - Cuevas-Nunez, Maria
N1 - © 2026. The Author(s).
PY - 2026/2/27
Y1 - 2026/2/27
N2 - This study evaluated the performance of four large language model-based chatbots (LLMs) (ChatGPT-4.0, ChatGPT o1-preview, Gemini, and Meta AI) as decision-support systems for interpreting histopathologic descriptions of oral lesions, assessing agreement between their outputs and the reference diagnoses of oral pathologists. Each chatbot generated a suggested primary interpretation and three differential diagnoses. Outputs were categorized as Different, Similar, or Correct compared to the consensus reference diagnosis established by two board-certified pathologists. Statistical analyses included the Friedman test to compare model performance, Wilcoxon signed-rank tests for pairwise comparisons, Cohen’s κ to assess agreement, and regression analyses to evaluate the influence of age and sex. Differential diagnosis performance was also analyzed. ChatGPT o1-preview demonstrated the highest proportion of outputs concordant with the reference diagnosis (68.6%), followed by Meta AI (65.7%), ChatGPT-4.0 (59.8%), and Gemini (27.5%). In terms of agreement with oral pathologists, ChatGPT o1-preview (κ = 0.66) and Meta AI (κ = 0.63) showed substantial agreement, ChatGPT-4.0 demonstrated moderate agreement (κ = 0.57), and Gemini showed poor agreement (κ = 0.24). Increasing patient age was associated with a mild but statistically significant reduction in model performance for ChatGPT-4.0, Meta AI, and Gemini, while no significant age effect was observed for ChatGPT o1-preview; patient sex had no significant impact. Among the evaluated chatbots, ChatGPT o1-preview showed the highest alignment with oral pathologists’ reference diagnoses. These findings support the potential role of LLMs as complementary decision-support tools for interpreting oral histopathology descriptions, while highlighting substantial inter-model variability and the need for cautious implementation with continued human oversight.
AB - This study evaluated the performance of four large language model-based chatbots (LLMs) (ChatGPT-4.0, ChatGPT o1-preview, Gemini, and Meta AI) as decision-support systems for interpreting histopathologic descriptions of oral lesions, assessing agreement between their outputs and the reference diagnoses of oral pathologists. Each chatbot generated a suggested primary interpretation and three differential diagnoses. Outputs were categorized as Different, Similar, or Correct compared to the consensus reference diagnosis established by two board-certified pathologists. Statistical analyses included the Friedman test to compare model performance, Wilcoxon signed-rank tests for pairwise comparisons, Cohen’s κ to assess agreement, and regression analyses to evaluate the influence of age and sex. Differential diagnosis performance was also analyzed. ChatGPT o1-preview demonstrated the highest proportion of outputs concordant with the reference diagnosis (68.6%), followed by Meta AI (65.7%), ChatGPT-4.0 (59.8%), and Gemini (27.5%). In terms of agreement with oral pathologists, ChatGPT o1-preview (κ = 0.66) and Meta AI (κ = 0.63) showed substantial agreement, ChatGPT-4.0 demonstrated moderate agreement (κ = 0.57), and Gemini showed poor agreement (κ = 0.24). Increasing patient age was associated with a mild but statistically significant reduction in model performance for ChatGPT-4.0, Meta AI, and Gemini, while no significant age effect was observed for ChatGPT o1-preview; patient sex had no significant impact. Among the evaluated chatbots, ChatGPT o1-preview showed the highest alignment with oral pathologists’ reference diagnoses. These findings support the potential role of LLMs as complementary decision-support tools for interpreting oral histopathology descriptions, while highlighting substantial inter-model variability and the need for cautious implementation with continued human oversight.
KW - Artificial intelligence
KW - ChatGPT
KW - Chatbot
KW - Gemini
KW - Large language models
KW - Meta AI
KW - Oral and maxillofacial pathology
KW - Diagnosis, Differential
KW - Humans
KW - Middle Aged
KW - Decision Support Techniques
KW - Male
KW - Decision Support Systems, Clinical
KW - Young Adult
KW - Language
KW - Adolescent
KW - Female
KW - Adult
KW - Aged
KW - Pathology, Oral/methods
UR - https://www.scopus.com/pages/publications/105034975309
UR - https://www.mendeley.com/catalogue/cdec8c32-70c4-37cb-abc3-6589ffc0c15a/
U2 - 10.1038/s41598-026-41533-z
DO - 10.1038/s41598-026-41533-z
M3 - Article
C2 - 41760843
AN - SCOPUS:105034975309
SN - 2045-2322
VL - 16
JO - Scientific Reports
JF - Scientific Reports
IS - 1
M1 - 11272
ER -