Abstract
This study examines the reliability and accuracy of large language models (LLMs) for automated classroom observation using the World Bank's TEACH Primary framework. As education systems increasingly explore AI-based tools to scale teacher feedback and professional development, empirical validation of these systems is critical. Using a corpus of 12 primary classroom videos, we compared 8618 AI-generated evaluations from eight LLM endpoints against consensus-based ratings from certified TEACH experts. To account for model stochasticity, each model produced 10 independent evaluations per video–element pair. Reliability was assessed using variability and inter-rater consistency indicators, while accuracy was evaluated using error-based and concordance-based agreement measures. Results show substantial stochastic variability across repeated evaluations, with no model achieving uniformly high reliability across instructional elements. Agreement with expert ratings remained moderate at best. Importantly, reliability and accuracy did not co-vary systematically: models producing more stable scores did not necessarily align better with expert judgments, and models with stronger expert agreement often exhibited higher internal variability. In an exploratory analysis of model justifications, patterns suggest that LLMs tend to prioritize explicit verbal cues over contextual or implicit pedagogical evidence when generating high-inference judgments. These findings highlight structural limitations of current text-based AI observation pipelines and demonstrate that automated classroom observation cannot be treated as a uniform capability. The study provides empirical evidence to inform the design and validation of AI-assisted observation systems that integrate pedagogical expertise, measurement constraints, and complementary human judgment.
| Original language | English |
|---|---|
| Article number | 100612 |
| Journal | Computers and Education: Artificial Intelligence |
| Volume | 10 |
| DOI | |
| Status | Published - Jun 2026 |
Bibliographical note
Publisher Copyright: © 2026. Published by Elsevier Ltd.
Title: 'Validating AI-generated classroom observations: Reliability, accuracy, and limits of LLM-based pedagogical judgment'