Educational justice: Reliability and consistency of large language models for automated essay scoring and its implications

Abstract

Maintaining consistency in automated essay scoring is essential to guarantee fair and dependable assessments. This study investigates consistency and provides a comparative analysis of open-source and proprietary large language models (LLMs) for automated essay scoring (AES). Student essays were each assessed five times to measure both intrarater reliability (using the intraclass correlation coefficient and the repeatability coefficient) and interrater reliability (using the concordance correlation coefficient) across several models: GPT-4, GPT-4o, GPT-4o mini, GPT-3.5 Turbo, Gemini 1.5 Flash, and LLaMa 3.1 70B. Essays and marking criteria were used to construct prompts, which were sent to each model to obtain scores. Results indicate that the scores generated by GPT-4o align closely with human assessments, demonstrating fair agreement across repeated measures. In particular, GPT-4o exhibits slightly higher concordance correlation coefficients (CCC) than GPT-4o mini, indicating better agreement with human scores. Qualitatively, however, none of the models is consistent in its scoring rationale across repeated evaluations. These results indicate that the challenges currently facing LLM-based automated essay scoring need to be analyzed not only quantitatively but also qualitatively. We additionally apply more sophisticated prompting methods to address the inconsistencies observed in the initial measurements. Despite the apparent reliability of some models in our study, the choice of LLM should be considered carefully in practical AES implementations.
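
To make the reliability measures concrete, the sketch below (not taken from the paper) shows how Lin's concordance correlation coefficient and the repeatability coefficient could be computed with NumPy for repeated LLM scores. The array names and score values are hypothetical, and the intraclass correlation coefficient would typically be obtained from a dedicated statistics package rather than computed by hand.

```python
# Minimal sketch (hypothetical data): Lin's concordance correlation
# coefficient (CCC) and the repeatability coefficient (RC) for an
# (essays x repeats) matrix of LLM scores plus human reference scores.
import numpy as np

def concordance_ccc(x: np.ndarray, y: np.ndarray) -> float:
    """Lin's CCC between two raters' scores (e.g., LLM mean vs. human)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()                # population variances (n denominator)
    cov = ((x - mx) * (y - my)).mean()       # population covariance
    return 2 * cov / (vx + vy + (mx - my) ** 2)

def repeatability_coefficient(repeated: np.ndarray) -> float:
    """RC = 1.96 * sqrt(2) * within-essay SD, pooled over essays."""
    within_var = repeated.var(axis=1, ddof=1).mean()
    return 1.96 * np.sqrt(2.0) * np.sqrt(within_var)

# Hypothetical example: 3 essays scored 5 times by one LLM, plus human scores.
llm_scores = np.array([[78, 80, 79, 81, 78],
                       [65, 66, 64, 65, 67],
                       [90, 88, 91, 89, 90]], dtype=float)
human_scores = np.array([80, 63, 92], dtype=float)

print("CCC (LLM mean vs. human):",
      round(concordance_ccc(llm_scores.mean(axis=1), human_scores), 3))
print("Repeatability coefficient:",
      round(repeatability_coefficient(llm_scores), 3))
```

A lower repeatability coefficient indicates tighter agreement across the five repeated scorings of the same essay, while a CCC closer to 1 indicates stronger agreement with the human reference scores.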

https://doi.org/10.37074/jalt.2025.8.1.21