Accuracy pecking order – How 30 AI detectors stack up in detecting generative artificial intelligence content in university English L1 and English L2 student essays

Abstract

This study evaluated the accuracy of 30 AI detectors in identifying generative artificial intelligence (GenAI)-generated and human-written content in university English L1 and English L2 student essays. Forty student essays were divided into four sets of ten essays each, by language category (English L1 and English L2) and by undergraduate module (a second-year module and a third-year module). The 30 AI detectors comprised freely available detectors and non-premium versions of online AI detectors. Employing a critical studies approach to artificial intelligence, the study addressed three research questions. It calculated the accuracy, false positive rates (FPRs), and true negative rates (TNRs) of all 30 AI detectors for all essays in each of the four sets, to determine how accurately each detector identified the GenAI content of each essay. It also used confusion matrices to determine the specificity of the best- and worst-performing AI detectors. Several results stand out. Firstly, only two AI detectors, Copyleaks and Undetectable AI, correctly detected all of the essay sets of both language categories (English L1 and English L2) as human written; these two detectors therefore shared first place in the GenAI detection accuracy ranking. Secondly, nine of the 30 AI detectors misidentified all the essays in each of the four essay sets of the two language categories in both modules, and thus shared last place. Thirdly, the remaining 19 AI detectors classified the four essay sets both correctly and incorrectly to varying degrees, without bias toward any essay set of either language category. Fourthly, none of the 30 AI detectors tended to be biased toward a specific English language category in classifying the four essay sets. Lastly, the results suggest that the bulk of the currently available AI detectors, especially the free-to-use ones, are not fit for purpose.
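
The three metrics named in the abstract can be illustrated with a minimal sketch. It assumes that human-written essays are the negative class and GenAI-generated essays the positive class, so that a false positive is a human-written essay flagged as AI. The function name `detector_metrics` and the example counts are hypothetical and are not taken from the study's data.

```python
import math


def detector_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, FPR, and TNR from confusion-matrix counts.

    Assumed convention (not confirmed by the source):
      tp: GenAI essays flagged as AI        fn: GenAI essays passed as human
      fp: human essays flagged as AI        tn: human essays passed as human
    """
    total = tp + fp + tn + fn
    negatives = fp + tn  # all human-written essays

    # Guard against empty inputs rather than dividing by zero.
    accuracy = (tp + tn) / total if total else math.nan
    fpr = fp / negatives if negatives else math.nan
    tnr = tn / negatives if negatives else math.nan  # TNR is also called specificity

    return {"accuracy": accuracy, "fpr": fpr, "tnr": tnr}


# Hypothetical example: one set of ten human-written essays,
# of which a detector wrongly flags three as AI-generated.
example = detector_metrics(tp=0, fp=3, tn=7, fn=0)
print(example)
```

When every essay in a set is human written, accuracy equals the TNR, and the FPR is its complement (FPR = 1 - TNR), which is why a detector that flags no essay scores perfectly on all three.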

https://doi.org/10.37074/jalt.2024.7.1.33