Machine Learning in the Educational Context: Evidence for the Accuracy of Automated Essay Scoring in the Subject of English

Jennifer Meyer, Thorben Jansen, Johanna Fleckstein, Stefan Keller, Olaf Köller

December 2020

Abstract

Essay writing is an important skill in both first and foreign language learning. Argumentative writing in particular is an important aspect of final examinations in upper secondary school as well as university entrance exams (e.g., TOEFL®). Despite their importance, argumentative writing competencies have rarely been investigated empirically in the context of large-scale educational assessments. One reason is that the process of rating the essays, which is needed to obtain valid scores, is both time-consuming and expensive, requiring a large number of well-trained human raters. To reduce the cost of scoring essays, computerized automated scoring techniques can be applied to approximate the scores given by expert human raters. For that purpose, linguistic text features can be analyzed with computer-based algorithms and then combined using machine learning techniques (e.g., gradient boosting or regression analyses) to predict human scores. The present research illustrates this approach, highlighting the potential of automated scoring methods by applying regression analyses and gradient boosting to an existing data set of students’ essays. We analyzed a sample of N = 2179 essays written by students in upper secondary schools in Germany and Switzerland (grade 11). We used the open-source software CTAP to code 173 linguistic features automatically. These linguistic features were used to predict the scores given by expert human raters trained by the Educational Testing Service (ETS). Results showed the accuracy of the prediction to be satisfactory (r = .75; percentage of exact agreement 42%) and comparable to the scores computed by commercial software from ETS (e-rater®; r = .81; percentage of exact agreement 42%). Our study shows similar results for linear regression analysis and gradient boosting as two different strategies to predict essay scores. Opportunities and challenges of automated essay scoring in the context of foreign language assessment and its application in the school context are discussed.
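
The two accuracy measures reported above, the correlation r between predicted and human scores and the percentage of exact agreement, can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the toy scores, the assumed integer score range of 0 to 5, and the half-up rounding of continuous predictions are all assumptions made for illustration.

```python
from math import floor, sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def exact_agreement(human, predicted, lowest=0, highest=5):
    """Share of essays whose rounded prediction equals the human score.

    The score range (lowest, highest) is an assumption; adjust it to the
    rubric actually in use.
    """
    rounded = []
    for p in predicted:
        r = floor(p + 0.5)                          # round half up
        rounded.append(min(highest, max(lowest, r)))  # clip to the scale
    matches = sum(1 for h, r in zip(human, rounded) if h == r)
    return matches / len(human)


# Toy data, invented for illustration only (not taken from the study).
human_scores = [3, 4, 2, 5, 3, 1]
model_scores = [3.2, 3.4, 2.6, 4.7, 2.9, 1.8]

print("r =", round(pearson_r(human_scores, model_scores), 2))
print("exact agreement =", exact_agreement(human_scores, model_scores))
```

A model can reach a high correlation while matching the human score exactly in well under half of the cases, which is why the two measures are reported side by side.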

Type

Journal article

Publication

Zeitschrift für Pädagogische Psychologie, 37