Large Language Model–Generated Tutoring Responses Receive Higher Ratings but Are Penalized by Source-Based Evaluation Bias
Ola Ozernov-Palchik, Fabio Catania, John D. E. Gabrieli, Satrajit S. Ghosh
Identifiers and access
- DOI: 10.31234/osf.io/stek7_v1
- Cited by: 0
Key findings
In a factorial experiment with adults and K–12 educators, identical tutoring responses were rated as more effective when attributed to a human than to AI, even though LLM-authored responses actually outscored human-authored ones, revealing a robust source-based bias against AI in tutoring.
Abstract
Source: OpenAlex
Large language models (LLMs) are increasingly positioned as scalable providers of individualized tutoring, yet their pedagogical quality and social reception in authentic K–12 instructional contexts remain understudied. We conducted a factorial experiment in which independent participant samples evaluated identical tutoring responses under different source-framing conditions. Adults and K–12 educators evaluated tutoring responses drawn from a randomized controlled trial with third- and fourth-grade students. Human-authored responses were produced by trained tutors during real instructional sessions; matched responses were generated by an LLM from the same conversational contexts. Responses varied in true authorship (human vs. LLM), and separate participant groups evaluated identical response sets after being told that the responses were generated either by humans or by AI. When attributed authorship was held constant, LLM-generated responses were rated as more effective, engaging, and responsive than those written by human tutors. However, identical responses received lower evaluations when attributed to AI, revealing a robust source-based evaluation bias. Computational linguistic analyses showed that AI and human responses were semantically similar but stylistically distinguishable, and that the features most strongly associated with higher ratings were largely independent of those differentiating the sources. These findings provide an ecologically grounded account of how LLM-generated instructional content is evaluated: LLM responses can match or exceed human responses on perceived pedagogical dimensions while being systematically penalized when attributed to AI.
Topics
- ml-nlp-knowledge
- child-development-education
Lab authors
This record was curated from the lab's CV, NCBI MyBibliography, and OpenAlex. See PROJECTS.md for how to add or correct an entry via a pull request.