AI as an evaluator — Jeanmarcos

The arrival of language models in academia was discussed, almost immediately, from a single angle: how to stop students from using them to do their assignments. But the phenomenon has two faces, and the second is rarely examined with the same rigor. Generative AI didn’t only change what the evaluated can do; it also changed what the evaluator does, and what they believe they can do.

This piece starts from an uncomfortable observation: in many courses, students are forbidden from using AI to produce their work while, at the same time, some faculty turn to AI-based tools to decide whether those submissions were machine-generated, or even to assign a grade. An asymmetry takes shape: the very technology deemed illegitimate in the hands of the evaluated is assumed reliable, neutral, and sufficient in the hands of the evaluator.

What the evidence says

The recent literature is consistent, and discouraging for anyone expecting a guarantee from these tools.

On detection, a systematic test of fourteen tools concluded that they are neither accurate nor reliable¹: they lean toward classifying text as human-written and are fooled by transformations as simple as paraphrasing or translating. On the theoretical side, as models improve, the text they produce becomes statistically harder to tell apart from human writing, so reliable detection is, in the limit, infeasible². On top of the accuracy limits sits a fairness problem: in one study, detectors wrongly flagged as “AI-generated” more than half of the essays written by non-native English speakers³, penalizing a style rather than a behavior. The most telling acknowledgment came from the industry itself: OpenAI retired its text classifier in 2023 for low accuracy⁴. And the consequences aren’t hypothetical: there are students unjustly accused on the basis of a false positive they cannot refute⁵.

On the LLM as a judge, where the model grades rather than detects, well-documented systematic biases appear: position bias (an answer is favored because of where it appears, typically first), verbosity bias (length is rewarded), and self-preference bias (text in the model’s own style is favored)⁶, ⁷. A judge whose verdict shifts with order, length, or style isn’t measuring only what it claims to measure.
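One way to make position bias concrete is a swap test: show the judge the same two answers in both orders and count how often the preferred text changes. The sketch below is illustrative only. `make_toy_judge` is a hypothetical stand-in, not a real model or API: it picks the first slot with a configurable probability and otherwise prefers the longer text, so it also imitates verbosity bias.

```python
import random
from typing import Callable

# A judge receives (first_text, second_text) and answers "first" or "second".
Judge = Callable[[str, str], str]

def make_toy_judge(first_slot_bias: float, seed: int = 0) -> Judge:
    """Hypothetical judge: with probability `first_slot_bias` it picks the first
    slot regardless of content; otherwise it picks the longer text."""
    rng = random.Random(seed)

    def judge(first: str, second: str) -> str:
        if rng.random() < first_slot_bias:
            return "first"
        return "first" if len(first) >= len(second) else "second"

    return judge

def swap_inconsistency(judge: Judge, pairs: list[tuple[str, str]]) -> float:
    """Fraction of pairs whose preferred *text* changes when the order is swapped."""
    flips = 0
    for a, b in pairs:
        winner_ab = a if judge(a, b) == "first" else b
        winner_ba = b if judge(b, a) == "first" else a
        if winner_ab != winner_ba:
            flips += 1
    return flips / len(pairs)

# Invented example pairs; only their relative lengths matter to the toy judge.
pairs = [
    ("short answer", "a considerably longer answer"),
    ("a fairly detailed response", "brief"),
]
print(swap_inconsistency(make_toy_judge(0.4, seed=1), pairs))
```

A judge free of position bias would score 0.0 on this measure. Any value above that means the verdict depends partly on the slot rather than on the text.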

And there’s a decisive technical property the pedagogical debate tends to overlook: these systems are probabilistic, not deterministic. They generate text by sampling, so the same query on the same document can produce different responses on each run⁸, and that non-determinism can persist even with randomness nominally switched off⁹. In assessment, this translates into variance: a single run can mislead if it isn’t accompanied by the variance across multiple runs¹⁰.
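A toy example can show why sampling produces different answers. The sketch below assumes a hypothetical model that has assigned scores (logits) to four candidate grades; the numbers are invented for illustration. With a temperature above zero the grade is drawn at random from a softmax distribution. With temperature zero this toy always takes the top score, although, as noted above, real inference systems may still vary.

```python
import math
import random

# Invented logits for illustration: a hypothetical model's score for each grade.
GRADE_LOGITS = {"A": 2.0, "B": 1.6, "C": 0.5, "D": -1.0}

def softmax(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = {k: v / temperature for k, v in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - peak) for k, v in scaled.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def sample_grade(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Greedy pick when temperature is 0, otherwise a random draw from the softmax."""
    if temperature == 0:
        return max(logits, key=logits.get)
    probs = softmax(logits, temperature)
    grades = list(probs)
    return rng.choices(grades, weights=[probs[g] for g in grades], k=1)[0]

rng = random.Random()  # unseeded on purpose: each execution may print a different list
print([sample_grade(GRADE_LOGITS, 1.0, rng) for _ in range(10)])
print([sample_grade(GRADE_LOGITS, 0, rng) for _ in range(10)])  # always "A" in this toy
```

The first list will usually mix grades, because "A" and "B" have similar scores. That is the situation in which a single run is least informative.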

In response, the bodies that have taken the problem seriously agree on shifting the focus from surveillance to redesign: UNESCO calls for a human-agency-centered approach¹¹, and regulators like TEQSA¹² and QAA¹³ acknowledge that detectors don’t guarantee integrity and recommend rethinking assessment rather than merely banning.

The problem: the evaluated–evaluator asymmetry

The ban on the evaluated has a reasonable justification: if an activity’s goal is to develop a competence (to argue, to code, to prove, to synthesize), handing that task to a generative tool empties the activity of its formative meaning.

The asymmetry shows up on the other side of the desk. To verify compliance with that ban, some faculty turn to AI detectors, and in some cases use an LLM to assign the grade directly or write the feedback. The result is a paradox: the same class of technology held illegitimate when used by the evaluated is assumed valid, sufficient, and neutral when used by the evaluator. Defending a sanction on the basis of an instrument that its own makers and the literature consider unreliable shifts the cost of the evaluator’s uncertainty onto the evaluated.

The most visible consequence is a rise in structural distrust. The evaluator, suspecting that any submission could be AI-generated, adopts surveillance by default; the evaluated, facing a false positive they can’t technically refute, perceives the process as arbitrary. The educational relationship, which needs a minimum of trust to work, is strained from both ends.

A larger question surfaces underneath: if a take-home activity can be solved indistinguishably by an LLM, what evidence of learning does its submission really provide? The problem isn’t only detection, it’s the validity of assessment as a device for certifying learning.

Ethical and epistemic dimensions

Delegation, validity, and responsibility. A grade is an act with institutional authority and material consequences for a person. If the grade comes from a model’s output, who answers for it? Delegation doesn’t remove the evaluator’s responsibility; at most it hides it. And the validity guarantees are weak: detectors with false positives and negatives, judges with biases. Resting a high-stakes decision on such an instrument, without cross-checking it against human judgment, offers a guarantee the instrument can’t sustain.

Non-determinism. This is perhaps the least discussed and the most decisive point. Running the same review N times on the same document can yield different results each time: different verdicts, different grades, different justifications. It’s not an anomaly fixed by a better prompt; it follows from how these systems work. Whoever doesn’t know this grants a random process the authority of a deterministic one.
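A minimal way to see this in practice is to repeat the review and report the spread rather than a single number. The sketch below assumes a hypothetical `grade_once` callable wrapping whatever tool is in use. Here it is replaced by a simulated grader with random noise, and the pass mark of 60 is also an assumption, so the output is illustrative and not a measurement of any real system.

```python
import random
import statistics
from collections import Counter
from typing import Callable

PASS_MARK = 60.0  # assumed cutoff, for illustration only

def summarize_runs(grade_once: Callable[[str], float], document: str, n: int) -> dict:
    """Grade the same document n times and describe how much the result moves."""
    scores = [grade_once(document) for _ in range(n)]
    verdicts = Counter("pass" if s >= PASS_MARK else "fail" for s in scores)
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if n > 1 else 0.0,
        "verdicts": dict(verdicts),
    }

def make_simulated_grader(center: float, noise: float, seed: int) -> Callable[[str], float]:
    """Hypothetical stand-in for an LLM grader: a fixed score plus Gaussian noise,
    clamped to the 0-100 range. It ignores the document entirely."""
    rng = random.Random(seed)
    return lambda _document: max(0.0, min(100.0, rng.gauss(center, noise)))

report = summarize_runs(make_simulated_grader(62.0, 5.0, seed=7), "same essay", n=20)
print(report)
```

When the center sits near the pass mark, the `verdicts` tally can contain both "pass" and "fail" for the same essay. A single run hides that split.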

The internal contradiction. The argument that bans the evaluated from using an LLM (that delegating the cognitive task to the machine empties the learning) applies with equal force to the evaluator who delegates their own task of judging to the machine. You can’t coherently hold that the tool invalidates the student’s work and, at the same time, validates the instructor’s.

Opacity and verifiability. The proprietary LLMs used to evaluate are, to whoever uses them, black boxes: their data and weights are unknown, and their internal behavior can’t be audited. This isn’t a subjective impression: indices that measure the transparency of large models report low, uneven levels¹⁴, and the specialized literature advises against using black boxes for high-stakes decisions, exactly the category a grade belongs to¹⁵. Add to this the appeal to authority (“my tool is valid because I pay for it or because it was recommended”), which is evidence of nothing: price and prestige aren’t measures of accuracy. And when the processing happens in an online service, the evaluator has no way to verify what goes on at the other end. Pushed to its logical extreme, they couldn’t rule out that, instead of a substantive analysis, the service inserts an artificial delay and returns a score from a random number generator.

Without verifiability, trusting the result is an act of faith, not an epistemic guarantee. An evaluator who can’t inspect the process is in no position to answer for it.
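The thought experiment above takes only a few lines to write. The sketch below is hypothetical and is not the code of any real service: `fake_detector`, its field names, and its 50-point threshold are invented for illustration. From the outside, a caller sees only a pause and a number.

```python
import random
import time

def fake_detector(text: str, delay_seconds: float = 1.5) -> dict:
    """Hypothetical 'detector' that never reads `text`.

    It waits to look busy, then returns a random score with a verdict attached.
    """
    time.sleep(delay_seconds)        # artificial delay; no analysis happens
    score = random.randint(0, 100)   # the text plays no part in the score
    return {
        "ai_probability": score,
        "verdict": "AI-generated" if score >= 50 else "human-written",
    }

print(fake_detector("Any essay at all.", delay_seconds=0.1))
```

Nothing in the returned value distinguishes this function from one that performs a real analysis, which is why access to the process, and not just the output, matters.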

Institutional inaction and defensive responses. Between high-level guidance and classroom practice lies a gap: without clear criteria on which tools may be used, with what safeguards and under what evidence thresholds, each evaluator resolves a structural problem on their own. From there come defensive responses with a fairness cost: deliberately convoluted activities to have “something” to ground a suspicion, or inflated demands on the assumption that all students will use AI. Whoever doesn’t use it, out of conviction or lack of access, is measured by a yardstick calibrated for a scenario that isn’t theirs. Assessment stops measuring learning and starts measuring, in part, access to technology.

Should we rethink assessment?

The defensive responses share a trait: they try to preserve an inherited assessment model by adding layers of surveillance or difficulty. It’s worth asking whether the problem isn’t the model itself.

Much of the problematic practice springs from a literacy gap: AI tools are used to judge others’ work without understanding what they do or what guarantees they offer. Training faculty is a precondition for any responsible use, not as mere tool drilling but as a way to internalize three ideas the evidence makes unavoidable: that detectors aren’t reliable, that LLM-based judges have systematic biases, and that their outputs are probabilistic and variable.

The underlying question is whether instruments designed for a world without generative AI keep their ability to certify learning. The take-home case is illustrative: an activity an LLM solves indistinguishably stops providing reliable evidence of what the student can do¹⁶. None of this means such activities have no value as practice, but it does question their use as an instrument of certification. The alternative the literature points to isn’t instructions a student can ignore, but structural changes in assessment design¹⁷, aimed at developing the student’s own evaluative judgment¹⁸.

And one question stays open: although the diagnosis is widely shared, the transition from surveillance to redesign moves slowly. Redesign demands time, training, and resources; in the meantime, each instructor improvises individual defenses whose effects on fairness and trust are already known.

Closing reflection

The thesis is a single one: the debate concentrated on one side of the desk and left the other in the shadows. Restoring the symmetry isn’t about reversing suspicion, but about reinstating a requirement that holds for everyone: decisions with consequences for people must rest on verifiable guarantees and not on assumptions.

The evidence authorizes no condemnation, but no complacency either. Detectors aren’t reliable and penalize legitimate styles; models used as judges carry biases; their outputs are probabilistic; and their inner workings are, to whoever uses them, a black box. None of these facts on its own forbids using AI in assessment, but together they impose a modest, firm conclusion: these tools can’t take the place of human judgment; at most they can accompany it under conditions of transparency and cross-checking.

It’s worth keeping in view whoever stands at the end of the chain. For the evaluated, an arbitrary grade or an accusation they can’t refute doesn’t end in the classroom: it goes on their record and shapes their path, including access to scholarships, graduate study, and employment. Added to that cost is demotivation: when a student senses that assessment no longer measures their effort but the output of an opaque system they can’t question, its formative function collapses.

Trust, the material the educational relationship is made of, isn’t restored with automated suspicion, but with processes that either party can understand, question, and, when it comes to it, refute. That, and not detection, is probably the task that remains.

If you want to see the black-box argument made tangible, open the tool: it’s a detector that detects nothing, with its code in plain sight.

References

1. Weber-Wulff, D., et al. (2023). Testing of detection tools for AI-generated text. International Journal for Educational Integrity, 19(1), 26. doi.org/10.1007/s40979-023-00146-z
2. Sadasivan, V. S., et al. (2023). Can AI-Generated Text be Reliably Detected? arXiv:2303.11156. arxiv.org/abs/2303.11156
3. Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., & Zou, J. (2023). GPT detectors are biased against non-native English writers. Patterns, 4(7), 100779. doi.org/10.1016/j.patter.2023.100779
4. OpenAI (2023). New AI classifier for indicating AI-written text (discontinued July 20, 2023). openai.com
5. Klee, M. (2023). She Was Falsely Accused of Cheating With AI — And She Won’t Be the Last. Rolling Stone. rollingstone.com
6. Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023. arxiv.org/abs/2306.05685
7. Shi, L., et al. (2024). Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge. arXiv:2406.07791. arxiv.org/abs/2406.07791
8. Renze, M., & Guven, E. (2024). The Effect of Sampling Temperature on Problem Solving in Large Language Models. Findings of the ACL: EMNLP 2024. doi.org/10.18653/v1/2024.findings-emnlp.432
9. He, H., & Thinking Machines Lab (2025). Defeating Nondeterminism in LLM Inference. thinkingmachines.ai
10. Biderman, S., et al. (2024). Lessons from the Trenches on Reproducible Evaluation of Language Models. arXiv:2405.14782. arxiv.org/abs/2405.14782
11. Miao, F., & Holmes, W. (2023). Guidance for Generative AI in Education and Research. UNESCO. doi.org/10.54675/EWZM9535
12. Lodge, J. M. (2024). The Evolving Risk to Academic Integrity Posed by Generative AI: Options for Immediate Action. TEQSA. teqsa.gov.au
13. QAA (2023). Reconsidering Assessment for the ChatGPT Era. qaa.ac.uk
14. Bommasani, R., et al. (2023). The Foundation Model Transparency Index. arXiv:2310.12941. arxiv.org/abs/2310.12941
15. Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1, 206–215. doi.org/10.1038/s42256-019-0048-x
16. Kofinas, A. K., Tsay, C. H.-H., & Pike, D. (2025). The impact of generative AI on academic integrity of authentic assessments within a higher education context. British Journal of Educational Technology, 56(6), 2522–2549. doi.org/10.1111/bjet.13585
17. Corbin, T., Dawson, P., & Liu, D. (2025). Talk is cheap: why structural assessment changes are needed for a time of GenAI. Assessment & Evaluation in Higher Education, 50(7), 1087–1097. doi.org/10.1080/02602938.2025.2503964
18. Bearman, M., Tai, J., Dawson, P., Boud, D., & Ajjawi, R. (2024). Developing evaluative judgement for a time of generative artificial intelligence. Assessment & Evaluation in Higher Education, 49(6), 893–905. doi.org/10.1080/02602938.2024.2335321