Let me tell you about an experiment I recently ran with a couple of AI platforms (for full disclosure, Google's Gemini and Anthropic's Claude). I gave them (them?) a rubric and a video essay and asked them to evaluate student work, first in the basic, commercially available platforms and then in something more elaborate: two semi-agentic bots I designed (Gem and Cowork) that had task-specific clarifications and deeper frontloading of interpretive contexts.
That's the right instinct…I think…
The educator community is exactly where this lands with the most force—because we are the ones being asked, with increasing urgency, to have a position on AI in assessment. And most of the conversation is stuck in the wrong place: can students use it to cheat, and how do we detect it?
I say, who cares?
Why let this ruin your day, or ruin the relationship you have with a student?
I have detractors. You can only imagine…
What I built to test this reframes the question entirely. The interesting pedagogical problem isn't AI as threat to assessment integrity—it's AI as a mirror for what assessment actually is. When two differently-structured systems evaluate the same oral analysis and reach different conclusions about student A's organisation or student B's criterion knowledge, that's not a failure of the technology. That's the technology revealing something that was always true and usually invisible: that assessment is an interpretive act, not a measurement. And now my post-structuralist Spidey senses are tingling…
For an educator audience, the demonstration has a few layers they could, and should, unpack: The first is epistemological—what does it mean to know a student performed at a 4 versus a 5? The second is methodological—what do inter-rater reliability protocols actually protect against, and what do they miss? And the third, which is the one that might genuinely shift practice, is this: if two AI systems disagree about a student's oral analysis in the same way two human examiners might, then the response isn't to trust one more than the other—it's to treat the disagreement itself as information, and to teach students to do the same with their own interpretive disagreements about texts.
That's a course in critical thinking wrapped inside an assessment methodology story. A riddle wrapped in a mystery, wrapped inside an enigma. Uh-oh…
Let me start with a small experiment.
A student delivers an oral examination. He speaks for approximately ten minutes about Wordsworth's "Tintern Abbey": confidently, fluently, with a sophisticated grasp of context and an original interpretive formulation at the end that could fairly be described as the most intellectually impressive moment of any oral in his cohort.
Two assessors evaluate his performance—the typical moderation move. Both are working from the same rubric. Both are attending to the same criteria. Both are appropriately caffeinated. One gives him 17 out of 20. The other gives him 16.
This isn't a story about incompetent marking. It isn't even a story about inconsistency. It is, I want to argue, a story about what assessment actually is—and it becomes considerably more interesting when I tell you that neither of those assessors was human.
The experiment I'm describing was not designed as an experiment. It emerged from a practical workflow that many teachers working with AI tools will recognise: a pipeline (a stack) that takes student video submissions, transcribes them automatically using Whisper (OpenAI's speech-to-text model), passes those transcripts to a large language model for assessment against a rubric, and generates a formatted feedback document.
The second assessor was the same underlying model, Claude, but working manually, in conversation, having watched the video and then read a transcript of it.
The discrepancies, when we laid them out side by side, were illuminating. For the student in question—let's call him Henry—the pipeline gave him Criterion A: 4/5, missing the interpretive sophistication of what he'd actually argued. The manual assessment gave him Criterion A: 5/5, with a specific note about his coinage of the phrase "reconstructive education" as an original critical formulation. The pipeline couldn't find that phrase because Whisper hadn't heard it clearly.
A different student—Jack—lost two full marks to the pipeline, including a C criterion score of 3 instead of 4, because his structural approach was read as "recursive" rather than recognised as the sustained, argument-driven organisation it actually was. And perhaps most instructively: the pipeline penalised a third student for apparently misusing the term "diction"—writing that he should "watch for slips such as 'addiction' for 'diction'"—when in fact the student had said "diction" perfectly correctly. Whisper had simply misheard him, and the model downstream had taken the transcript at face value, turning a transcription error into a language penalty.
Here's the thing. We could stop the story there and draw the obvious conclusion: AI assessment pipelines are unreliable, students were disadvantaged, the technology failed. That conclusion is not wrong. But it is, I think, incomplete—and stopping there means missing something genuinely important about what this experiment revealed.
Assessment theory has had a complicated relationship with the concept of reliability for a long time.
The dominant paradigm in assessment for much of the twentieth century was psychometric: the idea that good assessment is reliable assessment, meaning that different assessors evaluating the same performance should reach the same conclusion. Inter-rater reliability, the statistical measure of agreement between assessors, became the gold standard of assessment quality. On this view, if two markers disagree, one of them is wrong, or the rubric is ambiguous, or the training was insufficient. The goal was convergence. Statistical validity becomes a knowing joke at the pub…and then we're all talking about statistics, validity scores, Pearson correlation coefficients.
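For readers who have never seen what "inter-rater reliability" looks like as a calculation, here is a minimal sketch. It computes raw percent agreement and Cohen's kappa, one common agreement statistic that corrects for the agreement two raters would reach by chance. Kappa is my choice for illustration only, and nothing in this piece depends on it. The function names and the two lists of marks are invented for the example and are not the project's data.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of performances on which both raters gave the identical mark."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must mark the same non-empty set of performances")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same mark
    # if each simply followed their own overall marking distribution.
    expected = sum(counts_a[m] * counts_b[m] for m in set(rater_a) | set(rater_b)) / (n * n)
    if expected == 1.0:
        # Both raters gave one identical mark throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented marks out of 5 for six performances (illustrative only).
rater_1 = [5, 4, 4, 3, 5, 4]
rater_2 = [4, 4, 3, 3, 5, 4]

print(f"Percent agreement: {percent_agreement(rater_1, rater_2):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(rater_1, rater_2):.2f}")
```

Notice what the single number throws away: it reports how often the raters agreed, and says nothing about why they disagreed when they did.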
Carol Gipps, in her landmark 1994 work Beyond Testing, argued that this paradigm had fundamentally misconceived what educational assessment is doing. Psychometric reliability, she suggested, was borrowed from a measurement model designed for stable, objective phenomena—lengths, weights, temperatures—and imposed on something categorically different: the evaluation of a human performance that is irreducibly interpretive, contextual, and relational. Assessment, Gipps argued, is not measurement. It is judgment. And judgment, unlike measurement, is not improved by pretending it is objective.
My father, a nuclear physicist, would identify this discrepancy as the phenomenological differences between soft science and hard science.
D.R. Sadler's influential 1989 paper on formative assessment identified what he called "guild knowledge"—the tacit, accumulated understanding of quality that experienced assessors develop and that cannot be fully articulated in a rubric. It is what I call the 'black art' of IB assessment, broadly, comprehensively. Expert assessors, Sadler observed, don't simply apply criteria mechanically. They attend to the whole performance, they notice what is remarkable, they situate what they see within a mental model of the range of possible performances, and they make a judgment that is, in the end, irreducibly interpretive. The rubric is a scaffold for that judgment, not a replacement for it.
More recently, research in oral and multimodal assessment has pressed this point further. Gunther Kress and others working in multimodal communication theory have argued that performance-based assessment is evaluating something fundamentally different from what written transcripts can represent: the voice, the pacing, the confidence of a claim delivered in real time, the way a speaker's intonation can make the same words carry more or less interpretive weight. When Henry says "reconstructive education" in a way that signals he has arrived at the formulation rather than rehearsed it, that arrival is part of what is being assessed. A transcript cannot carry it.
None of this literature was written with AI assessment in mind. But it might as well have been.
Here is the reframe I want to offer: the disagreement between our two AI assessors is not a bug in the technology. It is the technology doing something that human assessment has never managed to do quite so visibly—making the interpretive gap between reading a performance and attending to a performance legible.
When two human examiners mark the same oral examination and reach different conclusions, we have institutional mechanisms for handling that. We average the scores. We have a senior examiner adjudicate. We run calibration sessions. We train toward convergence, building elaborate algorithms of statistical validity. All of these mechanisms are designed, at some level, to suppress the disagreement rather than examine it—because in an examination context, divergence is a problem to be resolved rather than information to be used.
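To make the contrast concrete, here is a small sketch of the two moves: averaging the totals versus listing the criteria on which the assessors diverge. The totals (16 and 17) and the Criterion A marks (4 and 5) are Henry's, as described above. The criterion labels B to D and their marks are invented placeholders, chosen only so that the totals add up, and the function name is hypothetical.

```python
def compare_assessments(first, second):
    """Return the averaged total and the per-criterion disagreements.

    Both arguments map a criterion label to a mark. The averaged total is
    what moderation usually keeps; the disagreements are what it discards.
    """
    if first.keys() != second.keys():
        raise ValueError("Both assessments must cover the same criteria")
    mean_total = (sum(first.values()) + sum(second.values())) / 2
    disagreements = {
        criterion: (first[criterion], second[criterion])
        for criterion in sorted(first)
        if first[criterion] != second[criterion]
    }
    return mean_total, disagreements


# Criterion A marks are from the comparison described in this piece.
# B, C and D are placeholder values so the totals come to 16 and 17.
pipeline = {"A": 4, "B": 4, "C": 4, "D": 4}
practitioner = {"A": 5, "B": 4, "C": 4, "D": 4}

mean_total, disagreements = compare_assessments(pipeline, practitioner)
print("Averaged total:", mean_total)
for criterion, (first_mark, second_mark) in disagreements.items():
    print(f"Criterion {criterion}: pipeline {first_mark}, practitioner {second_mark}")
```

The averaged total is tidy and tells you nothing. The list of disagreements is where the conversation starts.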
What happens when the disagreement is between a pipeline and a practitioner? Suddenly we have to look at why they disagree. And when we do, we find something instructive: the pipeline failed Henry not because its rubric was wrong or its training was insufficient, but because it was working from a degraded representation of his performance. It was reading a transcript. The practitioner was attending to a voice.
This distinction, between reading and attending, is, I would argue, precisely what Sadler's "guild knowledge" is about. It is what Gipps meant when she said assessment is judgment, not measurement. And it is what the IB oral examination is designed to test: not a student's ability to produce an analysable text, but their capacity to think interpretively in real time, in a register that is simultaneously analytical and personal, about a work of literature that requires both.
The pipeline didn't fail because AI can't assess. It failed because the wrong version of AI was asked to assess the wrong representation of the thing being assessed. That's a design problem. And design problems are solvable.
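One possible shape for such a fix, offered as a sketch and not as what my pipeline actually does: before a language penalty reaches a student, check whether the words the feedback comments on were words the transcriber was unsure of, and if so send the comment to a human with the recording. This assumes the speech-to-text step can supply a per-word confidence score, which is an assumption about your tooling. The `Word` structure, the threshold, and the confidence values below are all hypothetical; only the quoted feedback comment comes from the case described above.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # hypothetical 0.0-1.0 score from the transcription step


def uncertain_words(transcript, threshold=0.6):
    """Lower-cased words the transcriber was unsure of (below the threshold)."""
    return {
        word.text.lower().strip(".,;:!?\"'")
        for word in transcript
        if word.confidence < threshold
    }


def needs_human_review(feedback_comment, transcript, threshold=0.6):
    """Words in a feedback comment that match low-confidence transcript words.

    A non-empty result means the comment may be judging the transcriber's
    hearing, not the student's speech, so a person should check the recording.
    """
    mentioned = set(re.findall(r"[a-z]+", feedback_comment.lower()))
    return sorted(mentioned & uncertain_words(transcript, threshold))


# Invented transcript fragment with invented confidence values.
transcript = [
    Word("his", 0.97),
    Word("addiction", 0.41),  # the student actually said "diction"
    Word("is", 0.98),
    Word("elevated", 0.93),
]
comment = "watch for slips such as 'addiction' for 'diction'"

flagged = needs_human_review(comment, transcript)
if flagged:
    print("Hold this comment for human review; uncertain words:", flagged)
```

The check is deliberately crude. Its job is not to decide who is right but to stop a transcription error from quietly becoming a language penalty.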
I want to be direct about the position I'm arguing from, because I think it matters for how educators engage with this question.
The dominant conversation about AI in assessment is structured around threat: students using AI to evade assessment, AI producing outputs that undermine academic integrity, the possibility that assessment becomes meaningless in a world where language models can generate plausible responses to any prompt. These are real concerns. I'm not dismissing them.
But the assumption embedded in the threat narrative is that assessment, as currently practised, is basically right—and that AI is a force that corrupts it. I want to propose the opposite assumption: that assessment, as currently practised, has always had profound epistemological problems that we have managed through convention rather than resolved through inquiry, and that AI—used thoughtfully—might be the most powerful tool we have ever had for surfacing those problems and thinking about them honestly.
Inter-rater reliability has always been a proxy for something we couldn't quite measure. Two examiners agree; we call the result valid…no need for z-scores and SD calculations. But agreeing examiners might both be missing the same thing—the thing that only reveals itself to the assessor who is genuinely attending, as opposed to the one who is efficiently processing. The pipeline's failure to notice Henry's "reconstructive education" formulation is not fundamentally different from a tired examiner's failure to notice it. The difference is that the pipeline makes the failure visible in a way that a tired examiner, producing a plausible-sounding rationale, does not.
That visibility is valuable. It should be welcomed, not suppressed.
I don't think the conclusion here is "use AI for assessment" or "don't use AI for assessment." I think the conclusion is considerably more interesting: use AI for assessment in ways that generate productive disagreement, discourse even, and then teach students—and teachers—to examine that disagreement.
Imagine showing a student two assessments of their oral: one generated by a pipeline from a transcript, one produced by a practitioner who watched the video. Ask the student: why do these differ? What does the pipeline miss? What does the practitioner notice that the transcript can't carry? What does that tell you about what you were actually doing when you were doing your best?
This is not a failure-state scenario. It is one of the richest assessment conversations a student and teacher could have. It is, in miniature, a lesson in hermeneutics—in the idea that texts (and performances, and transcripts of performances) do not carry stable meanings that a sufficiently trained reader will reliably extract. Meaning is made in the encounter between a reader and a text, shaped by what the reader attends to, what they bring, what they are listening for. Gadamer called this the "fusion of horizons." Bakhtin called it dialogue…love that guy…we should read him more often. Assessment theorists have been circling it for thirty years without quite naming it.
AI has named it. Accidentally, in the gap between a pipeline and a practitioner. But named it nonetheless.
If you are in a room with other educators talking about AI and assessment, I want to leave you with one provocation.
The conversation we need to be having is not: how do we stop AI from undermining assessment?
It is: what does AI reveal about what assessment has always been?
And the answer, I think, is this: assessment has always been an interpretive act, performed by a situated reader, shaped by what they attend to and what they miss, structured by conventions that produce reliability without guaranteeing validity, and improved—when it is improved—by the quality of attention that an experienced, engaged practitioner brings to a human performance.
AI doesn't change that. It makes it visible.
And visibility, in education, is almost always the beginning of something better.
This piece emerged from a classroom assessment project at the University of Toronto, Cultural Studies, drawing on English Literature oral analysis submissions on Wordsworth's "Tintern Abbey." The observations about inter-rater discrepancy are drawn from a comparison of pipeline-generated and practitioner-generated assessments of the same student performances. No real student names have been used in the public version of this piece.
Key references: D.R. Sadler, "Formative Assessment and the Design of Instructional Systems" (1989); Carol Gipps, Beyond Testing: Towards a Theory of Educational Assessment (1994); Paul Black & Dylan Wiliam, "Inside the Black Box" (1998); Gunther Kress, Multimodality: A Social Semiotic Approach to Contemporary Communication (2010); Hans-Georg Gadamer, Truth and Method (1960/1975).