Accuracy You Can Trust

We believe in transparency. Rather than making vague claims, we publish real accuracy data — tested against official examiner marks, and against a public benchmark anyone can check — so you can decide for yourself.

Tested on official IGCSE marks & a public double-marked GCSE benchmark

Last updated 17 July 2026

The question nobody else asks

0.85 vs 0.77

On 60 GCSE English essays marked independently by two qualified examiners, those examiners agreed with each other at a kappa of 0.77. Graded Pro agreed with them at 0.85. On essay marking, our AI sits closer to the examiners than they sit to each other.

How We Compare

Most marking tools compare themselves to a single examiner. That tells you very little, because examiners disagree with each other — a lot. So we tested against a benchmark where every essay was marked twice, independently. That gives an honest yardstick: how close is a second human?

On 60 double-marked GCSE essays	Graded Pro	A second examiner*
Agreement with the examiners (kappa)	0.85	0.77
Average difference from the examiner mark	2.8 marks	4.6 marks
Essays 8+ marks adrift	1 in 60	12 in 60
Marking bias	+0.7 marks	—

*Essays are from the Medly marking benchmark (Fox et al., 2026), a public dataset of real GCSE mock responses each marked independently by two qualified examiners, released under CC BY 4.0. The 60 essays are two 40-mark extended writing tasks, 30 responses each, spanning the full mark range. “A second examiner” is the agreement between the two human examiners on the same scripts. Graded Pro marked every essay from the question and mark scheme alone, with no examiner marks visible, using our standard production settings. Our marking bias of +0.7 marks means we are very slightly more generous than the examiner average. Anyone can download the dataset and repeat this.
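For anyone repeating the comparison, the table's figures can be computed along the following lines. This is a minimal sketch, not our production pipeline: the function name `summarise` is hypothetical, the marks are made up rather than taken from the benchmark, and it assumes the reference mark for each essay is the mean of the two examiner marks.

```python
from statistics import mean

def summarise(ai, examiner_a, examiner_b, adrift=8):
    """Summarise AI-vs-examiner and examiner-vs-examiner differences.

    Assumption: the reference mark is the mean of the two examiner marks.
    """
    reference = [(a + b) / 2 for a, b in zip(examiner_a, examiner_b)]
    ai_diff = [m - r for m, r in zip(ai, reference)]
    human_diff = [a - b for a, b in zip(examiner_a, examiner_b)]
    return {
        "ai_mean_abs_diff": mean(abs(d) for d in ai_diff),
        "ai_bias": mean(ai_diff),  # positive = more generous than the examiners
        "ai_adrift": sum(abs(d) >= adrift for d in ai_diff),
        "human_mean_abs_diff": mean(abs(d) for d in human_diff),
        "human_adrift": sum(abs(d) >= adrift for d in human_diff),
    }

# Made-up marks out of 40 for four scripts (illustrative only).
ai_marks = [30, 22, 35, 12]
examiner_a = [28, 20, 40, 10]
examiner_b = [32, 26, 34, 18]

summary = summarise(ai_marks, examiner_a, examiner_b)
print(summary)
```

On these made-up marks the AI averages 1.25 marks from the reference with a bias of -1.25, while the two examiners differ by 6 marks on average and are 8 or more marks apart on one script.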

Results by Subject

All results compare our marks against the marks examiners actually awarded: official examiner marks on real IGCSE examination papers, and the two independent examiner marks in the public GCSE benchmark. The only inputs were the students' work and the mark scheme, and nothing was adjusted or modified.

Mathematics

IGCSE 0580 · Papers 2 & 4 · 561 questions

Kappa 0.86

Exact match 84%

Within ±1 mark 97%

Within ±2 marks 99.5%

Average error 0.19

Handwritten scripts Yes

English Language

GCSE essays · 60 double-marked · 40 marks each

Kappa 0.85

Two examiners score 0.77

Average error 2.8 marks

Two examiners differ by 4.6 marks

Top-band average error 3.0 marks

Bias +0.7

A note on kappa. Kappa measures how closely two markers agree after correcting for the agreement expected by chance: 1 is perfect agreement and 0 is no better than chance. It is the measure exam boards use for marker agreement, and it can be inflated by pooling questions of very different sizes together. Mixing 1-mark answers with 40-mark essays makes any marker look better than it is. We calculate kappa per question and then average, which is the stricter and more standard method.
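The pooling effect is easy to reproduce. The sketch below is illustrative only: it assumes quadratic weighted kappa (this page does not state which kappa variant is used), the function name `quadratic_weighted_kappa` is hypothetical, and the marks are made up.

```python
from collections import Counter

def quadratic_weighted_kappa(marks_a, marks_b):
    """Quadratic weighted kappa between two markers on the same scripts."""
    n = len(marks_a)
    # Observed mean squared disagreement between the two markers.
    observed = sum((a - b) ** 2 for a, b in zip(marks_a, marks_b)) / n
    # Squared disagreement expected by chance, from each marker's own
    # distribution of marks.
    count_a, count_b = Counter(marks_a), Counter(marks_b)
    expected = sum(
        (i - j) ** 2 * ca * cb
        for i, ca in count_a.items()
        for j, cb in count_b.items()
    ) / (n * n)
    if expected == 0:  # both markers gave every script the same single mark
        return 1.0
    return 1.0 - observed / expected

# Made-up marks: a 1-mark question and a 40-mark essay, four scripts each.
q1_examiner, q1_ai = [0, 1, 0, 1], [0, 1, 1, 1]
q2_examiner, q2_ai = [30, 34, 36, 40], [34, 30, 40, 36]

kappa_q1 = quadratic_weighted_kappa(q1_examiner, q1_ai)
kappa_q2 = quadratic_weighted_kappa(q2_examiner, q2_ai)
per_question = (kappa_q1 + kappa_q2) / 2
pooled = quadratic_weighted_kappa(q1_examiner + q2_examiner, q1_ai + q2_ai)

print(round(per_question, 2), round(pooled, 2))
```

On these made-up marks the per-question average is about 0.44, while pooling the two questions gives about 0.99. The pooled figure is high mainly because both markers agree that essay marks are much larger than 1-mark answers, which says nothing about how well they agree within either question.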

Structured Questions

Our system is at its strongest on questions with defined correct answers — the kind that make up the majority of assessments. Across 609 structured questions in maths and English:

85%

Marks identical
to the examiner

609 questions

97%

Within ±1 mark
of the examiner

Across subjects

0.19

Average error
per question

Marks

Largest error
on any question

Marks

Whether it's a 1-mark calculation or an 11-mark multi-step problem, the AI consistently matches professional marking standards — and the maths papers above were marked from handwritten scripts, not typed answers.
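The headline figures for structured questions are simple to define. A minimal sketch, using made-up marks and a hypothetical function name `agreement_summary`, of how exact match, within ±1 mark, average error and largest error can be computed from paired marks:

```python
def agreement_summary(ai_marks, examiner_marks):
    """Per-question agreement between AI marks and examiner marks."""
    errors = [abs(a - e) for a, e in zip(ai_marks, examiner_marks)]
    n = len(errors)
    return {
        "exact_match": sum(e == 0 for e in errors) / n,
        "within_1": sum(e <= 1 for e in errors) / n,
        "within_2": sum(e <= 2 for e in errors) / n,
        "mean_abs_error": sum(errors) / n,
        "largest_error": max(errors),
    }

# Made-up marks for ten questions (illustrative only).
ai_marks = [1, 2, 0, 3, 5, 11, 2, 1, 4, 0]
examiner_marks = [1, 2, 0, 3, 4, 11, 2, 1, 4, 2]

stats = agreement_summary(ai_marks, examiner_marks)
print(stats)
```

Here eight of the ten marks are identical, nine are within one mark, the average error is 0.3 marks and the largest error is 2 marks.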

Extended Writing & Essays

Levelled questions — where markers use band descriptors to judge quality rather than tick off correct answers — are the hardest thing in marking, for anyone. Here is what the double-marked data actually shows.

Essay marking is genuinely contested. On the same 40-mark essay, two qualified examiners differed by 4.6 marks on average. On 12 of the 60 essays they were 8 or more marks apart. On one, they were 20 marks apart. This is not a criticism of examiners — it is the nature of judging writing.
We sit inside that spread. Graded Pro averaged 2.8 marks from the examiner consensus, and was 8+ marks adrift on 1 essay in 60, against the examiners' 12 in 60.
Including at the top. On the strongest essays — those the examiners placed at 32/40 or above — we averaged 3.0 marks from consensus. On several essays where one examiner awarded full marks and the other awarded 34, we landed exactly on the midpoint.
We are marginally generous, not harsh — a bias of +0.7 marks against the examiner average.

What This Means For You

AI marking is not a replacement for your professional judgement — it's a tool that handles the heavy lifting so you can focus on what matters.

Where the AI is strongest

Short-answer questions, calculations, retrieval tasks, and structured responses. In our tests on maths and English, the AI matched the examiner on roughly 85% of these questions and landed within a mark on 97%, which is reliable enough to use as a first pass and review by exception.

Where you should review

Extended writing. Not because the AI is unreliable — on double-marked essays it sits closer to the examiners than a second examiner does — but because essay marks are genuinely contested and carry the most weight for your students. Moderate a sample, as you would with any marker, and look at anything close to a grade boundary.
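One way to put that advice into practice is to build a review list: every essay near a grade boundary, plus a random sample of the rest for moderation. The sketch below is an assumption-laden illustration, not a Graded Pro feature: the function names, the grade boundaries, the 2-mark margin and the script IDs are all hypothetical.

```python
import random

def near_boundary(mark, boundaries, margin=2):
    """True if the mark is within `margin` marks of any grade boundary."""
    return any(abs(mark - b) <= margin for b in boundaries)

def review_list(marks, boundaries, sample_size, margin=2, seed=0):
    """Scripts to re-mark by hand: all boundary cases plus a random sample."""
    flagged = {s for s, m in marks.items() if near_boundary(m, boundaries, margin)}
    rest = sorted(set(marks) - flagged)
    # Seeded so the same moderation sample can be reproduced later.
    sampled = random.Random(seed).sample(rest, min(sample_size, len(rest)))
    return sorted(flagged), sorted(sampled)

# Hypothetical grade boundaries and AI marks out of 40.
boundaries = [12, 20, 28, 34]
marks = {
    "script_01": 27,
    "script_02": 23,
    "script_03": 35,
    "script_04": 16,
    "script_05": 8,
}

flagged, sampled = review_list(marks, boundaries, sample_size=1)
print(flagged)  # ['script_01', 'script_03']
print(sampled)  # one of the remaining three scripts
```

The margin is a judgement call. Given that two examiners differ by 4.6 marks on average on these essays, a wider margin will catch more borderline scripts at the cost of more re-marking.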

Not Just Exams

Our accuracy benchmarks are based on formal examination papers, but Graded Pro is built for everyday marking across all types of student work. The same AI that matches examiner standards on exam scripts delivers consistent, rubric-linked feedback on:

Homework — weekly assignments marked and returned the same day, with actionable next steps
Classwork and in-class tasks — quick, consistent feedback while the learning is still fresh
Termly tests and mock exams — full cohort marking with detailed breakdowns by question
Coursework drafts — formative feedback that helps students improve before final submission
Past paper practice — students get instant, exam-standard feedback on every attempt

Wherever there's a rubric or mark scheme, Graded Pro delivers accurate, detailed feedback — whether the stakes are high or the goal is simply helping students learn from their work.

Our Commitment

We continuously test and improve our marking accuracy. We don't claim perfection — no marker, human or AI, achieves that, and the data on this page shows just how far from perfect human marking is on the hardest questions.

What we promise is that we publish what we find, including when it is unflattering. We revised this page downwards in July 2026: our previously published kappa of 0.97 used a calculation that pooled 1-mark answers with 40-mark essays and flattered the result. The honest figure is around 0.85. We also previously warned that we under-marked the strongest essays; when we tested that against double-marked data rather than a single examiner, it did not hold, and we removed it.

Every time we change the underlying model, we re-test every paper. Where our sample is small, we say so. Where you can check our work yourself, we tell you where to find the data.

See For Yourself

Start Free Trial

No credit card required

Our Pricing Plans

Choose a plan that fits your needs and budget

Teacher Free

Try Graded Pro risk-free

150 Credits Free

50 FREE Credits / Month - Max 50

Full Graded Pro Toolkit

Email Support

Single User

Best Value

Teacher Pro

Best Choice for Busy Teachers

$25

6000 Credits

No Expiration Date

Access to all Features

Priority Email Support

Single User

School Account

Whole School or Departments

Custom Quote

Pooled Credits

No Expiration Date

All Features + User Dashboard

Dedicated Manager

Multiple Users

The number of credits needed to mark each student's work depends on the amount of text and images submitted, whether the work is handwritten or typed, and the level of feedback required.

Basic task

~1¢

~3 credits

In-depth task

~3¢

8–10 credits

Full exam marking

~10¢

~30 credits

Weekly grading for 125 students (e.g. 5 classes) requires around 4,500 credits per month.
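To estimate your own usage, the arithmetic can be sketched as below. The assumptions are ours for illustration: one in-depth task per student per week at 9 credits (the midpoint of the 8–10 range above), four marking weeks per month, and credits bought at the Teacher Pro rate of $25 per 6,000. The function names are hypothetical.

```python
def monthly_credits(students, credits_per_task, tasks_per_week=1, weeks_per_month=4):
    """Approximate credits used per month for regular marking."""
    return students * credits_per_task * tasks_per_week * weeks_per_month

def cost_in_dollars(credits, pack_price=25, pack_credits=6000):
    """Cost of a number of credits at the Teacher Pro pack rate."""
    return credits * pack_price / pack_credits

credits = monthly_credits(students=125, credits_per_task=9)
print(credits)                   # 4500
print(cost_in_dollars(credits))  # 18.75
```

Actual usage depends on the amount of text and images submitted, whether work is handwritten, and the level of feedback requested, so treat the result as a rough guide.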