
The Ofqual blog


Exploring the potential use of AI in marking

Categories: A levels and GCSEs, Other

On a fairly regular basis, we hear about some new possible application of Artificial Intelligence that could supplement or replace complex human judgement - the sort of judgement that can take humans a long time. Just last week, the results of a research study on AI in breast cancer screening were reported in the news. The study found that AI was as good as the NHS's current double-reading system of two doctors, and better than a single doctor at spotting cancer. The research was just that - research - and had not been used in an actual clinical setting. But it gives the sector an idea of the potential of this technology in this context.

At Ofqual, we want to understand whether there might be a role for AI in marking. As with reading x-rays, marking requires complex human judgement based on a complex set of information. Markers read student responses, and then interpret and evaluate them using a mark scheme to determine a mark for the student. While we know that the quality of marking in England in GCSEs and A levels is generally good and compares well internationally, we are interested in opportunities for improvement - not by replacing human judgement, but by using AI to support markers in the role they play.

We are particularly interested in whether using AI as a second marker, or as a way of monitoring marking, might help improve marking. Can AI be more effective at identifying inconsistent markers? Furthermore, can AI be more effective at spotting an erroneous mark from an otherwise good and consistent marker? Because marking can be a very demanding task, occasional mistakes are to some extent inevitable (which underlines the importance of the post-results reviews and appeals system).
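To make the idea of "identifying inconsistent markers" concrete, here is one very simple way it could be done - purely an illustration, not Ofqual's method: compare each marker's marks against a consensus mark for the same scripts, and flag any marker whose average deviation exceeds a threshold. All names, data structures and the threshold below are hypothetical.

```python
def flag_inconsistent_markers(marks_by_marker, consensus, threshold=1.5):
    """Flag markers whose mean absolute deviation from consensus marks
    exceeds a threshold (the threshold of 1.5 marks is illustrative).

    marks_by_marker: {marker_id: {script_id: mark}}
    consensus:       {script_id: agreed mark}
    """
    flagged = []
    for marker, marks in marks_by_marker.items():
        # Compare only scripts for which a consensus mark exists
        diffs = [abs(mark - consensus[s]) for s, mark in marks.items() if s in consensus]
        if diffs and sum(diffs) / len(diffs) > threshold:
            flagged.append(marker)
    return flagged
```

In practice a monitoring system would be far more sophisticated (it might weight questions differently, or track drift over time), but the underlying comparison - a marker's marks against a trusted reference - is the same.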

To help us understand the potential for AI, we are conducting some research which includes an 'AI competition'. Essentially, we will get several thousand student responses to an essay (a particular GCSE English language essay question), and these will be marked by human markers. All AI systems involve 'training' - in other words, the AI system is given a large number of examples with the 'right answer', or the best human judgement, which it then attempts to replicate. It is important that the training examples use the 'best' human judgement, because AI systems are only as good as the data put into them. Therefore, in our study we will use the most senior markers, and each essay will be marked multiple times to reduce the effect of marking error.
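The paragraph above describes each essay being marked multiple times by senior markers. One plausible way to turn those repeated markings into a single training label per essay is simply to average them - a sketch under that assumption (the data shape and function name are illustrative; a median, or a resolved mark agreed after discussion, would be equally reasonable choices):

```python
from statistics import mean

def training_labels(repeated_marks):
    """Collapse several senior markers' marks per essay into one
    training label by averaging.

    repeated_marks: {essay_id: [mark, mark, ...]}
    """
    return {essay_id: mean(marks) for essay_id, marks in repeated_marks.items()}
```

Averaging several independent expert marks tends to cancel out individual slips, which is the point of marking each essay more than once before training on it.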

These responses will then be used for a competition so that many individuals and organisations with expertise in AI can train an AI system to mark similarly to the training set.  We can then test these AI systems on another set of essays (for which we know the marks, but the AI systems do not).  We very much hope this competition will help stimulate and identify the very best practice in this field.

The results from this competition will help us undertake further subsequent research work – for example, modelling the impact of AI as a second marker or as a marker monitoring system.

It is early days in terms of looking at AI in marking, but it is important to take some first steps on this in England by beginning this exploratory research. If there are genuine potential improvements - ways of enhancing marking quality - we of course want to know about them, so we can encourage the system to adopt such practices safely. Similarly, we want to have a deep understanding of the potential risks of operating such technology in our high stakes examinations.

Right now, we are recruiting schools to provide students' essay responses for mock tests to a particular past question - we will give schools examiner marks and annotations for these essays (which should be very useful to schools for teaching and learning) in return for being able to use the responses in the research and AI competition. Altogether, we are hoping for around 3,000 student responses. If any schools are interested in taking part, please email us at to express an interest and for more information.

This new project is exciting. Any future use of AI is likely to take some time and a lot of testing. We are not going to suddenly see AI being used at scale in marking high profile qualifications overnight. But we hope at the very least this work will help the sector to think about other ways and innovations to improve marking quality, and the delivery of qualifications more broadly, to ensure each learner's work gets the mark it deserves.
