An enormous amount of work and research has been conducted on marking in recent years, including many contributions by Ofqual to ensure evidence-based regulation. Here is a summary of what we know, and some things we’re looking to find out more about.
- Most marking for GCSEs and A levels is conducted online, using specialist marking software. Such software has a range of features that support quality assurance, including objective ways of monitoring marking quality such as seeds and sample double marking (which we talked about in a previous blog). It also gives better visibility of marking progress. When everything was marked on paper, exam boards often could not tell whether markers were on track to meet marking deadlines until quite late. In an online system, marking progress is continually monitored and exam boards have the opportunity to intervene early to address issues of speed or quality. This helps with the complex exercise of getting around 15 million scripts marked in a 3-month period.
- Most markers are teachers, and most are very experienced in both teaching and marking. The most common reason they start marking is to develop their understanding of how assessment works so they can better prepare their students. The vast majority of markers report positively on the quality of their training and the marking process and speak of being ‘proud’ of their marking role. We will be saying more about examiners in a subsequent blog.
- Marking agreement in England is very similar to that seen elsewhere in the world. We have compared marker agreement in 7 subjects with the research literature from around the world, and the marking in England ‘stacks up’ well against these comparators (and also looks very stable over time).
- Different question types have different levels of agreement between markers. Questions which have the best levels of agreement between markers are ‘objective questions’ such as multiple choice questions. Those questions requiring much longer responses generally have lower levels of marker agreement. This is not surprising. We can easily imagine that markers’ judgement might vary about the extent to which an essay demonstrated ‘clear knowledge’ rather than ‘detailed knowledge’ of the effects of an earthquake, say. Even after a lot of training of very experienced and knowledgeable markers we might still expect some differences of opinion rather than total agreement on the mark awarded.
- We talk about ‘definitive marks’, but there are often other plausible marks that are equally correct. We can probably imagine a situation where an essay is ‘on the cusp’ between showing ‘clear knowledge’ and ‘detailed knowledge’. If some examiners gave a mark of 7 and some a mark of 8, it would be hard to argue that, say, the mark of 8 represents ‘error’, and the mark of 7 is the only possible or legitimate mark. However, this is not to say that any mark is plausible or legitimate for an essay response. Exam boards and their markers should work hard to narrow the range of plausible marks, through good mark schemes and good examiner training.
- Marker agreement in different subjects reflects the different types of questions in the papers. In Ofqual reports (see figure 12), the rank order of subjects by marker agreement is essentially a reflection of the degree to which the assessment takes place through essay-style questions. Those subjects with few or no essay questions tend to have better marker agreement.
- Reporting scaled scores instead of grades may or may not help express the likely range of ‘legitimate marks’. Marks do not have to be converted into grades. We could adopt a different system for reporting qualification outcomes on results slips or certificates. Results could be expressed as, for example, scaled scores; or scaled scores with some ‘confidence interval’ to indicate other possible alternative scaled scores for each student. In other words, instead of Grade B, the student’s result might be expressed as, say, ‘173 [+/- 4]’. Some assessments around the world choose these reporting methods. Potential advantages are that the reported information is more granular and that it reminds end-users of the ‘uncertainty’. It might therefore make it easier to interpret how ‘close’ two different students are in the ‘rank order’ if, for example, they are both competing for the same university course or job. And it might help prevent some of the ‘cliff-edges’ created by dividing a mark scale into larger grade categories.
Any such change would take some time to ‘bed in’, so that students, teachers, parents, universities and other end-users of qualifications understand the meaning or currency of results. And ‘cliff-edges’ would likely persist, as end-users are likely to want to draw their own lines on any reported scale.
Ofqual did quite a bit of thinking around this ahead of GCSE reform in 2013, and indeed consulted on this. We found that 84% of respondents supported maintaining reporting of grades at that point in time. However, it is something we remain open-minded about and welcome continuing the debate.
- There is a system to correct marks/grades which are wrong. Where a student believes that the mark or grade is incorrect, they can request a Review of Marking or Moderation. Exam boards should not change one legitimate mark for another legitimate mark, but they must address incorrect marks. And if the student still believes the issue has not been remedied, with the help of the school or college, it is possible to make an Appeal on the grounds of marking error.
- When exam boards are not doing enough to deliver good marking and the right results, we will act to remedy this. In the last 18 months, exam boards have made a number of undertakings where we have identified breaches of our rules in regard to marking and reviews of marking.
- One way to improve marker agreement would be to just have multiple choice questions…but this isn’t necessarily what the system wants or needs. There are many decisions made when developing assessments. While multiple choice (and other objective) questions can assess many aspects of learning – including analysis, problem solving, evaluation as well as factual recall – it is less clear that they can genuinely test some other important skills such as the ability to construct an argument or show detailed analysis of an issue. On this basis, there was a clear policy steer to have extended responses to promote meaningful teaching and learning experiences in the classroom.
- Reliability is not just about marking. Reliability is the idea that a student would get the same result in different (hypothetical) circumstances. For marking, we might wonder ‘would this student have received the same mark/grade if they had had a different marker?’ There will always be some natural variability, which, as discussed above, will represent ‘legitimate’ mark differences. But what about other (hypothetical) circumstances? We might also ask ‘would this student have received the same mark/grade had they taken the exam on a different day?’, for example when the temperature was cooler, or they were in a better mood, or they had had a better or worse night’s sleep. This is bound to cause some variability. Furthermore, we might also ask ‘would this student have received the same mark/grade if the exam had had a different selection of topics/questions also allowable by the specification?’ What if the focus was more on ‘map reading’ than ‘volcanoes’, or less on Lady Macbeth and more on Macduff or Banquo? Again, the sampling of topics and sub-topics is likely to create some variation in marks (and grades) for some individual students. So, there are different factors affecting reliability, of which marking is just one.
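To make the scaled-score idea from the list above a little more concrete, here is a minimal Python sketch of a result reported as a score with a margin, such as ‘173 [+/- 4]’, and a check of whether two students' reported ranges overlap. The class name `ReportedResult`, the function `ranges_overlap` and the second and third example scores are all hypothetical, chosen only for illustration. The sketch does not say how a margin would be estimated in practice, and it is not a description of any reporting system that Ofqual or an exam board uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedResult:
    """A scaled score reported with a symmetric margin (hypothetical format)."""
    score: int   # the reported scaled score, e.g. 173
    margin: int  # half-width of the reported range, e.g. 4

    @property
    def low(self) -> int:
        return self.score - self.margin

    @property
    def high(self) -> int:
        return self.score + self.margin

    def __str__(self) -> str:
        return f"{self.score} [+/- {self.margin}]"


def ranges_overlap(a: ReportedResult, b: ReportedResult) -> bool:
    """Return True if the two reported ranges share at least one score."""
    return a.low <= b.high and b.low <= a.high


# 173 [+/- 4] is the example from the text; the other scores are made up.
student_a = ReportedResult(score=173, margin=4)
student_b = ReportedResult(score=178, margin=4)
student_c = ReportedResult(score=182, margin=4)

print(student_a)                             # 173 [+/- 4]
print(ranges_overlap(student_a, student_b))  # True: 169-177 and 174-182 overlap
print(ranges_overlap(student_a, student_c))  # False: 169-177 and 178-186 do not
```

An end-user comparing two students could read overlapping ranges as ‘too close to call’ on this assessment alone. As noted above, though, users may still draw their own lines on any reported scale.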
Two things we don’t know, but we are planning on finding out.
- Which has better consistency – marking or moderation? So far, we have only talked about marking in the context of examination scripts. But of course, non-examined assessment (NEA) is also marked – by teachers – and then moderated by exam boards. We are currently conducting a series of studies to help us understand how moderation looks in relation to marking in terms of consistency. In some ways, we might expect moderation to be harder or less consistent than marking because some NEAs are quite substantial and students can produce a very large range of admissible work (think about musical compositions or extended essays on literary works of art).
- In an essay-based examined subject, just how good could marking actually ever be? Our research suggests that in maths the average probability of receiving the ‘definitive grade’ is around 0.96. This means that, on average, the probability of getting the same grade as that which would have been given by the Principal Examiner is 96% (though, as above, other marks may well be entirely legitimate). For history, English language and English literature, the average figures are around 0.55 to 0.6. As discussed above, this is because of the nature of the assessment style. [And just to note, especially where a mark is just above or below a grade boundary, a mark on the other side might also be a legitimate mark, as well as a legitimate grade.]
The question is, then, could history or English literature, say, ever be ‘as good as’ mathematics for marker agreement? In the next year we are planning to conduct a study on an essay-based subject, using ‘state-of-the-art’ techniques for training, standardising and monitoring markers to see just how good it can be. As well as knowing how much better it might be, we will also know more about potential costs to the system. This will help us have an evidence-based discussion with stakeholders around the alignment between the realistic precision of particular assessments and the uses to which their grades are put. While marking can be improved, particularly in hard-to-mark subjects, it is unlikely ever to be perfect. So, it will continue to be important to work with others, who use examination grades to make decisions about young people’s futures, to make sure they understand how to use this information and make fair decisions.
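For readers who like to see the arithmetic, the Python sketch below shows one simple way to think about ‘agreement with the definitive grade’: convert each marker's mark and the corresponding definitive mark to grades, then count how often the grades match. The grade boundaries, the marks and the function names are all hypothetical, and this simple proportion is only an illustration; it is not the method used in Ofqual's research.

```python
from bisect import bisect_right

# Hypothetical grade boundaries: (minimum mark, grade). Not real boundaries.
BOUNDARIES = [(0, "U"), (20, "D"), (30, "C"), (40, "B"), (50, "A")]
_MINIMUM_MARKS = [minimum for minimum, _ in BOUNDARIES]


def grade_for(mark: int) -> str:
    """Return the grade whose boundary is the highest one at or below the mark."""
    if mark < 0:
        raise ValueError("mark must not be negative")
    return BOUNDARIES[bisect_right(_MINIMUM_MARKS, mark) - 1][1]


def definitive_grade_agreement(marker_marks, definitive_marks) -> float:
    """Proportion of scripts where the marker's grade equals the definitive grade."""
    if not marker_marks or len(marker_marks) != len(definitive_marks):
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(
        grade_for(marker) == grade_for(definitive)
        for marker, definitive in zip(marker_marks, definitive_marks)
    )
    return matches / len(marker_marks)


# Made-up marks for four scripts. Each marker mark is within a few marks of
# the definitive mark, but the first pair straddles the 40-mark boundary.
marker_marks = [39, 45, 52, 28]
definitive_marks = [40, 44, 55, 27]

print(definitive_grade_agreement(marker_marks, definitive_marks))  # 0.75
```

In this made-up example, a one-mark difference (39 against 40) changes the grade while a three-mark difference (52 against 55) does not. This is the point made above about marks that sit just above or below a grade boundary.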
If you would like to talk to Ofqual about any of the issues raised in this blog, please contact us at firstname.lastname@example.org.
By Beth Black, Director of Research and Analysis, Ofqual