Is this the future of assessment?

Calls for exams to be scrapped have grown louder in the wake of the Covid-19 pandemic. But rather than getting rid of exams, we may just need to reimagine them, says Daisy Christodoulou
14th December 2022, 5:00am

In England, a war is brewing against exams. 

Earlier this year, several former education ministers told the Times Education Commission that GCSEs had had their day. Lord Baker, the education minister who introduced GCSEs, said the exams were “unnecessary”, while Lord Blunkett said they were “pretty well dead”. David Miliband, meanwhile, called them a “relic”, suggesting they be replaced by modular coursework. 

They are not alone in their concerns. In August, a report from the Tony Blair Institute called for policymakers to scrap GCSEs on the grounds that “summative, closed-book exams at the end of five years of secondary learning are, on their own, a very poor way of measuring talent”.

The disruption caused by Covid is the driving force behind many of these renewed calls to end exams. Peter Hyman, the co-founder of Rethinking Assessment, has said we shouldn’t “revert to flawed pre-Covid exams” but instead use this opportunity for “radical change”. Similarly, the EDSK education think tank says that we should use the “unprecedented event” of the pandemic as “a rare opportunity to consider how we can do things better in future”. It recommends abolishing high-stakes GCSEs and replacing them with low-stakes online assessment.

But are exams the problem they are made out to be? And has Covid really shown them to be a “relic”? After all, the huge disruption to the system in 2020 and 2021 was not caused by exams but by their absence. It’s striking that in 2022, when exams returned, there were far fewer controversies. In fact, I’d argue that the lesson of the past three years is that we need exams because it’s exceptionally difficult to allocate grades fairly without them.

Many non-exam assessment techniques introduce grey areas and ambiguities into the assessment process that are vulnerable to being exploited. Teacher assessment is subject to human biases. Coursework and continuous assessment lead to essay mills and unfair help from parents. In 2020 and 2021, some evidence suggested that independent schools had inflated their teacher-assessed grades, backing up large-scale studies from researchers such as Simon Burgess and Tammy Campbell that show poorer students do better on exams than on teacher assessment.

Of course, none of this means that our current assessment system is perfect. There are plenty of ways we could improve it, and this article will outline some of them. But it starts from the premise that we should keep exams, and that we should be looking to reform them, not abolish them.

Yet any big reforms should not take place immediately. As Sam Freedman explained in Tes a few months ago, now is not the time for huge upheaval. The Covid disruptions happened not long after a major exam reform that is still bedding down. Teachers are still mastering relatively new syllabuses, and innovations such as the National Reference Test need more time to show their worth. 

Still, given the time it takes to develop and deploy new approaches, and conduct the research that will be needed to inform them, it is worth thinking now about how we could improve the system. 

Here are three promising areas to look at.

On-screen assessment

For people working outside of schools, on-screen assessment might feel like a no-brainer. We increasingly live our lives and do our work through a screen, so why can’t we do exams that way, too?

In fact, some exam boards are already trialling on-screen assessment. It is clear that they will have to solve difficult technical and logistical challenges if they are to make this work at scale, but there are also important pedagogical implications that we need to think about. 

First, we need to be more aware of “mode effects”. This is the term for the way that students respond differently to the same question depending on the mode it’s presented in. You might think that such effects would be fairly trivial but they aren’t. Copying over an exam question from paper to a screen can make a huge difference. 

One of the most striking findings in the mode effects literature is from the researcher John Jerrim. He analysed the results of more than 3,000 students in Germany, Ireland and Sweden, who had taken the 2015 Programme for International Student Assessment tests in reading, maths and science. The students were randomised into two groups. One group took the test on paper; the other took it on a computer. The paper-based group scored a full 20 scaled-score points higher than the computer-based group. That is the equivalent of about six months of additional schooling - a big difference, and one that even Jerrim was surprised by.

Of course, you could argue that if everyone is to do exams on screen, mode effects don’t matter too much. Perhaps students will do worse on screen than on paper, but it’s the same for everyone. 

However, this brings us to the second important point, which is that assessment has a significant impact on classroom practice through a process known as “backwash”. If exams go on screen, it is possible that many more students will start using laptops and devices in class, and will spend less time using a pen and paper.   

Would this be a good thing? Possibly not. One plausible reason why students do worse in on-screen exams is that pen and paper are useful thinking aids. In fact, research has shown that when we read a text on screen, we scan and scroll far more than when we read it on paper, meaning that we spend less time reading in an in-depth, concentrated way. 

When you translate the same exam from paper to screen, it is no longer exactly the same exam

Similarly, when we handwrite notes, we abbreviate and summarise far more than when we type, and this helps us to remember more of our notes. It’s striking that many adults still take paper notes in meetings, despite the ubiquity of devices in the workplace. And for younger children, there’s evidence that learning to write letters by hand makes it easier for them to recognise letters, helping to improve their reading attainment. 

The key finding from this research is that when you translate the same exam from paper to screen, it is no longer exactly the same exam. Students will engage with such exams differently, and if their classroom experience ends up involving more screens, too, that will change the nature of the assessment further. 

So, does this mean we can forget the idea of on-screen assessments? Well, in what I have written so far, I’ve assumed that we would create on-screen assessments that mimic our current paper assessments, an approach that is often nicknamed “paper behind glass”. But of course, that’s not the only option.  

Instead of translating paper exams into on-screen exams, we could create assessments that simply couldn’t exist on paper. For example, we could create interactive question types that blend video, audio and text and that require students to synthesise information from many different sources. We could also make more use of adaptive assessment - where students see different questions depending on whether they answered the preceding question correctly.
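To make the adaptive idea concrete, here is a minimal sketch of how an adaptive selector might work. The question bank, difficulty values and ability-update rule are all invented for illustration; real adaptive tests rely on far more sophisticated psychometric models and calibrated item banks.

```python
from dataclasses import dataclass

# Hypothetical question bank: each item has a difficulty on an arbitrary scale.
@dataclass
class Question:
    prompt: str
    difficulty: float  # higher = harder

BANK = [
    Question("Solve 3x + 4 = 19", difficulty=-1.0),
    Question("Factorise x^2 - 5x + 6", difficulty=0.0),
    Question("Solve x^2 = 2x + 8", difficulty=0.5),
    Question("Prove that sqrt(2) is irrational", difficulty=1.5),
]

def run_adaptive_test(answer_correctly, n_questions=3):
    """Very simplified adaptive loop: start from an average ability estimate,
    ask the unused question whose difficulty is closest to that estimate,
    then nudge the estimate up or down depending on the response."""
    ability = 0.0
    step = 0.5
    unused = list(BANK)
    for _ in range(min(n_questions, len(unused))):
        question = min(unused, key=lambda q: abs(q.difficulty - ability))
        unused.remove(question)
        correct = answer_correctly(question)
        ability += step if correct else -step
    return ability

# Example: a simulated student who answers correctly whenever the item is
# easier than their "true" ability of 0.6.
estimate = run_adaptive_test(lambda q: q.difficulty < 0.6)
print(f"Final ability estimate: {estimate:.1f}")
```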

These forms of assessment have promise but their greater ambition also involves greater risks. New question types and adaptive assessments would both require extensive trialling and development - and formulating a cheat-proof adaptive assessment bank is expensive. We would also have to consider the backwash: how would schools, teachers and students change their behaviour in response to these new kinds of tests, and would those changes be a good thing? 

Our current system of paper-based assessment does seem antiquated and there are good reasons to move assessment on screen. But there are also good reasons to be cautious. From a pedagogical perspective, we need more research on the likely effects that on-screen assessment will have on classroom practice, and more guidance on the value of using pen and paper. 

Abolishing grades

Grades are the currency of our current education system. How many times do we hear someone talk about a “straight-A” student, or a “typical grade 4” essay? Yet for all their everyday use, grades can be very distorting. Student attainment doesn’t come neatly pre-packaged in letters, numbers or any other kind of subdivision. Student attainment is continuous, rather like age or height. 

This might seem like a trivial point but it has important implications that go beyond assessment and affect classroom practice. 

Grades give all the students in a certain range the same label, and therefore imply that all those students have something in common: they are “grade 4” students. But because grades are just lines on a continuous distribution, a student at the top of one grade has more in common with a student at the bottom of the next grade than with a student at the other end of their own grade.

This is a particular problem in systems where there are relatively few grades, or where one grade is particularly big. For example, at primary, where there are only three “grades” - working towards the expected standard, working at the expected standard and working at greater depth - the middle grade includes children whose attainment can be up to three years apart. This means that students who just scrape into the “expected standard” grade are performing at a very different standard from those who are right at the top of that grade and have just missed out on “greater depth”. 

Grades are distorting. They do not deserve the certainty we give them

Not only that but there are many different ways to achieve the same grade. A student who gets a grade 4 in maths may be very good at data handling and very weak at algebra, or vice versa. A student with a grade 5 in English language could be a weak writer and a strong reader, or vice versa. But grades and grade descriptors make it seem as though all the students in that category have met the same standards and have the same understanding of a topic. 

Finally, grades make it seem as though a student who moves from one grade to another has made significant progress when, in fact, they may not have done: the improvement could have more to do with measurement error. 

So, if grades aren’t reliable, what could we use instead?

One option is to report scaled scores, which already sit underneath most grading systems. A student might get a score of 1,000 instead of a grade 4, or a 1,500 instead of a grade 9. The government, universities and employers could decide which scaled-score thresholds they would require for certain courses or jobs. 

If we were also to report the confidence intervals around those scaled scores, it would help to support public understanding of assessment. Currently, confidence intervals - a range around a given measurement that indicates how precise that measurement is - are not well understood. But, by reporting these alongside scaled scores, it would become more apparent that all assessment results are subject to a degree of uncertainty.
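As a rough illustration of how a reported score might look with its uncertainty attached, here is a short sketch. The scaled score, the standard error of measurement and the assumption of roughly normal measurement error are all invented for the example, not drawn from any real qualification.

```python
# Illustrative only: the score and the standard error of measurement (SEM)
# below are made up, not taken from any real exam.
def score_with_interval(scaled_score, sem, z=1.96):
    """Return an approximate 95% confidence interval around a scaled score,
    assuming roughly normally distributed measurement error."""
    low = scaled_score - z * sem
    high = scaled_score + z * sem
    return low, high

low, high = score_with_interval(scaled_score=1000, sem=25)
print(f"Reported score: 1000 (95% interval roughly {low:.0f} to {high:.0f})")
```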

Grades are distorting, but they distort in ways that we have become so familiar with that they feel real. They do not deserve the certainty we give them, but this certainty gives people faith in the system. 

Artificial intelligence

Exams in England feature a lot of extended written answers that are hard to mark consistently and quickly. Research from Ofqual in 2016 suggests that on a typical 40-mark essay question, markers might disagree by as much as five marks either way. That means an essay that deserves 20 marks out of 40 might actually get anywhere from 15 to 25 marks.

We often think that such problems can be solved by some combination of hiring better markers and giving them more training and time. But this ignores the fact that marking essays is an intrinsically difficult task.

So, instead of trying to improve the current system, we might be better off thinking creatively and seeing if there are other ways we could improve reliability.

One possibility is to use artificial intelligence to mark essays. The great advantage of artificial intelligence is that it will always be more consistent than humans. If you ask a human to mark the same essay twice at different times, they will often disagree with themselves. But if an AI model is asked to assess the same essay twice at different times, it will always give it the same mark. 

However, that doesn’t mean it will give the essay the right mark: it could just be consistently wrong. In a 2001 paper, some researchers were challenged to see if they could fool an AI marking system. The most successful cheat was an essay consisting of the same paragraph repeated 37 times. 

An AI model will always give an essay the same mark. That doesn’t mean it will be the right mark

Of course, AI has moved on a tremendous amount since 2001, and modern machine learning models are capable of great sophistication. Some of these models were explored in a recent Tes feature, “Will a machine soon be doing your marking?” And just this month, the new ChatGPT AI model was released, astonishing people with its ability to create original and human-sounding texts.

But these sophisticated models are also famously opaque. No one knows why they give the results they do, and the risk is that we’d direct teachers’ and pupils’ efforts into “gaming” an artificial intelligence marking model that no human truly understands. 

Given the current pace of AI development, we would be foolish to write AI off completely, though. This technology certainly deserves further research but we also need to be realistic about the risks and trade-offs involved. 

Another possible solution is to use comparative judgement, which is a different way of assessing writing. Unlike AI, it doesn’t seek to remove the human assessor but rather to find a way to make human assessment more accurate. It relies on the fundamental psychological principle that humans are better at making comparative judgements than absolute ones. So, instead of looking at one essay at a time and marking it against a mark scheme, the marker looks at two essays on screen and decides which is better. Thousands of markers make hundreds of thousands of decisions, which are combined to create a measurement scale for every essay.
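How do thousands of “which is better?” decisions become a scale? Comparative judgement systems typically fit a statistical model such as Bradley-Terry to the pairwise outcomes. The sketch below does this with a few lines of gradient ascent on made-up judgement data; it illustrates the general technique, not No More Marking’s actual implementation.

```python
import math
from collections import defaultdict

# Made-up pairwise decisions: (winner, loser) for each judgement.
judgements = [
    ("essay_A", "essay_B"),
    ("essay_A", "essay_C"),
    ("essay_B", "essay_C"),
    ("essay_A", "essay_B"),
    ("essay_C", "essay_B"),
]

def fit_bradley_terry(pairs, iterations=500, lr=0.1):
    """Estimate a quality score for each essay from win/loss pairs by
    gradient ascent on the Bradley-Terry log-likelihood."""
    scores = defaultdict(float)  # every essay starts at 0
    for _ in range(iterations):
        grads = defaultdict(float)
        for winner, loser in pairs:
            # Probability the winner beats the loser under the current scores.
            p_win = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            grads[winner] += 1.0 - p_win
            grads[loser] -= 1.0 - p_win
        for essay, g in grads.items():
            scores[essay] += lr * g
    return dict(scores)

# Print essays from strongest to weakest under the fitted scale.
for essay, score in sorted(fit_bradley_terry(judgements).items(),
                           key=lambda kv: -kv[1]):
    print(f"{essay}: {score:.2f}")
```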

At No More Marking, where I work, we have used comparative judgement to assess nearly a million pieces of writing in the past five years, proving that it can work for large-scale national assessments. However, all our assessments have been low stakes, and further trialling would be needed to see if comparative judgement could work for high-stakes exams. 

Ultimately, then, where does all this leave us? There are no instant improvements to our assessment system. But there are a number of promising areas of research, which could help deliver incremental improvement over the next decade or so. Perhaps we’d be better off focusing our attention on these than on waging all-out war against exams.

Daisy Christodoulou is director of education at No More Marking
