Why a robot will never be able to do your marking

Instead of investing in AI, let’s refine our assessment skills – and make marking less of a chore, says Yvonne Williams

10th August 2018, 10:28am

Artificial intelligence is set to take over another teaching function: this time, the complex marking of essays in the US higher education sector.

The tie-in with relieving the marking burden is a clever marketing ploy based on the assumptions that:

All marking is a chore;
Marking is a low-level task;
Its only function is to give a grade;
A mere 10 scripts will be sufficient to set up a marking algorithm;
It is easy to follow the mindset of a teacher/examiner;
Therefore all marking simply boils down to a formula, which is more or less complicated;
Teachers can use the time saved by this service to perform other functions in teaching;
In producing more data, the service is providing added value.

Youch! No wonder A-level and GCSE examiners are so poorly remunerated.

I have to wonder what kind of assessments the AI tool will be addressing. Will this technological development lead to more easily assessable assignments, so that the marking can be done mechanically? And will it be to the detriment of sophisticated teaching and learning?

As a former member of the Department for Education working party that produced the report on reducing marking workload, I applaud anything that will cut the excesses of current triple and deep marking in many colours. The ridiculous complications have done much to damage the proper standing of assessment in the curriculum. Claims that computers can mark higher-level essays after just 10 examples devalue it further.

Accurate marking is a valuable skill

Proper formative and summative marking is a valuable skill, which takes years to refine.

Over time there have been numerous attempts to replicate the intelligence that is needed to drive excellent evaluation of pupils’ work. For example, the mark schemes for all GCSE and A-level specifications become more over-specific with each curriculum change. Do the curriculum experts do this to accommodate what they perceive as less-skilled teachers? The number of pages of criteria for each band rises but, ironically, accuracy doesn’t necessarily improve in line with the expansion. Research by Massey and Raikes in 2006 (you can find it in this NFER report) compared the detailed mark scheme for economics A-level essays with the less detailed one for sociology. Their conclusions included a tentative and revealing suggestion that more detailed mark schemes were less likely to produce accurate marking because the excess “challenged the examiner’s working memory and inhibited the development of an accurate representation of the text”.

Each essay is a challenge

Numerous studies into ways of improving summative marking still haven’t found the solution to the accuracy and consistency problem. It’s an issue that has obsessed me for years, as my hapless colleagues and family will attest. I have read every study that I can lay hands on.

Last summer was taken up with research into what was required for English A-level essays to be accurately assessed. My husband, himself a senior examiner with many years’ experience and thousands of scripts behind him, and I debated the issues from a number of perspectives.

Mark schemes and instructions for examiner standardisation routinely warn of the importance of dealing with “atypical” answers and approaches. My husband’s view is that every script encountered by an examiner at A level is unusual or atypical, so various are the approaches, whether at school level or national level. Each candidate is different, and the “outcome space” is huge: the number of possible responses to the 40-mark essay can be as numerous as grains of sand on a beach.

Ofqual’s symposium in 2016 on marking consistency showed that even six-mark questions posed real problems for markers.

The 2012 study for the Headmasters’ and Headmistresses’ Conference (HMC), “England’s ‘examinations industry’: deterioration and decay”, looking at the shortcomings of the exam system, showed all too clearly the frustrations that schools express when their students’ deeper knowledge has not been rewarded. Candidates’ arguments can be excellent, even outstanding, and yet not fit the criteria completely. Somehow an examiner has to reconcile high-quality thinking with the rather less enticing mechanics of the mark scheme. Are we really serious about using AI in this context?

Excellent markers have wide-ranging and in-depth knowledge of the texts, critical views, literary concepts and contexts that lie at the heart of the specification and the question. They can recognise the skill and reward the moments of enlightenment that spring from the corners of the texts that study guides and mass-produced lesson plans cannot reach.

Such skill takes years of marking across age groups and abilities to refine into a nuanced model. Each experience provides valuable further modification.

Thus the assertion that AI can easily replicate such complex thought processes within 10 scripts seems very far-fetched for the English and humanities contingent. We would do rather better to invest more time and effort in refining the natural intelligence of our colleagues in the more complex skills of assessment.

How do we find the time? We:

Cut the affectations of triple marking and coloured comments;
Reduce written comments and increase verbal feedback;
Provide time for mentoring of all members of our departments;
Make time for regular collaborative meetings to share and develop expertise;
Worry less about the volume of data drops and more about the quality of what goes into producing the data in the first place.

The sheer cost of AI marking - an inevitable consequence of the research and marketing - will, in any case, be beyond schools’ shrinking budgets. It’s far more cost-effective to strip away the embarrassing accoutrements of the accountability system and to invest time in teacher development. We need to widen our communities of learning to share experience more widely. The ultimate benefit will be better formative and summative assessment alongside greater job satisfaction and the reduction of marking “chores”.

Yvonne Williams is a head of English and drama in the south of England, and co-author of “How accurate can A-level English literature marking be?” in the November 2017 edition of English in Education, the journal of the National Association for the Teaching of English