As the Highers begin, Richard Burton and David Miller raise some disturbing questions about multiple choice
It is a commonplace that some examination candidates achieve unexpectedly good or bad results and this undoubtedly applies to the Higher examinations. Expectations may be false, of course, and, in principle if not in fact, papers could be badly constructed or wrongly graded. However, what concerns us are two ways in which grades may be inappropriate, simply through chance.
First - as is obvious to most examinees, but sometimes forgotten by test setters - each candidate may be "lucky" or "unlucky" in the questions asked, depending on whether the topics covered match those revised. This problem is reduced when the questions are many and varied. This is one good reason for using multiple choice and short answer questions as opposed, say, to essays (though ease and consistency of marking is no small consideration).
Even so, papers would need to be unreasonably long to render this chance element negligible. This brings us to the main point: as currently applied, the multiple choice components introduce a second, important chance element. In recent Higher biology and chemistry papers, there were 30 multiple choice questions. Each offers four possible answers from which one must be selected. Since the test cannot work fairly if some questions are left unanswered, candidates are advised to attempt them all.
Consider first a candidate who knows nothing at all of the subject (one sitting the wrong paper perhaps?). Through blind guessing it might be expected that about a quarter of the answers would be right. However, the laws of chance decree a wide range of scores, with only a minority scoring seven or eight. Indeed, about one in 20 might be expected to score 12-14 out of 30 - and that would seem a fairly creditable score. About one in 10 "blind guessers" is expected to score four or less, which might seem "unfair" in the other direction.
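These blind-guessing figures follow from the binomial distribution with 30 questions and a one-in-four chance per question. A short sketch in Python (standard library only) reproduces them:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 30, 0.25  # 30 questions, four options each

# Chance of a blind guesser scoring 12-14 out of 30
p_12_to_14 = sum(binom_pmf(k, n, p) for k in range(12, 15))
# Chance of scoring four or less
p_4_or_less = sum(binom_pmf(k, n, p) for k in range(0, 5))

print(f"P(12-14): {p_12_to_14:.3f}")  # 0.048, about one in 20
print(f"P(<=4):   {p_4_or_less:.3f}")  # 0.098, about one in 10
```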
To offset this bleak view of test reliability, but only temporarily, we should note that the best informed candidates have little need to guess, so that their scores will be more reliable. Thus, 84.4 per cent of candidates knowing 27 correct answers should score 27 or 28 out of 30, with only 1.6 per cent getting full marks.
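The figures for a well-informed candidate can be checked the same way: knowing 27 answers leaves only three questions to guess, so the lucky extras follow a binomial distribution on three trials.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

guessed, p = 3, 0.25  # 27 answers known, 3 guessed, four options each

# Score 27 or 28 means zero or one lucky guess out of the remaining three
p_27_or_28 = binom_pmf(0, guessed, p) + binom_pmf(1, guessed, p)
p_full = binom_pmf(3, guessed, p)  # all three guesses correct

print(f"{p_27_or_28:.1%}")  # 84.4%
print(f"{p_full:.1%}")      # 1.6%
```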
More interesting is the intermediate range of scores. Let us compare just two groups of examinees, one knowing the answers to 12 questions (40 per cent of the 30 marks) and the other knowing the answers to 18 (60 per cent of the 30).
How well should the test be expected to differentiate between these distinctly different levels of knowledge (which may well straddle grade boundaries)? Assuming completely random guessing whenever the correct answer is unknown, a score of 19 should be expected for 8.2 per cent of the more ignorant group and for 12.7 per cent of the other. Adding to this overlap of scores, 5.6 per cent of the former group should score 20-22 and 3.2 per cent of more informed candidates should score as few as 18, their minimum mark. Altogether, the scores of 28.3 per cent of the less informed students will overlap those of the other group, an extremely poor resolution.
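The overlap between the two groups can be reproduced with the same binomial machinery: the group knowing 12 answers guesses 18 questions, the group knowing 18 guesses 12, each guess succeeding one time in four.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p = 0.25  # four options per question

# Group A knows 12 answers and guesses 18; group B knows 18 and guesses 12.
p_a_19 = binom_pmf(7, 18, p)  # A scores 19 via 7 lucky guesses: ~8.2%
p_b_19 = binom_pmf(1, 12, p)  # B scores 19 via 1 lucky guess:  ~12.7%
p_b_18 = binom_pmf(0, 12, p)  # B's minimum score of 18:        ~3.2%

# Overlap: A reaching 18 or more needs six or more lucky guesses of 18
p_overlap = sum(binom_pmf(k, 18, p) for k in range(6, 19))
print(f"{p_overlap:.1%}")  # 28.3%
```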
It is sometimes asserted that little blind guessing occurs, despite published evidence to the contrary. What must be commonplace is for an examinee to be able to rule out one or two of the three incorrect answers.
Whenever this happens, the odds on a lucky guess are increased. That this leads to higher average scores is appropriate so long as it is considered creditworthy just to recognise a wrong answer.
Unfortunately, test score reliability is further reduced by the higher probability of guessing correctly. Suppose, for example, that 100 candidates happen to know no correct answer, but can each rule out just one of the wrong options for every question. Every guess then has a one in three chance of success. The commonest score should rise to 10 out of 30, while about four of the 100 candidates, knowing no correct answers at all, should score 15-17, half marks.
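Raising the per-question success rate to one in three shifts the whole distribution upwards, as a quick check confirms:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 30, 1 / 3  # one wrong option eliminated on every question

scores = {k: binom_pmf(k, n, p) for k in range(n + 1)}
mode = max(scores, key=scores.get)  # most likely score
p_15_to_17 = sum(scores[k] for k in range(15, 18))

print(mode)                 # 10
print(f"{p_15_to_17:.3f}")  # 0.041, about four candidates in 100
```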
In the Higher physics examination, there are only 20 multiple choice questions, implying a significantly lower reliability than for biology.
Against this, there are five answer options, instead of four, reducing the chances of lucky guesses. Unfortunately, five versus four options does not make up for 20 versus 30 questions - reliability is still reduced compared with biology and chemistry.
The multiple choice portion must contribute to the overall discriminatory power of the examination by increasing the total number of items tested, but it is regrettable that it introduces such a huge element of unreliability. How many points are contributed by the remainder of the exam depends on the subject. To quote figures for past years, in biology the rest of the examination adds a maximum of 100 points to the 30 from the multiple choice portion. In chemistry, the remainder contributes up to 70 points, making the unreliability of the multiple choice element proportionately more important. In physics, the remainder contributes about 76 points.
The multiple choice sections should not be abandoned, but they could be improved. The standard way to discourage guessing is to deduct fractional marks for wrong answers (but no longer requiring that all questions be answered). If that were done, it would be advantageous to change from multiple choice questions to "true-false" items, in which the veracity of simple statements has to be judged.
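The usual choice of penalty is 1/(k-1) of a mark on a k-option question, set so that a blind guess is worth nothing on average. A minimal sketch (the penalty values here are the standard textbook scheme, not any board's actual rules) makes the arithmetic explicit:

```python
from fractions import Fraction

def expected_guess_mark(options, penalty):
    """Expected mark from one random guess, with a deduction for each wrong answer."""
    p_right = Fraction(1, options)
    return p_right - (1 - p_right) * penalty

# Penalty of 1/(k-1) makes a blind guess worth zero on average
print(expected_guess_mark(4, Fraction(1, 3)))  # 0: four-option multiple choice
print(expected_guess_mark(2, Fraction(1, 1)))  # 0: true-false with a full-mark penalty
```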
These work well in universities. One of their advantages is that more items can be tested in the same length of time, increasing test reliability even further through allowing better coverage of the syllabus.
Richard Burton and David Miller work in the Institute of Biomedical and Life Sciences at Glasgow University.