There was widespread condemnation again last week of the publication of "raw" examination results, on the grounds that they lead to unfair comparisons between schools. So let's look again at the evidence.
The Scottish Office Audit Unit publication, How good are our results?, describes two methods of measuring schools' performance at Higher which are generally assumed to be "fairer" because they allow for the social differences between students by considering "prior performance".
The candidates for, say, Higher geography are identified according to their assumed "ability", as measured by Standard grade performance. The average grade obtained by each candidate across all the Standard grade subjects taken the previous year is called the GPA (grade point average). The average grade obtained in Higher geography by all the candidates with a particular GPA is then used as a benchmark against which each individual candidate's "progress" is measured.
For example, considering just those candidates who had a GPA of 2.0 (ie Credit level, on average) at Standard grade, suppose that the average grade they obtain in Higher geography is B. Any individual candidate with a GPA of 2.0 who obtains a Higher grade of A has therefore achieved "above average" (or positive) progress, and if that candidate scores a C, then they have made "below average" (or negative) progress.
There are many different factors that influence Higher results (for example, some candidates may receive additional private tuition - a "positive" influence - while others may be emotionally disturbed during their S5 year - a "negative" influence). However, when the average progress made by all Higher geography candidates in a school is calculated (called the department's value-added indicator), we can expect these random influences to cancel one another out. Subject departments with positive indicators must therefore have done a "better than average job of preparing their candidates for the examination". At least this is the interpretation given in some Scottish Office publications.
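The arithmetic of the benchmark, progress and value-added calculations described above can be sketched in a few lines of code. The numeric grade points (A=3, B=2, C=1) and the candidate data here are invented purely for illustration; they are not drawn from any real cohort or from the Scottish Office's own method.

```python
from statistics import mean

# Hypothetical grade points: A=3, B=2, C=1 (illustrative only).
# Each candidate: (Standard grade GPA, Higher geography grade points).
candidates = [(2.0, 3), (2.0, 2), (2.0, 1), (2.0, 2)]

def benchmark(gpa, cohort):
    """Average Higher grade of all candidates sharing a given GPA
    (nationally this would pool candidates from every school)."""
    return mean(g for p, g in cohort if p == gpa)

def progress(gpa, grade, cohort):
    """One candidate's 'progress': own grade minus the benchmark."""
    return grade - benchmark(gpa, cohort)

def value_added(department, cohort):
    """A department's value-added indicator: mean progress of its
    own candidates, measured against the national benchmark."""
    return mean(progress(p, g, cohort) for p, g in department)
```

With these invented figures, the benchmark for a GPA of 2.0 is a grade of 2.0 (a B), so a department whose two candidates scored an A and a B would show a positive indicator of half a grade.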
This interpretation is, however, based on the premise that the random influences will cancel one another out. This is more or less true for large numbers of candidates (say more than 100), but most subject departments only have about 24 candidates a year.
In any one year the candidates might have more positive than negative influences and vice versa the next year; so the department's indicator can be expected to fluctuate from one year to the next. The indicator needs to be well above this level of random fluctuation before it is possible to infer that the department is doing "a better than average job", but how big is the random fluctuation?
I carried out several computer simulations to determine its size and found it to be 0.17 of a grade. Since more than 400 schools are involved, it is necessary to adopt a significance level of at least one in 400, and this produces a confidence level of 0.5 of a grade. If this value is adopted, only value-added indicators above +0.5 or below -0.5 are significantly different from zero, and if the indicators of subject departments in two different schools are within 0.5 grades of each other they cannot be regarded as different.
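A simulation of the kind referred to above might look like the sketch below. The noise model (independent Gaussian "progress" with true effect zero) and the department sizes are my assumptions for illustration, not the author's actual code; the point it demonstrates is simply that the indicator of a small department fluctuates much more from year to year than that of a large one.

```python
import random
from statistics import mean, stdev

def simulate_indicator(n_candidates, noise_sd=1.0, rng=random):
    """One year's value-added indicator for a department whose
    candidates' 'progress' is pure random noise (true effect = 0)."""
    return mean(rng.gauss(0.0, noise_sd) for _ in range(n_candidates))

def fluctuation(n_candidates, n_years=10000, seed=0):
    """Year-to-year spread (standard deviation) of the indicator
    over many simulated years."""
    rng = random.Random(seed)
    return stdev(simulate_indicator(n_candidates, rng=rng)
                 for _ in range(n_years))

# The mean of n noisy grades fluctuates like noise_sd / sqrt(n),
# so a department of 24 candidates fluctuates roughly twice as
# widely as one of 100.
```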
On this basis, it is not possible to distinguish between most schools in the country, and any "league table" of value-added indicators would be meaningless. How good are our results? describes absolute values above 0.25 of a grade as "noticeably better or worse", a value that is only half as big as mine. Even then, absolute values below this level are still shown, though they are statistically insignificant.
It is notable that value-added indicators are highly correlated with performance indicators based on "raw" exam results. So a school that obtained "good" results in the previous "league tables" will generally get positive value-added indicators - a school's social context is still a more important influence on its examination results than teacher "effectiveness".
The second method of calculating a subject department's "performance" (the relative rating) is to compare the average grade of its candidates with the average grade they scored in all the other Higher subjects they were presented for. A subject department with a large positive relative rating is assumed to have helped its students to obtain better results than in their other subjects. (In practice this simple calculation is complicated by the fact that not all subjects have the same "difficulty". The "difficulty rating" of a subject is obtained by averaging relative ratings over all candidates nationally.)

Unfortunately, this determination assumes that the candidates for, say, geography in a particular school are similar to geography candidates in all the other schools. If one school contains a few particularly motivated candidates, who are less motivated in other subjects, then the relative rating of this department will be positive. The next year it might be a different department that is "favoured". So the relative ratings of a department can be expected to fluctuate from year to year through unpredictable causes, just as value-added indicators do. Both measures fluctuate more widely for departments with fewer than 25 candidates than for larger departments, which suggests that random influences are indeed at work.
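The relative-rating comparison can be sketched in the same style. Again the grade points and the two candidates are invented for illustration, and the "difficulty rating" shown here is simply the same average taken over a national pool of candidates rather than one department.

```python
from statistics import mean

# Each candidate's Higher grades by subject
# (illustrative points: A=3, B=2, C=1).
candidates = [
    {"geography": 3, "english": 2, "maths": 2},
    {"geography": 2, "english": 2, "maths": 1},
]

def relative_rating(candidate, subject):
    """One candidate's relative rating in a subject: the grade there
    minus the average grade in all their other Higher subjects."""
    others = [g for s, g in candidate.items() if s != subject]
    return candidate[subject] - mean(others)

def department_rating(cands, subject):
    """A department's relative rating: the mean over its candidates.
    Averaged over all candidates nationally, the same quantity gives
    the subject's 'difficulty rating'."""
    return mean(relative_rating(c, subject) for c in cands)
```

Here the first candidate scores a full grade better in geography than elsewhere, the second half a grade better, so the department's relative rating is +0.75 - but two strongly motivated candidates are all it takes to produce that figure.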
The Scottish Office does not produce indicators for departments with fewer than 10 candidates, but my contention is that this limit is far too low.
All this is not to say that there are no "bad" teachers or "underperforming" departments, but it does suggest that value-added indicators and relative ratings are not much use in identifying them. The cost of producing them is therefore indefensible.
Bob Sparkes is in the education department, Stirling University.