Can we trust the results of high-stakes tests?

When teachers and schools are under intense pressure to get results, how can we know if a jump in attainment represents a real leap in learning? Daniel Koretz warns Irena Barker that the more we tie assessments to accountability, the less reliable they become.

27th September 2019, 12:03am

Irena Barker

“The more you tell people they are going to be accountable for a given metric, the less you can trust that metric,” according to Daniel Koretz, Henry Lee Shattuck Research Professor of Education at the Harvard Graduate School of Education.

In an education system that often appears completely obsessed with high-stakes tests - and that holds up schools achieving the best exam results as exemplars - his comment will be music to the ears of some teachers, while others will scoff.

Either way, Koretz has done his homework: a former elementary school teacher, he has spent much of the past 30 years researching and writing about high-stakes tests, including in his books Measuring Up and The Testing Charade. All that work has left him sceptical of their reliability.

“The more the people involved in the tests perceive pressure to raise scores, the more likely they are to change their behaviour; for example, by teaching to the test or - at the extreme end of the spectrum - cheating,” he says.

It is important to note that the notion of “high-stakes” is fluid, Koretz adds.

“There are a lot of people who believe that a test is only high-stakes if there are concrete and tangible sanctions and rewards [such as job security] … but the research is pretty clear that you don’t need those concrete sanctions and rewards. All you need is to create a system in which teachers think that test scores are what’s going to matter for them.”

‘You can’t trust the results’

A mere “strategy of applied anxiety” where a school emphasises the importance of test scores regularly is enough to make tests high-stakes, Koretz explains.

But how do we know that such pressure leads to behaviours like teaching to the test and even cheating? Researchers look for score inflation, where test scores suggest gains in performance that are not reflected in gains in actual learning. This was studied empirically for the first time in the US by Koretz in 1991.

“The results of that paper were really quite striking,” he says. “The answer was, ‘No, you can’t trust the results.’ ”

For the study, Koretz and a group of researchers looked at a school district that had replaced one commercial, multiple-choice test with a similar alternative. They found that maths scores at the end of 3rd grade (Year 4) immediately dropped by half an academic year. Four years later, scores on the new test had caught up with the level that had been reached on the old test in its final year of use.

At this point, the researchers stepped in and asked pupils in a random sample of classrooms to sit the test that the district had abandoned four years earlier. Scores on that old test were now half an academic year lower than they had been in its final year of use.

The study concluded that information provided to the public based on the results of accountability-orientated tests could be “seriously misleading”.

Research since then has found similar outcomes. “There are relatively few studies [on score inflation in the US] but they are very, very consistent,” explains Koretz. “The typical finding is that inflation is common and can be very, very large.

“We have a number of studies in the US that show gains on high-stakes tests are three to six times as large as they should be.”

‘Huge disparities’

More recent research, Koretz adds, is starting to show that score inflation is not uniform, and that schools with more disadvantaged children are disproportionately under pressure. “The pressure increases as scores go down, so the lower the initial scores in the school, the greater the pressure to raise them,” he says.

To demonstrate the point, he cites a study from the year 2000 in which researchers examined the results of 4th graders in Texas’ state-wide TAAS accountability tests, which seemed to show a rapidly narrowing gap between minority pupils and non-Hispanic white pupils.

However, when they looked at the children’s results in the nationwide, low-stakes National Assessment of Educational Progress tests, they did not find this to be the case, instead identifying “huge disparities” between the stories told by the NAEP and local Texas test results.

Most strikingly, according to the NAEP results, the gap between white students and students of colour was large and increasing slightly, but according to Texas’ own tests, the gap was much smaller and decreasing substantially. The report suggested that the lowest-scoring schools with a greater number of low-income and minority students could be “particularly likely to suffer from overzealous efforts to raise scores”.

What might those efforts look like? Studies of teacher behaviour are usually based on surveys and rely on participants being honest about their classroom practices. Koretz led a study looking at how teachers had responded to a new high-stakes accountability regime in the state of Kentucky, where there were financial rewards for those whose scores improved and sanctions for those whose scores failed to rise (3).

“We were looking at the extent to which teachers said they decreased emphasis on some aspects of the curriculum to make time to emphasise the tests, and the answer was yes, they did that,” says Koretz.

In the study, almost three-quarters of school principals surveyed reported encouraging their teachers “a great deal” to focus instruction on skills or content likely to be included in the test. About 40 per cent of teachers reported that they “focused a great deal on increasing the match between the content of instruction and the content of the tests”.

Koretz says: “That stream of research basically confirmed over and over again one common type of inappropriate test preparation … simply reallocating instructional resources to better match the test.

“Proponents say, ‘If there’s good material on the test, why wouldn’t you do that?’ The reason is because the test is very small and incomplete. And in doing that, teachers have told us time and time again that they take time away from their important content because it’s not tested.”

Koretz also highlights a phenomenon known in the US as “coaching”: teaching test-specific strategies such as using a process of elimination in a multiple-choice assessment. Research shows that coaching inflates scores if it leads to higher performance than would result if those test-specific details were changed.

In addition, Koretz says, “gaming” practices - such as “off-rolling” children who won’t score well or holding them back a year - impact on the reliability of a set of scores for a given year group. He argues that all the above practices are made easier because of the predictability of many high-stakes tests.

“Part of the predictability is because it’s very expensive and difficult to write a new test paper. And so, it’s just easier if you have one that works to stick with something similar,” he says.

If teachers fear that they have slipped into any bad habits - or school leaders want to ensure such practices do not occur - Koretz suggests it is important for everyone to “take as their goal teaching the curriculum, not raising scores as an end in itself”. This will, ultimately, raise scores, but perhaps more slowly than if teachers were to take shortcuts such as coaching to the test.

Koretz also urges teachers to heed the words of the late Professor Audrey Qualls, a testing expert at the University of Iowa, US: “Students only really know something if they can use that knowledge when confronted with unfamiliar particulars.”

Koretz says this means that “teachers should limit the extent to which they present material in a form that mirrors the test, and in some cases should avoid it altogether”.

Tested to the limits

What Koretz is not calling for, though, is an end to testing: he has an extremely positive view of it, but only if it is used appropriately and if the stakes are not high.

“Standardised tests can tell us things no other sources of information can give us … but the key is minimising the incentive to prepare kids inappropriately,” Koretz explains. “We have a lot of good, trustworthy data showing that the gap between black and white kids has narrowed. That narrowing has slowed down in recent years but, over 25 years, we know the gap has narrowed.

“At the same time, the gap between the richest and poorest students has widened. That seems to be one of the most important pieces of descriptive information you could have about the functioning of the American school system. The only way we can know that is by using standardised tests - it’s the only metric that has the same meaning from one context to another.”

Koretz concludes: “My suggestion is that testing is essential, it’s the only good way to get really good uniform evaluations across contexts, but we are shooting ourselves in the foot by putting high stakes on them because the more we do that, the less we can believe what they tell us.”

Irena Barker is a freelance journalist

This article originally appeared in the 27 September 2019 issue under the headline “Tes focus on… High-stakes testing”
