Is Pisa fundamentally flawed?
They are the world’s most trusted education league tables. But academics say the Programme for International Student Assessment rankings are based on a ‘profound conceptual error’. So should countries be basing reforms on them?
In less than five months, the most influential set of education test results the world has ever seen will be published. Leaders of the most powerful nations on Earth will be waiting anxiously to find out how they have fared in the latest Programme for International Student Assessment (Pisa).
In today’s increasingly interconnected world, where knowledge is supplanting traditional industry as the key to future prosperity, education has become the main event in the “global race”. And Pisa, the assessment carried out by the Organisation for Economic Cooperation and Development (OECD) every three years, has come to be seen as education’s most recognised and trustworthy measure.
Politicians worldwide, such as England’s education secretary Michael Gove, have based their case for sweeping, controversial reforms on the fact that their countries’ Pisa rankings have “plummeted”. Meanwhile, top-ranked success stories such as Finland have become international bywords for educational excellence, with other ambitious countries queuing up to see how they have managed it.
Pisa 2012 - due to be published on 3 December 2013 - will create more stars, cause even more angst and persuade more governments to spend further billions on whatever reforms the survey suggests have been most successful.
But what if there are “serious problems” with the Pisa data? What if the statistical techniques used to compile it are “utterly wrong” and based on a “profound conceptual error”? Suppose the whole idea of being able to accurately rank such diverse education systems is “meaningless”, “madness”?
What if you learned that Pisa’s comparisons are not based on a common test, but on different students answering different questions? And what if switching these questions around leads to huge variations in the all-important Pisa rankings, with the UK finishing anywhere between 14th and 30th and Denmark between fifth and 37th? What if these rankings - that so many reputations and billions of pounds depend on, that have so much impact on students and teachers around the world - are in fact “useless”?
This is the worrying reality of Pisa, according to several academics who are independently reaching some damning conclusions about the world’s favourite education league tables. As far as they are concerned, the emperor has no clothes.
Perhaps just as worrying are the responses provided when TES put the academics’ arguments to the OECD. On the key issue of whether different countries’ education systems are correctly ranked by Pisa, and whether these rankings are reliable, the answer is less than reassuring.
The sample data used mean that there is “uncertainty”, the OECD admits. As a result, “large variation in single (country) ranking positions is likely”, it adds.
The organisation has always argued that Pisa provides much deeper and more nuanced analysis than mere rankings, offering insights into which education policies work best. But the truth is that, for many, the headline rankings are the start and finish of Pisa and that is where much of its influence lies.
On other important questions, such as whether there is a “fundamental, insoluble mathematical error” at the heart of the statistical model used for Pisa, the OECD has offered no response.
Concerns about Pisa have been raised publicly before. In England, Gove’s repeated references to the country plunging down the Pisa maths, reading and science rankings between 2000 and 2009 have led to close scrutiny of the study’s findings.
Last year, the education secretary received an official reprimand from the UK Statistics Authority for citing the “problematic” UK figures from 2000, which the OECD itself had already admitted were statistically invalid because not enough schools took part.
The statistics watchdog also highlighted further criticisms made by Dr John Jerrim, a lecturer in economics and social statistics at the Institute of Education, University of London. He notes that England’s Pisa fall from grace was contradicted by the country’s scores in the rival Trends in International Mathematics and Science Study, which rose between 1999 and 2007. Jerrim’s paper also suggests that variations in the time of year that students in England took the Pisa tests between 2000 and 2009 could have skewed its results.
The OECD tells TES that this amounts to speculation, and that Jerrim looked only at tests within the UK and did not address Pisa’s main objective, which is to provide a “snapshot comparison” between 15-year-olds in different countries.
But it is the “snapshot” nature of Pisa - the fact that it looks at a different cohort of 15-year-olds every three years - that is one of the chief criticisms levelled at the programme by Harvey Goldstein, professor of social statistics at the University of Bristol in the South West of England.
“I was recommending 10 years ago to the OECD that it should try to incorporate longitudinal data, and it simply hasn’t done that,” Goldstein tells TES. He would like to see Pisa follow a cohort over time, as the government pupil database in England does, so that more causal relationships could be studied.
Criticisms of Pisa’s scope are important. But the deeper methodological challenges now being made are probably even more significant, although harder to penetrate.
It should be no surprise that they have arisen. A fair comparison between more than 50 different education systems operating in a huge variety of cultures, which allows them to be accurately ranked on simple common measures, was always going to be enormously difficult to deliver in a way that everyone agrees with.
Goldstein notes concerns that questions used in Pisa tests have been culturally or linguistically biased towards certain countries. But he explains that the OECD’s attempts to tackle this problem by ruling out questions suspected of bias can have the effect of “smoothing out” key differences between countries.
“That is leaving out many of the important things,” he warns. “They simply don’t get commented on. What you are looking at is something that happens to be common. But (is it) worth looking at? Pisa results are taken at face value as providing some sort of common standard across countries. But as soon as you begin to unpick it, I think that all falls apart.”
For Goldstein, the questions that Pisa ends up with to ensure comparability can tend towards the lowest common denominator.
“There is a risk to that,” admits Michael Davidson, head of the OECD’s schools division. “In a way, you can’t win, can you?” Nevertheless, Pisa still finds “big differences” between countries, he says, despite “weeding out what we suspect are culturally biased questions”.
Davidson also concedes that some of the surviving questions still “behave” differently in different countries. And, as we shall see, this issue lies at the heart of some of the biggest claims against Pisa.
The Pisa rankings are like any education league tables in that they wield enormous influence but, because of their necessary simplicity, are imperfect. Where they arguably differ is in the lack of public awareness of these limitations.
The stakes are also high in England’s domestic performance tables, with the very existence of some schools resting on their outcomes. But in England the chosen headline measure - the proportion of students achieving five A*-C GCSEs including English and maths - is easily understandable. Its shortcomings - it comes nowhere near to capturing everything that schools do and encourages a disproportionate focus on students on the C-D grade borderline - are also widely known.
Pisa, by comparison, is effectively a black box, with little public knowledge about what goes on inside. Countries are ranked separately in reading, maths and science, according to scores based on their students’ achievements in special Pisa tests. These are representative rather than actual scores because they have been adjusted to fit a common scale, where the OECD average is always 500. So in the 2009 Pisa assessment, for example, Shanghai finished top in reading with 556, the US matched the OECD average with 500 and Kyrgyzstan finished bottom with 314.
You might think that to achieve a fair comparison - and bearing in mind that culturally biased questions have been “weeded out” - all students participating in Pisa would have been asked to respond to exactly the same questions.
But you would be wrong. For example, in Pisa 2006, about half the participating students were not asked any questions on reading and half were not tested at all on maths, although full rankings were produced for both subjects. Science, the main focus of Pisa that year, was the only subject that all participating students were tested on.
Professor Svend Kreiner of the University of Copenhagen, Denmark, has looked at the reading results for 2006 in detail and notes that another 40 per cent of participating students were tested on just 14 of the 28 reading questions used in the assessment. So only approximately 10 per cent of the students who took part in Pisa were tested on all 28 reading questions.
“This in itself is ridiculous,” Kreiner tells TES. “Most people don’t know that half of the students taking part in Pisa (2006) do not respond to any reading item at all. Despite that, Pisa assigns reading scores to these children.”
People may also be unaware that the variation in questions isn’t merely between students within the same country. There is also between-country variation.
For example, eight of the 28 reading questions used in Pisa 2006 were deleted from the final analysis in some countries. The OECD says that this was because they were considered to be “dodgy” and “had poor psychometric properties in a particular country”. However, in other countries the data from these questions did contribute to their Pisa scores.
In short, the test questions used vary between students and between countries participating in exactly the same Pisa assessment.
The OECD offered TES the following explanation for this seemingly unlikely scenario: “It is important to recognise that Pisa is a system-level assessment and the test design is created with that goal in mind. The Pisa assessment does not generate scores for individuals but instead calculates plausible values for each student in order to provide system aggregates.”
It then referred to an explanation in a Pisa technical report, which notes: “It is very important to recognise that plausible values are not test scores and should not be treated as such. They are random numbers drawn from the distribution of scores that could be reasonably assigned to each individual.” In other words, a large portion of the Pisa rankings is not based on actual student performance at all, but on “random numbers”.
To calculate these “plausible values”, the OECD uses something called the Rasch model. By feeding actual student scores into this statistical “scaling model”, Pisa’s administrators aim to work out a plausible version of what the scores would have been if all students in all countries had answered the same questions.
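As a rough illustration of what a “plausible value” is, the sketch below draws a handful of random values for one student from a distribution of scores that could reasonably be assigned to that student. It is a toy and not the OECD’s procedure: the normal distribution, the helper name `draw_plausible_values`, and every number in it are assumptions made purely for illustration.

```python
import random
import statistics


def draw_plausible_values(estimate, uncertainty, n_draws=5, seed=0):
    """Draw n_draws random values from Normal(estimate, uncertainty).

    Hypothetical helper: the normal shape and the inputs are assumptions,
    not the OECD's actual scaling procedure.
    """
    rng = random.Random(seed)
    return [rng.gauss(estimate, uncertainty) for _ in range(n_draws)]


# A student who answered many reading questions: small uncertainty (assumed).
well_measured = draw_plausible_values(estimate=520.0, uncertainty=20.0)

# A student who answered few or no reading questions: the spread of values
# that could reasonably be assigned is assumed to be much wider.
barely_measured = draw_plausible_values(estimate=520.0, uncertainty=80.0)

print("well measured:  ", [round(v) for v in well_measured])
print("barely measured:", [round(v) for v in barely_measured])
print("spread of draws:",
      round(statistics.stdev(well_measured)), "vs",
      round(statistics.stdev(barely_measured)))
```

The point of the sketch is only that each draw is a random number, not a test score: the less a student was actually asked, the more widely the draws can scatter.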
Inside the black box
The Rasch model is at the heart of some of the strongest criticisms being made of Pisa. It is also the black box within Pisa’s black box: exactly how the model works is something that few people fully understand.
But Kreiner does. He was a student of Georg Rasch, the Danish statistician who gave his name to the model, and has personally worked with it for 40 years. “I know that model well,” Kreiner tells TES. “I know exactly what goes on there.” And that is why he is worried about Pisa.
He says that for the Rasch model to work for Pisa, all the questions used in the study would have to function in exactly the same way - be equally difficult - in all participating countries. According to Kreiner, if the questions have “different degrees of difficulty in different countries” - if, in technical terms, there is differential item functioning (DIF) - Rasch should not be used.
“That was the first thing that I looked for, and I found extremely strong evidence of DIF,” he says. “That means that (Pisa) comparisons between countries are meaningless.”
Of course, as already stated, the OECD does seek to “weed out” questions that are biased towards particular countries. But, as the OECD’s Davidson admits, “there is some variation” in the way that questions work in different countries even after that weeding out. “Of course there is,” he says. “It would be ridiculous to expect every item (question) to behave exactly the same. What we work to do is to minimise that variation.”
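The condition at issue can be shown in a few lines. In the standard Rasch model, the chance of a correct answer depends only on the gap between a student’s ability and the question’s difficulty. The sketch below applies that formula to one question whose difficulty is assumed to differ between two countries, which is what DIF means. The country names and all the numbers are invented for illustration and are not taken from Pisa data.

```python
import math


def rasch_probability(ability, difficulty):
    """Rasch model: P(correct) = 1 / (1 + exp(-(ability - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical difficulties for the SAME question in two countries.
# If the question worked identically everywhere, these would be equal.
difficulty_by_country = {"Country A": 0.0, "Country B": 1.0}

# Two students with identical ability on the model's scale (assumed value).
ability = 0.5

for country, difficulty in difficulty_by_country.items():
    p = rasch_probability(ability, difficulty)
    print(f"{country}: P(correct) = {p:.2f}")
```

With these invented numbers, two equally able students have roughly a 62 per cent and a 38 per cent chance of answering correctly, depending only on where they sit the test. A model that assumes a single difficulty per question would read that gap as a difference in ability.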
But Kreiner’s research suggests that the variation is still too much to allow the Rasch model to work properly. In 2010, he took the Pisa 2006 reading test data and fed them through the Rasch model himself. He said that the OECD’s claims did not stand up because countries’ rankings varied widely depending on the questions used. That meant the data were unsuitable for Rasch and therefore Pisa was “not reliable at all”.
“I am not actually able to find two items in Pisa’s tests that function in exactly the same way in different countries,” Kreiner said in 2011. “There is not one single item that is the same across all 56 countries. Therefore, you cannot use this model.”
The OECD hit back with a paper the same year written by one of its technical advisers, Ray Adams, who argued that Kreiner’s work was based only on analysis of questions in small groups selected to show the most variation. The organisation suggested that when a large pool of questions was used, any variations in rankings would be evened out.
But Kreiner responded with a new paper this summer that broke the 2006 reading questions down in the same groupings used by Pisa. It did not include the eight questions that the OECD admits were “dodgy” for some countries, but it still found huge variations in countries’ rankings depending on which groups of questions were used.
In addition to the UK and Denmark variations already mentioned, the different questions meant that Canada could have finished anywhere between second and 25th and Japan between eighth and 40th. It is, Kreiner says, more evidence that the Rasch model is not suitable for Pisa and that “the best we can say about Pisa rankings is that they are useless”.
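How a ranking can swing with the choice of questions can be seen in a toy calculation. The sketch below is not Kreiner’s analysis: it uses three invented countries, four invented questions whose difficulty is assumed to differ by country, and ranks countries by the expected share of correct answers rather than by a fitted Rasch scale.

```python
import math


def rasch_probability(ability, difficulty):
    """Rasch model: P(correct) = 1 / (1 + exp(-(ability - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Invented average abilities (A is nominally the strongest).
abilities = {"A": 0.3, "B": 0.2, "C": 0.1}

# Invented difficulties: each question is harder in some countries
# than in others, i.e. the questions show DIF.
difficulties = {
    "A": {"q1": 0.0, "q2": 0.0, "q3": 1.0, "q4": 1.0},
    "B": {"q1": 0.5, "q2": 0.5, "q3": 0.0, "q4": 0.5},
    "C": {"q1": 1.0, "q2": 0.5, "q3": 0.5, "q4": 0.0},
}


def rank_countries(items):
    """Rank countries by expected share of correct answers on `items`."""
    expected = {
        country: sum(
            rasch_probability(abilities[country], difficulties[country][q])
            for q in items
        ) / len(items)
        for country in abilities
    }
    return sorted(expected, key=expected.get, reverse=True)


print("ranking on q1 and q2:", rank_countries(["q1", "q2"]))
print("ranking on q3 and q4:", rank_countries(["q3", "q4"]))
```

With these made-up numbers, country A comes first on one pair of questions and last on the other, although its assumed ability never changes.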
According to the OECD, not all Rasch experts agree with Kreiner’s position that the model cannot be used when DIF is present. It also argues that no statistical model will exactly fit the data anyway, to which Kreiner responds: “It is true that all statistical models are wrong. But some models are more wrong than other models and there is no reason to use the worst model.” Kreiner further accuses the OECD of failing to produce any evidence to prove that its rankings are robust.
The organisation says that researchers can request the data. But, more significantly, it has now admitted that there is “uncertainty” surrounding Pisa country rankings and that “large variation in single ranking positions is likely”.
It attributes this to “sample data” rather than the unsuitability of its statistical model. But the variation in possible rankings quoted by the OECD, although smaller than Kreiner’s, may still raise eyebrows. In 2009, the organisation said the UK’s ranking was between 19th and 27th for reading, between 23rd and 31st for maths, and between 14th and 19th for science.
Serious concerns about the Rasch model have also been raised by Dr Hugh Morrison from Queen’s University Belfast in Northern Ireland. The mathematician doesn’t just think that the model is unsuitable for Pisa - he is convinced that the model itself is “utterly wrong”.
Morrison argues that at the heart of Rasch, and other similar statistical models, lies a fundamental, insoluble mathematical error that renders Pisa rankings “valueless” and means that the programme “will never work”.
He says the model insists that when students of the same ability answer the same question - in perfect conditions, when everything else is equal - some students will always answer correctly and some incorrectly. But Morrison argues that, in those circumstances, the students would by definition all give a correct answer or would all give an incorrect one, because they all have the same ability.
“This is my fundamental problem, which no one can solve as far as I can see because there is no way of solving it,” says the academic, who wants to give others the chance to refute his argument mathematically. “I am a fairly cautious mathematician and I am certain this cannot be answered.”
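The property Morrison objects to is easy to reproduce. In the sketch below, a group of simulated students are all given exactly the same ability and the same question; the Rasch model assigns each a probability of answering correctly, so some succeed and some fail. The group size, the ability and difficulty values, and the random seed are arbitrary assumptions, and the code only shows what the model assumes. It does not settle whether Morrison’s objection is right.

```python
import math
import random


def rasch_probability(ability, difficulty):
    """Rasch model: P(correct) = 1 / (1 + exp(-(ability - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


rng = random.Random(42)  # fixed seed so the run is repeatable

ability = 0.0      # every student has the same ability (assumed value)
difficulty = 0.0   # every student faces the same question (assumed value)
n_students = 1000

p = rasch_probability(ability, difficulty)

# Under the model, each answer is an independent chance event with probability p.
n_correct = sum(rng.random() < p for _ in range(n_students))

print(f"P(correct) for every student: {p:.2f}")
print(f"answered correctly: {n_correct} of {n_students}")
```

Under the model the outcome is split even though nothing distinguishes the students. Morrison’s argument is that identical students in identical conditions should all succeed or all fail.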
Morrison also contests Rasch because he says the model makes the “impossible” claim of being able to measure ability independently of the questions that students answer. “Consider a GCSE candidate who scores 100 per cent in GCSE foundation mathematics,” he tells TES. “If Einstein were to take the same paper, it seems likely that he, too, would score 100 per cent. Are we to assume that Einstein and the pupil have the same mathematical ability? Wouldn’t the following, more conservative claim, be closer to the truth: that Einstein and the pupil have the same mathematical ability relative to the foundation mathematics test?”
When TES put Morrison’s concerns to the OECD, it replied by repeating its rebuttal of Kreiner’s arguments. But Kreiner makes a different point: he argues against the suitability of Rasch for Pisa, whereas Morrison claims that the model itself is “completely incoherent”.
After being forwarded a brief summary of Morrison’s case, Goldstein says that Morrison highlights “an important technical issue”, rather than the “profound conceptual error” claimed by the Belfast academic. But the two do agree in their opinion of the OECD’s response to criticism. Morrison has put his points to several senior people in the organisation, and says that they were greeted with “absolute silence”. “I was amazed at how unforthcoming they were,” he says. “That makes me suspicious.”
Goldstein first published his criticisms of Pisa in 2004, but he says they have yet to be addressed. “Pisa steadfastly ignored many of these issues,” he says. “I am still concerned.”
The OECD tells TES that: “The technical material on Pisa that is produced each cycle seeks to be transparent about Pisa methods. Pisa will always be ready to improve on that, filling gaps that are perceived to exist or to remove ambiguities and to discuss these methodologies publicly.”
But Kreiner is unconvinced: “One of the problems that everybody has with Pisa is that they don’t want to discuss things with people criticising or asking questions concerning the results. They didn’t want to talk to me at all. I am sure it is because they can’t defend themselves.”
What’s the problem?
“(Pisa) has been used inappropriately and some of the blame for that lies with Pisa itself. I think it tends to say too much for what it can do and it tends not to publicise the negative or the weaker aspects.” Professor Harvey Goldstein, University of Bristol, England
“We are as transparent as I think we can be.” Michael Davidson, head of the OECD’s schools division
“The main point is, it is not up to the rest of the world to show they (the OECD) are wrong. It is up to Pisa to show they are right. They are claiming they are doing something and they are getting a lot of money to do it, and they should support their claims.” Professor Svend Kreiner, University of Copenhagen, Denmark
“There are very few things you can summarise with a number and yet Pisa claims to be able to capture a country’s entire education system in just three of them. It can’t be possible. It is madness.” Dr Hugh Morrison, Queen’s University Belfast, Northern Ireland
- Papers by Professor Svend Kreiner: bit.ly/SvendKreiner and
- Article and link to paper by Dr John Jerrim: bit.ly/EnglandPerformance
- Paper by Professor Harvey Goldstein: bit.ly/HarveyGoldstein
- Paper by OECD technical adviser Ray Adams: bit.ly/RayAdams
- Paper by Dr Hugh Morrison: bit.ly/MorrisonBelfast