Is Pisa fundamentally flawed?
The results that reveal the ‘winners’ in the global race in education lead to billions of pounds’ worth of reforms. But leading academics say the rankings are ‘useless’ and based on ‘profound conceptual errors’. The world’s favourite education league table is being put to the test, William Stewart discovers
In less than five months, the most influential set of education test results the world has ever seen will be published.
Leaders of the most powerful nations on earth will be waiting anxiously to find out how they have fared in the latest edition of the Programme for International Student Assessment (Pisa).
In today’s increasingly interconnected world - where knowledge is overtaking traditional industry as the key to future prosperity - education has become the main event in the “global race”. And it is Pisa, the assessment carried out by the Organisation for Economic Co-operation and Development (OECD) every three years, that has come to be seen as its most recognised, and trustworthy, measure.
Politicians worldwide, including England’s education secretary Michael Gove, have based their case for sweeping, controversial reforms on the fact that Pisa rankings have “plummeted”. Meanwhile, top-ranked Pisa success stories such as Finland have become international bywords for educational excellence, with other ambitious countries queuing up to see how they have managed it.
Pisa 2012 - due to be published on 3 December 2013 - will create more stars, cause even more angst and persuade more governments to spend further billions on whatever reforms the survey suggests have been most successful.
But what if there were “serious problems” with the Pisa data? What if the statistical techniques employed were “utterly wrong” and based on a “profound conceptual error”? Suppose the whole idea of being able to rank such diverse education systems accurately was “meaningless”, “madness”?
What if you learned that Pisa’s comparisons were based not on a common test but on different students answering different questions? And what if switching these questions round led to huge variations in the all-important Pisa rankings, with the UK finishing anywhere between 14th and 30th and Denmark between fifth and 37th? What if these rankings - that so many reputations and billions of pounds depend on, that will have so much impact on students and teachers around the world - are in fact “useless”?
This is the worrying reality about Pisa, according to several academics who are independently reaching some damning conclusions about the world’s favourite education league tables. As far as they are concerned, the emperor has no clothes.
Perhaps even more worrying are the responses provided when TES put these academic challenges to the OECD. On the key issue of whether different countries’ education systems are correctly ranked by Pisa, and whether these rankings are reliable, the answer is less than reassuring.
The sample data used meant that there was “uncertainty”, the OECD admits. As a result, “large variation in single (country) ranking positions is likely”, it adds.
The organisation has always argued that Pisa provides much deeper and more nuanced analysis than mere rankings - offering insights into which education policies work best. But the truth is that, for many, the headline rankings are the start and finish of Pisa and that is where much of its influence lies.
On other important questions, such as whether there is a “fundamental, insoluble mathematical error” at the heart of the statistical model used for Pisa, the OECD has offered no response.
Concerns about Pisa have been publicly raised before. In England, Gove’s repeated references to the country plunging down the Pisa maths, reading and science rankings between 2000 and 2009 have led to some close scrutiny of the study’s findings.
Last year, the education secretary received an official reprimand from the UK Statistics Authority for using the “problematic” figures from 2000, which the OECD itself had already admitted were statistically invalid because not enough schools took part.
The statistics watchdog also highlighted further criticisms made by Dr John Jerrim, lecturer in economics and social statistics at the Institute of Education, University of London. He notes that England’s Pisa-related fall from grace was contradicted by the country’s scores on the rival Trends in International Mathematics and Science Study, which rose between 1999 and 2007. Jerrim’s paper also suggests that variations in the time of year that students in England took the Pisa tests between 2000 and 2009 could have skewed its results.
The OECD told TES that this amounted to speculation and that Jerrim had looked at tests only within the UK and had not addressed Pisa’s main objective - to provide a “snapshot comparison” between 15-year-olds in different countries.
But it is that “snapshot” nature of Pisa - the fact that it looks at a different cohort of 15-year-olds every three years - that is one of the chief criticisms levelled at the programme by Professor Harvey Goldstein, professor of social statistics at the University of Bristol.
“I was recommending 10 years ago to the OECD that it should try to incorporate longitudinal data and it simply hasn’t done that,” Goldstein told TES. He would like to see Pisa follow a cohort over time, like the English government’s student database, so that more causal relationships could be studied.
Criticisms about Pisa’s scope are important. But it is the deeper methodological challenges now being made that are probably even more significant, although harder to penetrate.
It should be no surprise that they have arisen. Any attempt to compare more than 50 education systems operating in a huge variety of cultures fairly - and to rank them accurately on simple common measures - was always going to be enormously difficult to deliver in a way that everyone agrees with.
Goldstein notes concerns that questions used in Pisa tests have been culturally or linguistically biased towards certain countries. But he explains that when the OECD has tried to tackle this problem by ruling out questions suspected of bias, it can have the effect of “smoothing out” key differences between countries.
“Many of the important things simply don’t get commented on,” he warns. “What you are looking at is something that happens to be common. But (is it) worth looking at? Pisa results are taken at face value as providing some sort of common standard across countries. But as soon as you begin to unpick it, I think that all falls apart.”
For Goldstein, the set of questions Pisa compiles to ensure comparability ends up being the lowest common denominator.
“There is a risk to that,” admits Michael Davidson, head of the OECD’s schools division. “In a way you can’t win, can you?” Nevertheless, Pisa still finds “big differences” between countries, he says, despite “weeding out what we suspect are culturally biased questions”.
He also concedes that some of the surviving questions do still “behave” differently in different countries. And as we shall see, it is this issue that lies at the heart of some of the biggest claims against Pisa.
The Pisa rankings are like any other education league tables: they wield enormous influence but, because of their necessary simplicity, they are imperfect. Where they arguably differ is in the lack of awareness of those limitations.
The stakes are also high in England’s domestic performance tables, with the very existence of some schools resting on their outcome. But here the chosen headline measure - the proportion of students achieving five good GCSEs including English and maths - is easily understandable. Its shortcomings - that it comes nowhere near to capturing everything schools do and encourages a disproportionate focus on students on the C/D grade borderline - are also widely known.
Pisa by comparison is in effect a black box, with little public knowledge about what goes on inside. Countries are ranked separately in reading, maths and science, according to scores based on their students’ achievements in special Pisa tests. These are representative rather than actual scores because they have been adjusted to fit a common scale - where the OECD average is always 500. So in the previous Pisa assessment, for example, Shanghai finished top in reading with 556, the US matched the OECD average with 500 and Kyrgyzstan finished bottom with 314.
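The kind of transformation involved can be sketched in a few lines (the raw scores and the simple linear standardisation below are illustrative assumptions of mine - the OECD’s actual scaling is far more elaborate):

```python
# Illustrative sketch of putting raw test scores on a Pisa-style scale
# with mean 500 and standard deviation 100. The raw scores are invented.

def to_pisa_scale(raw_scores, target_mean=500.0, target_sd=100.0):
    """Linearly rescale raw scores so their mean and spread match the
    target scale (a simplification of the real scaling procedure)."""
    n = len(raw_scores)
    mean = sum(raw_scores) / n
    sd = (sum((x - mean) ** 2 for x in raw_scores) / n) ** 0.5
    return [target_mean + target_sd * (x - mean) / sd for x in raw_scores]

raw = [12.0, 15.0, 9.0, 18.0, 6.0]          # hypothetical raw scores
scaled = to_pisa_scale(raw)
print([round(s) for s in scaled])            # centred on 500
```

On such a scale, a country whose students average exactly the overall mean lands at 500, which is why the OECD average anchors the published tables.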
You might think that, to achieve a fair comparison - and bearing in mind that culturally biased questions have been “weeded out” - all students participating in Pisa would have been asked to respond to exactly the same questions.
But you would be wrong. For example, in Pisa 2006, about half of the participating students were not asked any questions on reading and half were not tested at all on maths, although full rankings were produced for both subjects. Science, the main focus of Pisa that year, was the only subject on which all participating students were tested.
Professor Svend Kreiner of the University of Copenhagen has looked in detail at the reading results for 2006 and notes that another 40 per cent of participating students were tested on just 14 of the 28 reading questions used in the assessment. So only approximately 10 per cent of the students who took part in Pisa were tested on all 28 reading questions.
“This in itself is ridiculous,” Kreiner told TES. “Most people don’t know that half of the students taking part in Pisa (2006) do not respond to any reading item at all. Despite that, Pisa assigns reading scores to these children.”
People may also be unaware that the differences in questions don’t just occur between students within the same country. There are also between-country differences in the questions.
For example, eight of the 28 reading questions used in Pisa 2006 were deleted from the final analysis in some countries. The OECD says that this is because they were considered “dodgy” and “had poor psychometric properties in a particular country”. However, in other countries the data from these questions did contribute to their Pisa scores.
In other words, the test questions used will vary both between students and between different countries participating in exactly the same Pisa assessment.
The OECD offered TES the following explanation for this seemingly unlikely scenario: “It is important to recognise that Pisa is a system-level assessment and the test design is created with that goal in mind. The Pisa assessment does not generate scores for individuals but instead calculates plausible values for each student in order to provide system aggregates.”
It then referred to an explanation in a Pisa technical report, which notes: “It is very important to recognise that plausible values are not test scores and should not be treated as such. They are random numbers drawn from the distribution of scores that could be reasonably assigned to each individual.” In other words, much of the Pisa rankings are not based on actual student performance at all, but on “random numbers”.
To calculate these random “plausible values”, the OECD uses something called the Rasch model. By feeding actual student scores into this statistical “scaling model”, Pisa’s administrators aim to work out a plausible version of what the scores would have been if all students in all countries had answered the same questions.
It is the Rasch model that is at the heart of some of the strongest academic criticisms being made about Pisa. It is also the black box within Pisa’s black box. Exactly how the model works is something that few people fully understand.
But Kreiner does. He was a student of Georg Rasch, the Danish statistician who gave his name to the model, and has personally worked with the model for 40 years. “I know that model well,” Kreiner told TES, “I know exactly what goes on there.” And that is why he is worried about Pisa.
He says that for the Rasch model to work for Pisa, all the questions used in the study would have to work in exactly the same way - be equally difficult - in all participating countries. According to Kreiner, if the questions have “different degrees of difficulty in different countries” - if, in technical terms, there is differential item (question) functioning (DIF) - Rasch should not be used.
“That was the first thing that I looked for and I found extremely strong evidence of DIF,” he says. “That means that (Pisa) comparisons between countries are meaningless.”
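Kreiner’s objection can be illustrated with a toy example (the abilities and item difficulties below are invented): if one item is harder in country B than in country A, two countries with identical student ability will compare differently depending on whether that item is included.

```python
import math

def p_correct(theta, b):
    # Rasch probability of a correct answer (ability theta, difficulty b)
    return 1.0 / (1.0 + math.exp(b - theta))

def expected_score(theta, difficulties):
    # Expected number of correct answers over a set of items
    return sum(p_correct(theta, b) for b in difficulties)

theta = 0.0  # students in both countries have identical ability

# Hypothetical difficulties: item1 behaves the same everywhere, but
# item2 is much harder in country B - differential item functioning.
items_a = {"item1": 0.0, "item2": -1.0}
items_b = {"item1": 0.0, "item2": 1.0}

# Using both items, country A appears stronger...
both_a = expected_score(theta, items_a.values())
both_b = expected_score(theta, items_b.values())

# ...but drop the DIF item and the two countries are identical.
one_a = expected_score(theta, [items_a["item1"]])
one_b = expected_score(theta, [items_b["item1"]])

print(round(both_a, 2), round(both_b, 2))  # different
print(round(one_a, 2), round(one_b, 2))    # identical
```

Scaled up from two items to a full test, this is the mechanism by which swapping question groups in and out can move a country many places in the rankings.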
Of course, as already stated, the OECD does seek to “weed out” questions that are biased towards particular countries. But, as the OECD’s Davidson admits, “there is some variation” in the way questions work in different countries, even after the weeding out. “Of course there is,” he says. “It would be ridiculous to expect every item (answer) to behave exactly the same. What we work to do is to minimise that variation.”
But Kreiner’s research suggests that this variation is still too much to allow the Rasch model to work properly. In 2010, he took the Pisa 2006 reading test data and fed them through Rasch himself. He found that the OECD’s claims did not stand up because countries’ rankings varied widely depending on the questions used. That meant that the data were unsuitable for Rasch and therefore Pisa was “not reliable at all”.
“I am not actually able to find two items in Pisa’s tests that function in exactly the same way in different countries,” he said in 2011. “There is not one single item that is the same across all 56 countries. Therefore, you cannot use this model.”
The OECD hit back with a paper the same year written by one of its technical advisers, Ray Adams, arguing that Kreiner’s work was based only on analysing questions in small groups selected to show the most variation. It suggested that when a large pool of questions was used, any variations in rankings would be evened out.
But Kreiner responded with a new paper this summer that broke the 2006 reading questions down in the same groupings used by Pisa. It did not include the eight questions that the OECD admits were “dodgy” for some countries, but it still found huge variations in countries’ rankings depending on which groups of questions were used.
In addition to the UK and Denmark variations already mentioned, the different questions meant that Canada could finish anywhere from second place to 25th and Japan from eighth to 40th. It is, Kreiner says, more evidence that the Rasch model is not suitable for Pisa and that “the best we can say about Pisa rankings is that they are useless”.
According to the OECD, not all Rasch experts agree with Kreiner’s position that the model cannot be used when DIF is present. It also argues that no statistical model will exactly fit the data anyway, to which Kreiner responds: “It is true that all statistical models are wrong. But some models are more wrong than other models and there is no reason to use the worst model.” Kreiner further accuses the OECD of failing to produce any evidence to prove that its “rankings” are robust.
The organisation says that researchers can request the data. But more significantly it has now admitted that there is “uncertainty” surrounding Pisa country rankings and that “large variation in single ranking positions is likely”.
It attributes this to “sample data” rather than the unsuitability of its statistical model. But the ranges in possible rankings that the OECD has quoted, although smaller than Kreiner’s, may still raise eyebrows. In 2009, it said that the UK’s ranking for reading was between 19th and 27th, between 23rd and 31st for maths, and 14th and 19th for science.
Deep concerns about the Rasch model are also being raised by Dr Hugh Morrison from Queen’s University Belfast. The mathematician doesn’t just think that the model is unsuitable for Pisa; he is convinced that the model itself is “utterly wrong”.
Morrison argues that at the heart of Rasch, and other similar statistical models, lies a fundamental, insoluble mathematical error that renders Pisa rankings “valueless” and means that Pisa “will never work”.
He says the model insists that when a group of students of the same ability answer the same question - in perfect conditions, when everything else is equal - there will always be some students with correct answers and some with wrong ones. But Morrison says that this is an error because, in those circumstances, the students would by definition all give a correct answer or would all give an incorrect one, because they all have the same ability.
“This is my fundamental problem, which no one can solve as far as I can see because there is no way of solving it,” says the academic, who wants to give others the chance to refute it mathematically. “I am a fairly cautious mathematician and I am certain this cannot be answered.”
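The assumption Morrison disputes is visible in the model’s formula itself: the Rasch probability of a correct answer is always strictly between 0 and 1, so simulated students of identical ability split between right and wrong answers rather than all responding the same way (a toy simulation with invented numbers):

```python
import math
import random

def p_correct(theta, b):
    # Rasch probability of success: always strictly between 0 and 1
    return 1.0 / (1.0 + math.exp(b - theta))

rng = random.Random(1)
theta, b = 0.5, 0.5   # every student has the same ability as the item's difficulty
outcomes = [rng.random() < p_correct(theta, b) for _ in range(1000)]
print(sum(outcomes))  # roughly half answer correctly: the model predicts a split
```

Whether that split reflects genuine randomness in human performance, or a flaw in the model, is precisely the point of contention.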
Morrison also contests Rasch because he says that it makes the “impossible” claim of being able to measure ability independently of the questions that students answer. “Consider a GCSE candidate who scores 100 per cent in GCSE foundation mathematics,” he said.
“If Einstein were to take the same paper, it seems likely that he, too, would score 100 per cent. Are we to assume that Einstein and the pupil have the same mathematical ability? Wouldn’t the following, more conservative, claim be closer to the truth: that Einstein and the pupil have the same mathematical ability relative to the foundation mathematics test?”
When TES put Morrison’s concerns to the OECD, it replied by repeating its rebuttal of Kreiner. But the Dane makes a different point and argues against the suitability of Rasch for Pisa, whereas Morrison says that the model itself is “completely incoherent”.
After being forwarded a brief summary of Morrison’s case, Goldstein says he highlights “an important technical issue” rather than a “fundamental conceptual error”. But where the two academics do agree is in their opinion of the OECD’s response to criticism. Morrison has put his points to several senior people in the organisation and says that they were greeted with “absolute silence”. “I was amazed at how unforthcoming they were,” he told TES. “That makes me suspicious.”
Goldstein first published his criticisms of Pisa in 2004, but says they have yet to be addressed. “Pisa steadfastly ignored many of these issues,” he says. “I am still concerned.”
The OECD told TES: “The technical material on Pisa that is produced each cycle seeks to be transparent about Pisa methods. Pisa will always be ready to improve on that - filling gaps that are perceived to exist or to remove ambiguities and to discuss these methodologies publicly.”
But Kreiner is unconvinced: “One of the problems that everybody has with Pisa is that they don’t want to discuss things with people criticising or asking questions concerning the results. They didn’t want to talk to me at all. I am sure it is because they can’t defend themselves.”
WHAT’S WRONG WITH PISA?
- “It (Pisa) has been used inappropriately and some of the blame for that lies with Pisa. I think it tends to say too much for what it can do and it tends not to publicise the negative or the weaker aspects.” Professor Harvey Goldstein, University of Bristol
- “The main point is, it is not up to the rest of the world to show they (the OECD) are wrong. It is up to Pisa to show they are right. They are claiming they are doing something and they are getting a lot of money to do it and they should support their claims.” Professor Svend Kreiner, University of Copenhagen
- “There are very few things you can summarise with a number and yet Pisa claims to be able to capture a country’s entire education system in just three of them. It can’t be possible.” Dr Hugh Morrison, Queen’s University Belfast
- “We are as transparent as I think we can be.” Michael Davidson, head of the OECD’s schools division