The world's most influential education league tables are "useless", produce "meaningless" rankings and are compiled using techniques that are "utterly wrong", according to damning new academic criticism.
Politicians around the globe are increasingly using Pisa (Programme for International Student Assessment) results to formulate and justify their school reforms, often focusing on the headline country rankings. But researchers are now raising serious concerns about their reliability.
In response, the Organisation for Economic Cooperation and Development (OECD), which runs Pisa, has admitted that "large variation in single (country) ranking positions is likely". But academics who have spoken to TES argue that there are even deeper problems that the body is not acknowledging.
Professor Svend Kreiner, a statistician from the University of Copenhagen in Denmark, said that an inappropriate model is used to calculate the Pisa rankings every three years. In a paper published this summer, he challenges their reliability and shows how they fluctuate significantly according to which test questions are used. He reveals how, in the 2006 reading rankings, Canada could have been positioned anywhere between second and 25th, Japan between eighth and 40th and the UK between 14th and 30th.
Dr Hugh Morrison, from Queen's University Belfast in Northern Ireland, goes further, saying that the model Pisa uses to calculate the rankings is, on its own terms, "utterly wrong" because it contains a "profound" conceptual error. For this reason, the mathematician claims, "Pisa will never work".
The academics' papers have serious implications for politicians, including England's education secretary Michael Gove, who justified his sweeping reforms by stating that England "plummeted" down the Pisa rankings between 2000 and 2009.
The questions used for Pisa vary between countries and between students participating in the same assessment. In Pisa 2006, for example, half the students were not asked any reading questions but were allocated "plausible" reading scores to help calculate their countries' rankings.
To work out these "plausible" values, Pisa uses the Rasch model, a statistical way of "scaling" up the results it does have. But Professor Kreiner says this model can only work if the questions that Pisa uses are of the same level of difficulty for each of the participating countries. He believes his research proves that this is not the case, and therefore the comparisons that Pisa makes between countries are "useless".
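For readers who want to see the assumption at stake, the sketch below shows the textbook form of the Rasch model, in which the chance of a correct answer depends only on the gap between a student's ability and a question's difficulty. It is an illustration, not Pisa's actual scaling procedure. The function name `rasch_probability` and all the numbers are hypothetical and are not drawn from Pisa data or from Professor Kreiner's paper.

```python
import math


def rasch_probability(ability, difficulty):
    """Chance of a correct answer under the standard Rasch model.

    Both arguments are on the same logit scale; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical values for illustration only (not Pisa data).
ability = 0.5  # the same student ability in both countries

# Case 1: the question is equally difficult in countries A and B,
# which is what the model needs for cross-country comparison.
difficulty_shared = 0.0
p_country_a = rasch_probability(ability, difficulty_shared)
p_country_b = rasch_probability(ability, difficulty_shared)

# Case 2: the same question is harder in country B (for example, because
# of translation). Equally able students now succeed at different rates.
difficulty_in_b = 1.0
p_country_b_shifted = rasch_probability(ability, difficulty_in_b)

print(f"Equal difficulty:   A={p_country_a:.3f}  B={p_country_b:.3f}")
print(f"Harder in B:        A={p_country_a:.3f}  B={p_country_b_shifted:.3f}")
```

In the second case, a model that assumes one shared difficulty would read country B's lower success rate as lower ability, which is the kind of distortion Professor Kreiner's criticism points to.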
When the academic first raised the issue in 2011, the OECD countered by suggesting that he had been able to find such wild fluctuations in rankings only by deliberately selecting particular small groupings of questions to prove his point. But Professor Kreiner's new paper uses the same groups of questions as Pisa and comes up with very similar results to his initial analysis.
He is sceptical about the whole concept of Pisa. "It is meaningless to try to compare reading in Chinese with reading in Danish," he said.
Dr Morrison said that the Rasch model made the "impossible" claim of being able to measure ability independently of the questions that students answer. "I am certain this (problem) cannot be answered," he told TES.
The OECD's response did not address Dr Morrison's claims, but largely restated its arguments against Professor Kreiner's original 2011 paper on Pisa.
"Large variation in single ranking positions is likely, particularly among the group of countries that are clustered in the middle of the distribution, as the scores are similar," the organisation said. It claimed that country rankings take account of "the uncertainty that results from sample data" and are "therefore shown in the form of ranges rather than single ranking positions".
These ranges are given in Pisa results but not in the main tables. And although separate rankings are produced for England, no ranges are published for the country.
The OECD said Pisa questions were "tested to ensure they have the same relative difficulty across countries", but it has admitted to TES that some variation remains after this testing. On the suitability of Rasch, it said "no model exactly fits the data".
"Examination of alternative models has shown that the outcomes are identical," it added. "However, Pisa will always seek to improve and learn from the latest research methods."