One of the surprising facts about the Programme for International Student Assessment (Pisa) is that not all participating 15-year-olds answer the same questions.
In Pisa 2006, for example, about half of the participating pupils were not asked any reading questions, and only one in 10 was tested on all 28 of them. Similarly, half of the participating pupils were not tested at all on maths, even though full rankings were produced for both subjects. Science, the main focus of Pisa that year, was the only subject on which every participating pupil was tested.
The basic reason for this is straightforward – if all pupils took all of the questions, the tests would be too long. However, Pisa still assigned reading scores to 15-year-olds who did not answer any reading questions, and the same was true for maths, so that there was a full set of data from which to calculate country scores and rankings.
The Organisation for Economic Cooperation and Development (OECD), which runs Pisa, says it is a “system-level assessment” rather than a measurement of individual achievement. So, it argues, there is nothing wrong with calculating “plausible values” for pupils who were never asked particular questions.
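In broad terms, a plausible value is not a best estimate of an individual pupil’s score but a random draw from a statistical distribution of the scores that a pupil with those answers, and that background, could plausibly have achieved. In the notation below (the symbols are ours, for illustration), Pisa 2006 drew five such values per pupil:

\[ \theta_p^{(v)} \sim P\bigl(\theta \mid x_p, y_p\bigr), \qquad v = 1, \dots, 5 \]

where \(\theta_p\) is pupil \(p\)’s unobserved proficiency, \(x_p\) their answers to the questions they were actually asked, and \(y_p\) their background data. Averaged over thousands of pupils and draws, this yields stable country-level estimates even though no individual score is directly observed.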
To work out what these values should be, scores from pupils who did answer the questions are fed into a statistical “scaling model”. Up to and including Pisa 2012, the scaling model Pisa used was the “Rasch model” – a choice that turned out to be the subject of huge academic controversy.
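For readers who want the detail: the Rasch model is the simplest of the “item response theory” models. It assumes that the probability of pupil \(p\) answering question \(i\) correctly depends on just two numbers – the pupil’s ability \(\theta_p\) and the question’s difficulty \(b_i\) – through the standard logistic form (notation ours):

\[ P(X_{pi} = 1) = \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)} \]

Crucially, each question gets a single difficulty \(b_i\) that is assumed to hold for every pupil in every country – the very assumption the critics below took issue with.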
As TES revealed in 2013, Professor Svend Kreiner, a Danish expert in the Rasch model, argued that it was completely unsuitable for Pisa and would not work, because the same questions had different levels of difficulty in different countries. As a result, he said, Pisa’s comparisons between countries were “meaningless” and “useless”.
Dr Hugh Morrison, a mathematician at Queen’s University Belfast, went further and argued that the Rasch model itself was “utterly wrong” and rendered Pisa rankings “valueless”.
At the time, the OECD stuck to its guns and robustly rejected the criticism.
But in a TES interview last month, Andreas Schleicher, OECD education director, took a very different stance when he was asked about academic criticisms of Pisa. “The ones that are constant are ones that have helped shape Pisa, a lot, and our thinking as well,” he said.
When TES then brought up the criticisms of Rasch and asked whether the model was still being used, Schleicher responded: “No, we have now changed to a…That’s a good example where actually over the last few cycles, we started in 2009 to modify the Rasch model and then in 2012 we used a two-parameter variant and now in 2015 we have used the full three-parameter model that those [critics] were recommending.”
According to a subsequent interview with Pisa lead analyst Miyako Ikeda, Pisa has not actually abandoned Rasch altogether. But it has moved away from it – a change she says was brought in from 2015 rather than 2009 or 2012.
According to Ikeda, Pisa used a two-parameter scaling model for the first time last year. As well as looking at the level of difficulty of each question, as Rasch does, the model also takes into account how effective each question is at measuring performance – for example, whether a question’s difficulty reflects the core competence being assessed or extraneous factors, such as the language in which it is asked.
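In the standard two-parameter form, each question \(i\) gains a discrimination parameter \(a_i\) alongside its difficulty \(b_i\) (again, notation ours):

\[ P(X_{pi} = 1) = \frac{\exp\bigl(a_i(\theta_p - b_i)\bigr)}{1 + \exp\bigl(a_i(\theta_p - b_i)\bigr)} \]

A high \(a_i\) means a question separates stronger pupils from weaker ones sharply; a low \(a_i\) flags a question whose results are muddied by something other than the competence being measured. Setting every \(a_i = 1\) recovers the Rasch model, which is what makes the “hybrid” approach described below possible.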
Pisa now uses a mix of Rasch and the two-parameter model, depending on the question, in what Ikeda describes as a “hybrid” solution. But she says there are no plans to introduce a three-parameter model – which would also take into account how guessable questions are, as well as their level of difficulty and discrimination – because that would make the test too long.
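The three-parameter model that Schleicher referred to adds a guessing parameter \(c_i\) – the probability that even a very weak pupil gets question \(i\) right by chance, as can happen with multiple-choice items:

\[ P(X_{pi} = 1) = c_i + (1 - c_i)\,\frac{\exp\bigl(a_i(\theta_p - b_i)\bigr)}{1 + \exp\bigl(a_i(\theta_p - b_i)\bigr)} \]

Estimating \(c_i\) reliably requires considerably more responses per question, particularly from weaker pupils, which is consistent with Ikeda’s point about test length.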
The differences between Ikeda’s and Schleicher’s accounts are worth highlighting – not to be facetious, but because they illustrate how difficult it can be, even for those at the heart of Pisa, to fully explain details that are highly technical yet crucial.