Attacks on Pisa are entirely unjustified

All assessments are subject to some uncertainty, but the OECD programme provides robust comparisons of nations’ performance

2nd August 2013, 1:00am

Andreas Schleicher

In its 26 July edition, TESS finds it remarkable that the OECD has “admitted” that rankings from Pisa surveys have a degree of uncertainty (“Is Pisa fundamentally flawed?”). But the range of possible ranks has never been a secret. Any assessment of the skills of people, whether it is a high-school exam, a driving test or an international assessment such as Pisa (the Programme for International Student Assessment), will have some uncertainty because the results depend on the tasks that are chosen, on variations in how the assessment is administered and even on the disposition of the person taking the test.

The goal of Pisa is not to eliminate uncertainty, but to design instruments that provide robust comparisons of the performance of education systems in ways that reflect any remaining uncertainty. For that reason, for example, each country is assigned a range of ranks in the Pisa reports, rather than given a precise rank order. That uncertainty is published as part of any rankings, so it can hardly be considered a secret.
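To make the idea of a range of ranks concrete, here is a minimal sketch in Python. It is not the procedure documented in the Pisa reports: it simply assumes each system's mean score is normally distributed around its estimate, redraws the scores many times and records the ranks that come up. The systems A to D, their means and their standard errors are all invented for illustration.

```python
import random

def rank_ranges(means, std_errors, draws=10_000, coverage=0.95, seed=0):
    """Simulate plausible rank ranges from mean scores and their standard errors."""
    rng = random.Random(seed)
    names = list(means)
    simulated = {name: [] for name in names}
    for _ in range(draws):
        # Redraw every mean score within its uncertainty, then rank (1 = highest).
        scores = {n: rng.gauss(means[n], std_errors[n]) for n in names}
        ordered = sorted(names, key=scores.get, reverse=True)
        for rank, n in enumerate(ordered, start=1):
            simulated[n].append(rank)
    tail = (1 - coverage) / 2
    ranges = {}
    for n in names:
        ranks = sorted(simulated[n])
        best = ranks[int(tail * draws)]
        worst = ranks[min(draws - 1, int((1 - tail) * draws))]
        ranges[n] = (best, worst)
    return ranges

# Hypothetical systems and figures, not Pisa results.
means = {"A": 520, "B": 518, "C": 500, "D": 470}
std_errors = {"A": 3, "B": 3, "C": 3, "D": 3}
ranges = rank_ranges(means, std_errors)
```

In this made-up example, A and B are too close to separate and share the range of ranks 1 to 2, while C and D sit far enough apart to hold ranks 3 and 4 respectively.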

Professor Svend Kreiner and Professor Karl Bang Christensen suggest in their report (bit.ly/PisaScaling) that there should be no variability in performance on individual questions between students in different countries. Little consideration is needed to realise that this idea is nonsense. For example, although all children learn maths, teachers, schools and education systems vary in the emphasis that they place on different mathematical topics and in the ways they teach these topics. That variation is exactly what you will see in the performance of students on different types of mathematical tasks.

Any test based on a few tasks that did not show variability would have to be constructed around things that are taught in all schools in exactly the same ways. It would reflect the stale, lowest common denominator of national curricula and thus would be irrelevant for policy or instructional purposes.

For this reason, it would be naive and meaningless to design a test that compares students, schools or education systems on the basis of only a few tasks, using data from a small number of students. Yet this is what Kreiner and Christensen do. They use data from only one of the 13 Pisa test booklets from the 2006 survey, and they look only at reading questions, even though the main subject of Pisa 2006 was science. The Pisa sample from each country is fully representative and comprises at least 4,000 students, and in most cases more. Kreiner and Christensen, however, base their analysis on a small subset of the sample in each country, in some cases as few as 250 students.
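A rough calculation shows why the size of the sample matters. The Python sketch below makes two simplifying assumptions that are not taken from the analysis discussed here: that students are drawn by simple random sampling (Pisa's actual sampling design is more complex) and that scores have a standard deviation of 100 points. Under those assumptions the standard error of a country mean shrinks with the square root of the number of students.

```python
import math

def standard_error(sd, n):
    """Standard error of a mean under simple random sampling (an assumption)."""
    return sd / math.sqrt(n)

ASSUMED_SD = 100  # illustrative spread of scores, in score points

se_full = standard_error(ASSUMED_SD, 4000)   # sample of 4,000 students
se_subset = standard_error(ASSUMED_SD, 250)  # subset of 250 students

# How many times wider the uncertainty becomes with the smaller sample.
ratio = se_subset / se_full
```

On these assumptions, a mean based on 250 students is four times as uncertain as one based on 4,000, whatever standard deviation is assumed, because the ratio depends only on the two sample sizes.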

They do not explain why they did not analyse the full data for each country. This was not because of lack of access, as they acknowledge that Pisa data are made fully and publicly available by the Organisation for Economic Cooperation and Development (OECD). Their decision to restrict the data in this way ignores the complexity of the information gathered in a realistic assessment situation. And, since their method resulted in a wider range of ranks than the method used in Pisa, it is difficult to see how they can claim that their analysis is more robust.

Imagine you were looking at a picture through the myopic lens of Kreiner and Christensen, which shows you just one pixel at a time. Of course, every time you look at the picture in that way you are going to see a different pixel that tells you a different story. But the point of Pisa is to portray the whole picture.

The Pisa tests involve several hours of assessment material, which is distributed to students in 30-minute packets. Each student completes a two-hour test paper with questions drawn from this pool of assessment material. It is only when all these questions are taken into account that reliable and robust measures of individual performance can be derived. Pisa has convincingly and conclusively shown that the design of the tests and the scaling model used to score them lead to robust measures of country performance that are not affected by the composition of the item pool.

The results of these analyses are documented in the Pisa Technical Reports, and there has been considerable technical discussion of the statistical models used in Pisa, much of which has been published either on the OECD website or in academic journals and conference proceedings. The mention in TESS (26 July) that Kreiner is sceptical about Pisa may perhaps go some way towards explaining why he chooses to ignore all this.

Pisa and the analysis undertaken by Kreiner use variants of Rasch statistical models. Dr Hugh Morrison raises fundamental philosophical objections to the original Rasch model introduced in the 1960s. His paper (bit.ly/MorrisonBelfast) does not discuss any of the applications and developments of this model in the past 50 years. He does not show any knowledge of the methods used in the Pisa survey and does not refer to any of the technical literature on Pisa. It is difficult to see how his paper can be considered relevant to the methodological debate.
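For readers unfamiliar with it, the basic Rasch model for right-or-wrong questions can be written in a few lines of Python. This is the textbook form only, not the variant used to scale Pisa: the chance of a correct answer depends on the gap between a student's ability and a question's difficulty, both placed on the same scale. The function name and the example values are illustrative.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct answer under the basic dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A student whose ability matches the question's difficulty has an even chance.
p_matched = rasch_probability(0.0, 0.0)

# The same student facing an easier and a harder question (illustrative values).
p_easier = rasch_probability(0.0, -1.0)
p_harder = rasch_probability(0.0, 1.0)
```

The model is symmetric: a question one unit easier raises the chance of success by as much as a question one unit harder lowers it.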

The OECD does not see any scientific or academic merit in these papers, and considers the accusations made in TES and TESS, based on these flawed analyses, to be completely unjustified.

Andreas Schleicher is deputy director for education and special adviser on education policy to the OECD’s secretary general.