Why RCTs are effective for education research

Some find randomised controlled trials unreliable, flawed and unethical - but they are often the best way to find out what works in classrooms, says Steve Higgins
15th February 2019, 12:05am


Some believe the complexity of education makes reliable results from randomised controlled trials (RCTs) impossible. Some believe you cannot truly get comparable groups, rendering such an approach flawed. Some simply believe the method is unethical. But RCTs are necessary in education. Without them, we would miss out on an important part of the puzzle in our hunt to find out what might work in the classroom.

And the accusations often levelled at RCTs - like those above - need more careful consideration, not blind acceptance: they are easy criticisms to make but harder to prove.

So, as the push towards evidence-informed practice heightens the calls for and against RCTs - and, as teachers become better placed to critique research and take part - here’s what you might call a defence of RCTs in education. (Personally, I prefer to see this as an explanation for why an RCT may not be a perfect measure but can be a necessary step.)

First, some history. Most people assume that RCTs seeped into education from medicine. This narrative suggests that when education was seeking to become more evidence based, it lifted the practice from medical research in the belief that teachers would soon be able to boast the same as doctors: that what they were doing was, reliably, the best known way of doing it.

Certainly, it is medicine where the approach has been carefully and systematically developed, and it makes sense that this is the case: after all, it is important to know whether a new treatment will heal or harm.

The origins of the RCT can be traced back to pioneers such as James Lind, who investigated cures for scurvy in the Navy in the 1750s; Charles Peirce, who devised a randomisation scheme to avoid experimenter bias in psychology experiments in the 1880s; and Karl Pearson, in his research of vaccines for typhoid in the early 1900s.

But other fields have played their part, too. For example, much of the statistical practice to rule out chance findings was undertaken in the first half of the 20th century by Ronald Fisher in his work on crop yields at Rothamsted Experimental Station, in Hertfordshire.

The common purpose of all this was to develop an experimental approach that avoided bias, provided a fair comparison for the intervention or approach that was being tested and which took account of the risk of finding a misleading result by chance.

An RCT achieves that first by having a comparison group. This is sometimes described as the “counterfactual” condition and represents what we think would happen without the intervention or approach being tried. It controls for normal educational improvement over time.

It also needs to have groups that are reasonably equivalent in terms of known and unknown factors that might explain any improvement. In education, this can partly be achieved through matching pupils or creating equivalent groups that are as similar as possible in terms of the known factors that might explain any differences in outcomes: these include current level of attainment, age, gender, free school meal or special educational need and disability (SEND) status.

The advantage of randomisation is that, on average, it controls for known and unknown differences. As pupils or classes are allocated at random, any differences are randomly spread between the groups. Unexpected differences will occur, but these are by chance, not through any identifiable or unknown bias.
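A quick simulation makes this concrete. The sketch below (illustrative figures only, not from any real trial) gives each pupil a "known" factor (prior attainment) and an "unknown" one (a hidden motivation score), then allocates pupils to two groups at random. On average, both factors end up evenly spread between the groups:

```python
import random

random.seed(42)

# Hypothetical cohort: each pupil has a prior-attainment score (a "known"
# factor) and a hidden motivation score (an "unknown" factor).
pupils = [
    {"attainment": random.gauss(100, 15), "motivation": random.gauss(0, 1)}
    for _ in range(200)
]

# Randomly allocate pupils to intervention or control.
random.shuffle(pupils)
intervention, control = pupils[:100], pupils[100:]

def mean(group, key):
    return sum(p[key] for p in group) / len(group)

# Both the known and the unknown factor are spread evenly between the
# groups -- any residual gap is due to chance, not systematic bias.
for key in ("attainment", "motivation"):
    gap = mean(intervention, key) - mean(control, key)
    print(f"{key} gap between groups: {gap:+.2f}")
```

Note that randomisation balances the hidden motivation score even though the researcher never measured it - something matching on known factors alone cannot guarantee.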

The general approach of an RCT is usually very similar:


• First, you decide what you want to find out: for example, how effective an educational technique - such as using peer tutoring in mathematics - is compared with normal practice in schools.

• Then, you identify a fair assessment to demonstrate this, such as a standardised test of mathematics or key stage 2 national test scores.

• Next, you recruit a large enough number of schools and classes to take part, depending on how big you expect the effect to be, and so you can be confident that any impact is the result of the peer tutoring and not other things that happen in schools. It’s like throwing two dice: you can get any result between 2 and 12. Throw them hundreds of times and the average score will be about 7 (unless the dice are loaded). The larger the sample, the more confident you can be, but the harder it is to get the approach up and running well, and the more expensive the trial becomes.

• Then you test all of the classes before you start (unless you are using existing national test data) and you assign the classes to be involved in the peer tutoring using randomisation - this is to make it fair and to avoid any possible bias (such as giving peer tutoring to your top set or the most cooperative kids or, indeed, the opposite).

• The peer tutoring or “intervention” group is given additional training to make sure the children in it know what to do, while the comparison or “control” group carries on as usual.

• After what seems like a reasonable length of time for the approach to have made a difference, you test the classes again and compare the results between the two groups. All of these steps are pre-specified to avoid deliberate or unconscious cheating.
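The steps above can be sketched end to end as a small simulation. All the numbers here are invented for illustration (40 classes, an assumed true effect of +3 points for peer tutoring); the point is the shape of the design - pre-test, randomise, intervene, post-test, compare gains:

```python
import random
import statistics

random.seed(1)

# Illustrative numbers only: 40 classes with pre-test scores around 100.
classes = [{"pre": random.gauss(100, 10)} for _ in range(40)]

# Randomise classes to the intervention or control arm.
random.shuffle(classes)
tutoring, control = classes[:20], classes[20:]

TRUE_EFFECT = 3.0  # hypothetical benefit of peer tutoring

# Post-test: every class improves a little over time; the intervention
# group gets the assumed effect on top. The noise term stands in for
# everything else that happens in schools.
for c in tutoring:
    c["post"] = c["pre"] + 2 + TRUE_EFFECT + random.gauss(0, 4)
for c in control:
    c["post"] = c["pre"] + 2 + random.gauss(0, 4)

def mean_gain(group):
    return statistics.mean(c["post"] - c["pre"] for c in group)

# Because the control group captures normal improvement over time, the
# difference in gains estimates the effect of the tutoring itself.
estimated_effect = mean_gain(tutoring) - mean_gain(control)
print(f"estimated effect of peer tutoring: {estimated_effect:+.1f} points")
```

With only 20 classes per arm the estimate is noisy; rerunning with more classes tightens it, which is exactly the sample-size trade-off described above.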


The variations in possible trial designs are copious: you can have more than one version of the educational approach to see whether more peer tutoring is better (such as an hour twice a week, compared with four times a week); you can provide the comparison group with peer tutoring once you have collected the post-test data (and once you are sure it is effective); you could design the research to look for other differences (such as the effects on pupils eligible for free school meals or children with SEND).

The key feature is that you are setting a fair test to see whether a particular educational approach or technique has a particular impact or effect.

Sounds perfect, right? But while RCTs are very common in medicine, it is still relatively rare for education research to adopt this model. A systematic review led by Professor Paul Connolly at Queen’s University Belfast found a total of 1,017 RCTs in education completed and reported between 1980 and 2016. Just over three-quarters of these were undertaken in the past 10 years.

For comparison, each year there are many thousands of educational research studies published in more than 1,000 education journals. Each journal publishes several issues a year, with half a dozen or so research studies. So, only a tiny proportion of the research studies undertaken in education are randomised trials trying to find a causal link between what we do in schools and whether it has worked.

Why so few? One reason is that RCTs are expensive. A rigorous evaluation adds about 20 per cent to the cost of an intervention or new approach. Over 15 years, the Education Endowment Foundation (EEF) will have cost the taxpayer £125 million, which was the amount of the initial government endowment when the education research charity was first established in 2011 (although the EEF expects to raise up to £100 million in additional funding for educational research and development in England during this period).

But we need to keep this cost in proportion. In 2018-19 alone, we will spend around £2.4 billion on the pupil premium, enough to pay for nearly 20 EEFs each year. If the evidence from RCTs redirects even a small proportion of this so that it can be spent more effectively, surely it is worth the investment?

Another reason RCTs are relatively scarce is that the research needs to be on a large enough scale for the results to be reasonably conclusive. It is often hard to find enough schools willing to take part and to subject themselves to the conditions required to ensure a fair test. While many schools would be happy to try out an approach to test whether a catch-up programme works for a group of struggling readers, far fewer will sign up for a test to find out whether mixed attainment groups are more effective than setting across key stage 2, or are prepared to reorganise the timetable to see whether a later start to the school day suits teenagers’ learning in key stage 3.

But we definitely do need to try to make RCTs work. They address a particular type of question about effectiveness that needs to be answered alongside all the other questions we seek to resolve. There is no better way of finding out what an RCT can find out: is approach X responsible for effect Y?

An RCT is a tool, like a chisel, that has been designed and developed for a particular function. Chisels are not useful for hammering nails, sawing wood to size or fixing items with screws; by the same token, randomised trials should not be used to identify or understand the complex causal processes that lie behind an idea. But RCTs can identify what is effective, on average, and how much better it is, even if not the “how?” or the “why?”.

And yet, some do not agree that RCTs are worthwhile. They have come under heavy fire from teachers, academics and others, whose objections can be grouped into three key assumptions.


1. RCTs are not possible to undertake in education

This is clearly false: there is a long history of this kind of research in educational settings and there have been more than 150 in England since 2011.

What is interesting here is that it is proving harder to work out what is effective than we perhaps expected. Only about a third of EEF trials have a worthwhile positive impact. While negative effects are, thankfully, rare, it is harder to make a difference than we would hope and many EEF trials, especially the larger ones, do not show a conclusive benefit for what was tested - although this does not mean they do not work. This is an important finding, if hard to sell to policymakers.


2. RCTs do not reflect the real world of education

It is an easy criticism to make but very hard to argue effectively. Why? Because reflecting the real world is exactly what an RCT is trying to do. The EEF and other organisations often commission trials at three levels:

• Pilot trials, to identify promise or potential. These are usually small in scale and explore whether something is feasible.

• Efficacy trials, where something is tested across a few classrooms or schools to see whether it can work. These are often designed to give the approach a good chance, with optimal training and support.

• Effectiveness trials, which address the question “does it work in the real world and does it make a positive difference, on average, to the attainment of the pupils involved?” These are usually on a large scale, across perhaps 100 schools, with real-world conditions, where you want to establish if this is something most schools could adopt.

Each of these approaches uses an RCT design but one that is tailored to fit the particular question. I think of these as: could it work? Can it work? Does it work?

Pilot trials are a bit like trying a new fruit or vegetable in a greenhouse, efficacy trials are a bit like testing it in a polytunnel, and effectiveness trials are the equivalent to growing it in the fields.

Far from not reflecting the real world, that is exactly what large-scale RCTs test out.


3. Randomised trials are not ethical in schools

Most of what we do in schools is untested. Which is more unethical: using untested approaches or trying rigorously to find out if some approaches work better than others?

I’d argue that it is unethical not to do any randomised trials. In medicine, similar concerns have led to the development of “non-inferiority trials”, where you compare a new approach against what is regarded as best practice. If it works just as well, you have a viable alternative and can explore if it is better for some kinds of patients. If it works better, on average, you have a new version of best practice.

In education, we take a general view of typical practice as the comparison or control. It might well be useful to specify what that typical practice is, so we are clearer about what something is better than.

Another ethical issue is around consent - who should decide who needs to consent? Legally, children cannot consent (they can, however, be consulted and give their agreement or assent). Changes in legislation mean that researchers can use a “public good” defence and not ask for individual consent from parents. This is often where the research uses publicly available data from key stage tests or examinations.

Asking for consent is not simple. It is expensive and time-consuming. The legal changes mean that you need a definite response (sometimes called “opt-in”). In some schools, you can get a low return rate and this can affect the robustness of the research by skewing your sample. You also have to decide whether you prevent children whose parents have not consented from a potentially beneficial intervention, or whether you just don’t record their data.

But let’s be clear: research in education usually carries low ethical risk. We are trying to bring about improvement; the question is the extent of that improvement, so the risk of harm is usually very low.

Which is worse, trying to find out rigorously, or not trying to find out and acting on beliefs and assumptions?


RCTs are not a panacea. We must be clear what they can and cannot do. Overstating their impact is as bad as understating it.

The results from an effective design tell you that, in a particular study, you can be reasonably sure that any average difference (in terms of the specific tests you have used and only over the duration of the research project) was the result of the specific approach you have tested.

One study of peer tutoring that has a positive impact does not tell you that peer tutoring “works” for all schools, even in the same subject and with the same-aged pupils. You don’t know if the effects last or just wash out after a few months. We need to be honest about this, and aware of it when we are reading and interpreting research.

However, RCTs are the only research design that can tell us if something has actually “worked” in a particular study. What we need to do is not banish them but get better at deciding what is worth putting under the RCT microscope.

Without them, we will always have to rely on our intuition and judgement, which are fallible and prone to human bias. An RCT is an integral part of the research process.

If we can never tell with reasonable certainty that a particular approach had a particular effect, there is not much point in working out why or how.

Steve Higgins is professor of education at Durham University and author of the Sutton Trust-EEF Teaching and Learning Toolkit

This article originally appeared in the 8 February 2019 issue under the headline “Why RCTs should be education’s BFF”
