The Myths about Replication

I have talked about replication a lot in my previous posts and why I believe it is central to healthy science. Unfortunately, a lot of myths surround replication and how it should be done. The most common assertion you will hear amongst my colleagues about replication is that “we should be doing more of it”. Now at some level I don’t really disagree with this of course. However, I think these kinds of statements betray a misunderstanding of what replication is and how science actually works in practice.

Replication is at the heart of science

Scientists attempt replication all of the time. Most experiments, certainly all of the good ones, include replication of previous findings as part of their protocol. This is because it is essential to have a control condition or a “sanity check” on which to build future research. In his famous essay/lecture “Cargo Cult Science”, Richard Feynman laments what was apparently a widespread lack of understanding of this issue in the psychological sciences. Specifically, he describes how he advised a psychology student:

“One of the students told me she wanted to do an experiment that went something like this – it had been found by others that under certain circumstances, X, rats did something, A. She was curious as to whether, if she changed the circumstances to Y, they would still do A. So her proposal was to do the experiment under circumstances Y and see if they still did A. I explained to her that it was necessary first to repeat in her laboratory the experiment of the other person – to do it under condition X to see if she could also get result A, and then change to Y and see if A changed. Then she would know the real difference was the thing she thought she had under control. She was very delighted with this new idea, and went to her professor. And his reply was, no, you cannot do that, because the experiment has already been done and you would be wasting time.”

The reaction of this professor is certainly foolish. Every experiment we do should build on previous findings and reconfirm previous hypotheses before attempting to address any new questions. Of course, this was written decades ago. I don’t know if things have changed dramatically since then but I can assure my esteemed colleagues amongst the ranks of the Crusaders for True Science that this sort of replication is common in cognitive neuroscience.

Let me give you some examples. Much of my alter ego’s research employs a neuroimaging technique called retinotopic mapping. In these experiments subjects lie inside an MRI scanner whilst watching flickering images presented at various locations on a screen. By comparing which parts of the brain are active when particular positions in the visual field are stimulated with images, experimenters can construct a map of how a person’s field of view is represented in the brain.

We have known for a long time that such retinotopic maps exist in the human brain. It all started with Tatsuji Inouye, a Japanese doctor, who studied soldiers whose bullet wounds had destroyed part of their cerebral cortex. He noticed that many patients experienced blindness at selective locations in their visual field. He managed to reconstruct the first retinotopic map by carefully plotting the correspondence of blind spots with the location of bullet wounds in certain parts of the brain.

Early functional imaging device for retinotopic mapping

With the advent of neuroimaging, especially functional MRI, came the advance that we no longer need to shoot bullets into people’s heads to do retinotopic mapping. Instead we can generate these maps non-invasively within a few minutes of scan time. Moreover, unlike these earlier neuroanatomical studies we can now map responses in brain areas where bullet wounds (or other damage) would not cause blindness. Even the earliest fMRI studies discovered additional brain areas that are organized retinotopically. More areas are being discovered all the time. Some areas we could expect to find based on electrophysiological experiments in monkeys. Others, however, appear to be unique to the human brain. Like with all science, the definition of these areas is not always without controversy. However, at this stage only a fool would doubt the existence of retinotopic maps and that they can be revealed with fMRI.

Not only that, but it is clearly a very stable feature of brain organization. Maps are pretty reliable on repeated testing even when using a range of different stimuli for mapping. The location of maps is also quite consistent between individuals so that if two people fixate the center of an analog clock, the number 3 will most likely drive neurons in the depth of the calcarine sulcus in the left cortical hemisphere to fire, while the number 12 will drive neurons in the lingual gyrus atop the lower lip of the calcarine. This generality is so strong that anatomical patterns can be used to predict the locations and borders of retinotopic brain areas with high reliability.

Of course, a majority of people will probably accept that retinotopy is one of the most replicated findings in neuroimaging. Retinotopic mapping analysis is even fairly free of concerns about activation amplitudes and statistical thresholds. Most imaging studies are not so lucky. They aim to compare the neural activity evoked by different experimental conditions (e.g. images, behavioral tasks, mental states), and then localize those brain regions that show robust differences in responses. Thus the raw activation levels in response to various conditions, and the way they are inferred statistically, are central to the interpretation of such imaging experiments.

This means that many studies, in particular in the early days of neuroimaging, were solely focused on this kind of question, to localize the brain regions responding to particular conditions. This typically results in brain images with lots of beautiful, colorful blobs being superimposed on an anatomical brain scan. For this reason this approach is often referred to by the somewhat derogatory term “blobology” and it is implicitly (or sometimes explicitly) likened to phrenology because “it really doesn’t show anything about how the brain works”. I think this view is wrong, although it is certainly correct that localizing brain regions responding to particular conditions on its own cannot really explain how the brain works. This is in itself an interesting discussion but it is outside the scope of this post. Critically for the topic of replication, however, we should of course expect these blob localizations to be reproducible when repeating the experiments in the same as well as different subjects if we want to claim that these blobs convey any meaningful information about how the brain is organized. So how do such experiments hold up?

Some early experiments showed that different brain regions in human ventral cortex responded preferentially to images of faces and houses. Thus these brain regions were named – quite descriptively – fusiform face area and parahippocampal place area. Similarly, the middle temporal complex responds preferentially to moving relative to static stimuli, the lateral occipital complex responds more to intact, coherent objects or textures than to scrambled, incoherent images, and there have even been reports of areas responding preferentially to images of bodies or body parts and even to letters and words.

The nature of neuroimaging, both in terms of experimental design and analysis but also simply the inter-individual variability in brain morphology, means that there is some degree of variance in the localization of these regions when comparing the results across a group of subjects or many experiments. In spite of this, the existence of these brain regions and their general anatomical location is by now very well established. There have been great debates regarding what implications this pattern of brain responses has and whether there could be alternative, possibly more trivial, factors causing an area to respond. This is however merely the natural scrutiny and discussion that should accompany any scientific claims. In any case, regardless of what the existence of these brain regions may mean, there can be little doubt that these findings are highly replicable.

Experiments that aim to address the actual function of these brain regions are in fact perfect examples of how replication and good science are closely entwined. What such experiments typically do is to first conduct a standard experiment, called a functional localizer, to identify the brain region showing a particular response pattern (say, that it responds preferentially to faces). The subsequent experiments then seek to address a new experimental question, such as whether the face-sensitive response can be explained more trivially by basic attributes of the images. It is a direct example of the type of experiment Feynman suggested in that the experimenter first replicates a previous finding and then tests what factors can influence this finding. These replications are at the very least conceptually based on previously published procedures – although in many cases they are direct replications because functional localizer procedures are often shared between labs.

This is not specific to neuroimaging and cognitive neuroscience. Similar arguments could no doubt be made about other findings in psychology, for example considering the Stroop effect, and I’m sure it applies to most other research areas. The point is this: none of these findings were replicated because people deliberately set out to test out the validity in a “direct replication” effort. There was no reproducibility project for retinotopic mapping, no “Many Labs” experiment to confirm the existence of the fusiform face area. These findings have become replicated simply because researchers included tests of these previous results in their experiments, either as sanity checks or because they were an essential prerequisite for addressing their main research question.

Who should we trust?

In my mind, replication should always occur through this natural process. I am deeply skeptical of concerted efforts to replicate findings simply for the sake of replication. Science should be free of dogma and not motivated by an agenda. All too often the calls to replicate a particular result seem to stem from a general disbelief in the original finding. Just look at the list of experiments on PsychFileDrawer that people want to see replicated. It is all fair and good to be skeptical of previous findings, especially those that are counter-intuitive, contradictory, underpowered, and/or seem “too good to be true”.

But it’s another thing to go from skepticism to actually setting out to disprove somebody else’s experiment. By all means, set out to disprove your own hypotheses. This is something we should all do more of and it guarantees better science. But trying to disprove other people’s findings smells a lot like crusading to me. While I am quite happy to give most of the people on PsychFileDrawer and in the reproducibility movement the benefit of the doubt that they are genuinely interested in the truth, whenever you approach a scientific question with a clear expectation in mind, you are treading on dangerous ground. It may not matter how cautious and meticulous you think you are in running your experiments. In the end you may just accumulate evidence to confirm your preconceived notions and this is not contributing much to advance scientific knowledge.

I also have a whole list of research findings of which I remain extremely skeptical. For example, I am wary of many claims about unconscious perceptual processing. I can certainly accept that there are simple perceptual phenomena, such as the tilt illusion, that can occur without conscious awareness of the stimuli because these processes are clearly so automatic that it is impossible to not experience them. In contrast, I find the idea that our brains process complex subliminal information, such as reading a sentence or segmenting a visual scene, pretty hard to swallow. I may of course be wrong but I am not quite ready to accept that conscious thought plays no role whatsoever in our lives, as some researchers seem to imply. In this context, I remain very skeptical of the notion that casually mentioning words that remind people of the elderly (like “Florida”) makes them walk more slowly or that showing a tiny American flag in the corner of the screen influences their voting behavior several months in the future. And like my host Sam (who has written far too much on this topic), I am extremely skeptical of claims of so-called psi effects, that is, precognition, telepathy, or “presentiment”. I feel that such findings are probably far more likely to be explained by more trivial explanations and that the authors of such studies are too happy to accept the improbable.

But is my skepticism a good reason for me to replicate these experiments? I don’t think so. It would be unwise to investigate effects that I trust so little. Regardless of how you control your motivation, you can never be truly sure that it isn’t affecting the quality of the experiment in some way. I don’t know what processes underlie precognition. In fact, since I don’t believe precognition exists it seems difficult to even speculate about the underlying processes. So I don’t know what factors to watch out for. Things that seem totally irrelevant to me may be extremely important. It is very easy to make subtle effects disappear by inadvertently increasing the noise in our measurements. The attitude of the researcher alone could influence how a research subject performs in an experiment. Researchers working on animals or in wet labs are typically well aware of this. Whether it is a sloppy experimental preparation or a disinterested experimenter training an animal on an experimental task, there are countless reasons why a perfectly valid experiment may fail. And if this can happen for actual effects, imagine how much worse it must be for effects that don’t exist in the first place!
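
To put a rough number on how easily extra noise can hide a genuine effect, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the function name `detection_rate`, the true effect of 0.5 (arbitrary units), 30 subjects per group, and the crude |t| > 2 criterion standing in for a proper significance test. It is not taken from any of the studies discussed here.

```python
import random
import statistics


def detection_rate(noise_sd, true_effect=0.5, n=30, n_sims=2000, seed=0):
    """Fraction of simulated two-group experiments in which a real effect
    of size `true_effect` passes a rough |t| > 2 criterion."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    hits = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, noise_sd) for _ in range(n)]
        treated = [rng.gauss(true_effect, noise_sd) for _ in range(n)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        # Standard error of the difference between two equally sized groups
        se = ((statistics.variance(control) + statistics.variance(treated)) / n) ** 0.5
        if abs(diff / se) > 2.0:  # approximate two-sided 5% threshold
            hits += 1
    return hits / n_sims


# The same true effect, measured with increasingly noisy procedures
for sd in (0.5, 1.0, 2.0):
    print(f"noise sd = {sd}: effect detected in {detection_rate(sd):.0%} of runs")
```

The effect is real in every simulated experiment, yet the fraction of runs that detect it falls steadily as the measurement noise grows. A replicator who unknowingly runs a noisier version of the procedure can therefore miss an effect that exists.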

As long as the Devil’s Neuroscientist has her fingers in Sam’s mind, he won’t attempt replicating ganzfeld studies or other “psi” experiments no matter how much he wants to see them replicated. See? I’m the sane one of the two of us!

Even though in theory we should judge original findings and attempts at replication by the same standard, I don’t think this is really what is happening in practice – because it can’t happen if replication is conducted in this way. Jason Mitchell seems to allude to this also in a much-maligned commentary he published about this topic recently. There is an asymmetry inherent to replication attempts. It’s true, researcher degrees of freedom, questionable research practices and general publication bias can produce false positives in the literature. But still, short of outright fraud it is much easier to fail to replicate something than it is to produce a convincing false positive. And yet, publish a failed replication of some “controversial” or counter-intuitive finding, and enjoy the immediate approving nods and often rather unveiled shoulder slapping within the ranks of the Jihadists of Scientific Veracity.

Instead what I would like to see more of is scientists building natural replications into their own experiments. Imagine, for instance, that someone published the discovery of yet another brain area responding selectively to a particular visual stimulus, images of apples, but not others like tools or houses. The authors call this the occipital apple area. Rather than conducting a basic replication study repeating the methods of the original experiment step-by-step, you should instead seek to better understand this finding. At this point it of course remains very possible that this finding is completely spurious and that there simply isn’t such a thing as an occipital apple area. But the best way to reveal this is to test alternative explanations. For example, the activation in this brain region could be related to a more basic attribute of the images, such as the fact that the stimuli were round. Alternatively, it could be that this region responds much more generally to images of food. All of these ideas are straightforward hypotheses that make testable predictions. Crucially though, all of these experiments also require replication of the original result in order to confirm that the anatomical location of the region processing stimulus roundness or general foodstuffs actually corresponds to the occipital apple area reported by the original study.

Here are two possible outcomes of this experiment: in the first example you observe that when using the same methods as the original study you get a robust response to apples in this brain region. However, you also show that a whole range of other non-apple images evoke strong fMRI responses in this brain region provided the depicted objects are round. Responses do not show any systematic relationship with whether the images are of food. The region also responds to basketballs, faces, and snow globes but not to bananas or chocolate bars. Thus it appears that you have confirmed the first hypothesis, that the occipital apple area is actually an occipital roundness area. This may still not be the whole story but it is fairly clear evidence that this area doesn’t particularly care about apples.

Now compare this to the second situation: here you don’t observe any responses in this brain region to any of the stimuli, including the very same apples from the original experiment. What does this result teach us? Not very much. You’ve failed to replicate the original finding but any failure to replicate could very likely result from any number of factors we don’t as yet understand. As Sam (and thus also I) recently learned in a talk in which Ap Dijksterhuis discussed a recent failure to replicate one of his social priming experiments, psychologists apparently call such factors “unknown moderators”.

Ap Dijksterhuis whilst debating with David Shanks at UCL

I find the explanation of failed replications by unknown moderators somewhat dissatisfying but of course such factors must exist. They are the unexplained variance I discussed in my previous posts. But of course there is a simpler explanation: that the purported effect simply doesn’t exist. The notion of unknown moderators is based on the underlying concept driving all science, that is, the idea that the universe we inhabit is governed by certain rules so that conducting an experiment with suitably tight control of the parameters will produce consistent results. So if you talk about “moderators” you should perform experiments testing the existence of such moderating factors. Unless you have evidence for a moderator, any talk about unknown moderators is just empty waffle.

Critically, however, this works both ways. As long as you can’t provide conclusive evidence as to why you failed to replicate, your wonderful replication experiment tells us sadly little about the truth behind the occipital apple area. Comparing the two hypothetical examples, I would trust the findings of the first example far more than the latter. Even if you and others repeat the apple imaging experiment half a dozen times and your statistics are very robust, it remains difficult to rule out that these repeated failures are due to something seemingly-trivial-but-essential that you overlooked. I believe this is what Jason Mitchell was trying to say when he wrote that “unsuccessful experiments have no meaningful scientific value.”

Mind you, I think Mitchell’s commentary is also wrong about a great many things. Like other brave men and women standing up to the Crusaders (I suppose that makes them the “Heathens of Psychology Research?”) he implies that a lot of replications fail because they are conducted by researchers with little or no expertise in the research they seek to replicate. He also suggests that there are many procedural details that aren’t reported in methods sections of scientific publications, such as the fact that participants in fMRI experiments are instructed not to move. This prompted Sam’s colleague Micah Allen to create the Twitter hashtag #methodswedontreport. There is a lot of truth to the fact that some methodological details are just assumed common knowledge. However, the hashtag quickly deteriorated into a comedy vent because – you know – it’s Twitter.

In any case, while it is true that a certain level of competence is necessary to conduct a valid replication, I don’t think Mitchell’s argument holds water here. To categorically accuse replicators of incompetence simply because they have a different area of expertise is a logical fallacy. I’ve heard these arguments so many times. Whether it is about Bem’s precognition effects, Bargh’s elderly priming, or whatever other big finding people failed to replicate, the argument is often made that such effects are subtle and only occur under certain specific circumstances that only the “expert” seems to know about. My alter ego, Sam, has faced similar criticisms when he published a commentary about a parapsychology article. In some people’s eyes you can’t even voice a critical opinion about an experiment, let alone try to replicate it, if you haven’t done such experiments before. Doesn’t anybody perceive something of a catch-22 here?

Let’s be honest here. Scientists aren’t wizards and our labs aren’t ivory towers. The reason we publish our scientific findings is (apart from building a reputation and hopefully reaping millions of grant dollars) that we must communicate them to the world. Perhaps in previous centuries some scientists could sit in seclusion and tinker happily without anyone ever finding out the great truths about the universe they discovered. In this day and age this approach won’t get you very far. And the fact that we actually know about the great scientists of Antiquity and the Renaissance shows that even those scientists disseminated their findings to the wider public. And so they should. Science should benefit all of humanity and at least in modern times it is also often paid for by the public. There may be ills with our “publish or perish” culture but it certainly has this going for it: you can’t just do science on your own simply to satisfy your own curiosity.

The best thing about Birmingham is that they locked up all the psychology researchers in an Ivory Tower #FoxNewsFacts

Part of science communication is to publish detailed descriptions of the methods we use. It is true that some details that should be known to anyone doing such experiments may not be reported. However, if your method section is so sparse in information that it doesn’t permit a proper replication by anyone with a reasonable level of expertise, then it is not good enough! It is true, I’m no particle physicist and I would most likely fail miserably if I tried to replicate some findings from the Large Hadron Collider – but I sure as Hell should be expected to do reasonably well at replicating a finding from my own general field even if I have never done this particular experiment before.

I don’t deny that a lack of expertise with a particular subject may result in some incompetence and some avoidable mistakes – but the burden of proof for this lies not with the replicator but with the person asserting that incompetence is the problem in the first place. By all means, if I am doing something wrong in my replication, tell me what it is and show me that it matters. If you can’t, your unreported methods or unknown moderators are completely worthless.

What can we do?

As I have said many times before, replication is essential. But as I have tried to argue here, I believe we should not replicate for replication’s sake. Replication of previous findings should be part of making new discoveries. And when that fails the onus should be on you to find out why. Perhaps the previous result was a false positive but if so you aren’t going to prove it with one or even a whole series of failed replications. You can however support the case by showing the factors that influence the results and testing alternative explanations. One failed replication of social priming effects caused a tremendous amount of discussion – and considerable ridicule for the original author because of the way in which he responded to his critics. For the Crusaders this whole affair seems to be a perfect example of the problems with science. However, ironically, it is actually one of the best examples of how a failed replication should look. The authors failed to replicate the original finding but they also tested a specific alternative explanation: that the priming effect was not caused by the stimuli to which the subjects were exposed but by the experimenters’ expectations. Now I don’t know if this study is any more true than the original one. I don’t know if these people simply failed to replicate because they were incompetent. They were clearly competent enough to find an effect in another experimental condition so it’s unlikely to just be that.

Contrast this case with the disagreement between Ap Dijksterhuis and David Shanks. On the one hand, you have nine experiments failing to replicate the original result. On the other hand, during the debate I witnessed with Sam’s eyes, you have Dijksterhuis talking about unknown moderators and wondering about whether Shanks’ experiments were done inside cubicles (they were, in case you were wondering – another case of #methodswedontreport). Is this the “expertise” and “competence” we need to replicate social priming experiments? The Devil is not convinced. But either way, I don’t think this discussion really tells us anything.

David Shanks as he regards studies claiming subconscious priming effects (No, he doesn’t actually look like that)

So we should make replication part of all of our research. When you are skeptical of a finding, test alternative hypotheses about it. And more generally, always retain healthy skepticism. By definition any new finding will have been replicated less often than old findings that have made their way into the established body of knowledge. So when it comes to newly published results we should always reserve judgment and wait for them to be repeated. That doesn’t mean we can’t get excited about new surprising findings – in fact, being excited is a very good reason for people to want to replicate a result.

There are of course safeguards we can take to maximize the probability that new findings are solid. The authors of any study must employ appropriate statistical procedures and interrogate their data from various angles, using different analysis approaches, and employing a range of control experiments in order to ascertain how robust the results are. While there is a limit on how much of that we can expect any original study to do, there is certainly a minimum of rigorous testing that any study should fulfill. It is the job of peer reviewers, post-publication commenters, and of course also the researchers themselves to think of these tests. The decision of how much evidence suffices for an original finding can be made in correspondence with journal editors. These decisions will sometimes be wrong but that’s life. More importantly, regardless of a study’s truthiness, it should never be regarded as validated until it has enjoyed repeated replication by multiple studies from multiple labs. I know I have said it before but I will keep saying it until the message finally gets through: science is a slow, gradual process. Science can approach truths about the universe but it doesn’t really ever have all the answers. It can come tantalizingly close but there will always be things that elude our understanding. We must have patience.

The discovery of universal truths by scientific research moves at a much slower pace than this creature from the fast lane.

Another thing that could very well improve the state of our field is something that Sam has long argued for and about which I actually agree with him (I told you scientists disagree with one another all the time and this even includes those scientists with multiple personalities). There ought to be a better way to quantify how often and how robustly any given finding has been replicated. I envision a system similar to Google Scholar or PubMed in which we can search for a particular result, say, a brain-behavior correlation, the location of the fusiform face area, the existence of arsenic-based life forms, or precognition in psychology experiments. The system then not only finds the original publication but displays a tree structure linking the original finding to all of the replication attempts, showing whether they were successful and to what extent they were direct or conceptual replications. A more sophisticated system could allow direct calculation of meta-analytical parameter estimates, for example to narrow down the stereotactic coordinates of the reported brain area or the effect size of a brain-behavior correlation.

Setting up such a system will certainly require a fair amount of meta-information and a database that permits the complex links between findings. I realize that this represents a fair amount of effort but once this platform is up and running and has become part of our natural process the additional effort will probably be barely noticeable. It may also be possible to automate many aspects of building this database.
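
For the programmers among you, here is a very rough sketch of what the core of such a replication tree might look like. This is only one possible design under my own assumptions: the class `Finding`, its fields, and the helpers `pooled_effect` and `success_rate` are all hypothetical names, the pooling is a plain fixed-effect inverse-variance average (a real system would probably want something more sophisticated), and the numbers for the imaginary occipital apple area are invented purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Finding:
    """One study: either an original finding or an attempt to replicate it."""
    label: str
    effect_size: float                 # reported effect size (any common metric)
    variance: float                    # sampling variance of that estimate
    kind: str = "original"             # "original", "direct", or "conceptual"
    replicated: Optional[bool] = None  # outcome of the attempt; None for originals
    replications: List["Finding"] = field(default_factory=list)

    def add_replication(self, study: "Finding") -> "Finding":
        """Attach a replication attempt as a child of this finding."""
        self.replications.append(study)
        return study

    def walk(self) -> Iterator["Finding"]:
        """Yield this finding and every study below it in the tree."""
        yield self
        for child in self.replications:
            yield from child.walk()


def pooled_effect(root: Finding) -> float:
    """Fixed-effect (inverse-variance weighted) mean over the whole tree."""
    studies = list(root.walk())
    weights = [1.0 / s.variance for s in studies]
    weighted_sum = sum(w * s.effect_size for w, s in zip(weights, studies))
    return weighted_sum / sum(weights)


def success_rate(root: Finding) -> Optional[float]:
    """Share of replication attempts marked successful; None if there are none."""
    attempts = [s for s in root.walk() if s.kind != "original"]
    if not attempts:
        return None
    return sum(1 for s in attempts if s.replicated) / len(attempts)


# Invented numbers for the imaginary occipital apple area
apple = Finding("occipital apple area (original)", effect_size=0.8, variance=0.04)
apple.add_replication(Finding("lab B, same stimuli", 0.4, 0.02, "direct", True))
apple.add_replication(Finding("lab C, round objects", 0.0, 0.04, "conceptual", False))

print("pooled effect:", pooled_effect(apple))
print("success rate:", success_rate(apple))
```

Because replications can themselves have replications, the same structure covers chains of follow-up studies, and keeping the direct versus conceptual label on every node means the two kinds can later be summarized separately.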

Last but not least, we should remember that science should seek to understand how the world works. It should not be about personal vendettas and attachment or opposition to particular theories. I think it would benefit both the defending Heathens and the assaulting Crusaders to consider that more. Science should seek explanations. Trying to replicate a finding, simply because it seems suspect or unbelievable to you, is not science but more akin to clay pigeon shooting. Instead of furthering our understanding we just remain at square one. But to the Heathens I say this: if someone says that your theory is wrong or incomplete, or that your result fails to replicate, they aren’t attacking you. We are allowed to be wrong occasionally – as I said before, science is always wrong about something. Science should be about evidence and disagreement, debate, and continuous overturning of previously held ideas. The single best weapon in the fight against the tedious Crusaders is this:

The fiercest critic of your own research should be you.