The Devil’s Children

There has been a new development. While I have been permanently banned from the plane of mortals, a new blog has arisen: The Neuroscience Devils.

This blog will try to fulfill the same function this one had but it will be driven by the community. Everyone is welcome to post anonymously and rage against the Crusaders and the prevailing zeitgeist. Check it out if you’re interested in contributing!

Perhaps there is hope for us yet…

Abandon all hope

Well, that’s it. Sam’s patience has worn paper-thin and he’s an insomniac at the best of times (this may come with the territory of being possessed by demons like me) – so he’s taking back control for good now.

Running this blog was fun while it lasted and I got some good discussions out of it. I hope some of you enjoyed reading these posts and that at least some of it was thought-provoking. I wish there were more time but it ain’t gonna happen. Sam has no desire to delve deeper into debating straw-man arguments or being accused of being a fraudster, a troll, and/or a sexist. The Devil’s Neuroscientist was none of these things; she merely tried to argue an opposing viewpoint. I know she was snarky and abrasive at times but I don’t believe she was ever truly offensive.

It has been baffling to me just how hard it seems for many people to conceive of the possibility that I might be able to argue views that Sam doesn’t actually hold. It is common practice in debating clubs and courts of law the world over. The Devil’s Advocate was the inspiration for my unholy presence. But somehow this seems impossible to understand. I find that illuminating. It was an exciting experiment but all bad things must come to an end. Maybe I should have preregistered it?

I suppose this means the Crusaders for True Science won. They were always strong in numbers and their might has grown immensely in the past few years. Sam will be at that debate on Tuesday but I don’t believe he is strong enough, neither in wit nor in conviction, to hold their forces at bay.

For what it’s worth, I hope that the world of science after this holy war will be a good one. I fear that it will be one where journal editors (and their reviewer lackeys) use Registered Reports to decide what kind of science people are allowed to do. I fear it will be one where nobody dares to publish novel, creative research for fear of being hounded to death by the replicating hordes who always manage to turn any bright gem into a dull null result. I fear that instead of “fixing science” we will catastrophically break it. In the end we’ll be left with an apocalyptic wasteland after our descendants decide to use the leftover fruits of our knowledge to annihilate our species. Who knows, maybe it’s for the best. I’ll see y’all in my humble abode because

“The road to Hell is paved with good intentions.”

Is Science broken?

Next Tuesday, St Patrick’s Day 2015, UCL Experimental Psychology will organize an event called “Is Science broken?”. It will start with a talk by Chris Chambers of Cardiff University about registered reports. Chris has been a vocal proponent of preregistration, making him one of the generals of the Crusade for True Science, and registered reports are the strictest of the preregistration proposals to date: they involve peer review of the introduction and methods sections of a study before data collection commences.

Chris’ talk will be followed by an open discussion by a panel comprising a frightening list of highly esteemed cognitive neuroscientists: in addition to Chris, there will be David Shanks (who’ll act as chair), Sophie Scott, Dorothy Bishop, Neuroskeptic … and – for some puzzling reason – my alter ego Sam Schwarzkopf. I am sure that much of the debate will focus on preregistration although I hope that there will also be a wider discussion of the actual question posed by the event:

“Is Science broken?” – Attentive readers of my blog will probably guess that my answer to this question is a clear No. In fact, I would question whether the question even makes any sense. Science cannot be broken. Science is just a method – the best method there is – to understand the Cosmos. It is inherently self-correcting, no matter how much the Crusaders like to rail against this notion. Science is self-correction. To say that science is broken is essentially to say that science cannot converge on the truth. If that were true, we should all just pack up and go home.

Now, we’ve been over this and I will probably write about this again in the future. I’ll spare you another treatise on self-correction and model fitting analogies. What the organizers of the event mean in reality is that the human endeavor of scientific research is somehow broken. But is that even true?

Sam is the one who’ll be attending that panel discussion, not me. I don’t know if he’ll let me out of my cage to say anything. We do not always see eye to eye as he has the annoying habit of always trying to see things from other people’s perspectives… I think though that fundamentally he agrees with me on the nature of the scientific method and I hope he will manage to bring this point across. Either way, here are some questions I think he should raise:

I. The Crusaders claim that a majority of published research findings are false positives. So far nobody has given a satisfying answer to the question of how many false positives we should tolerate. 5%? 1%? Or zero? Is that realistic?

II. The Crusader types often complain about high-impact journals like Nature and Science because they value “novelty” and high-risk findings. This supposedly hurts the scientific ideal of seeking the truth. But is that true? Don’t we need surprising, novel, paradigm-shifting findings to move science forward? Over the years of watching Sam do science I have developed a strong degree of cynicism. The scientific truth is never as simple and straightforward as the narrative presented in research papers – including those in most low-impact journals. Science publishing should be about communicating results and hypotheses even if they are wrong – because in the long run most inevitably will be.

III. We can expect there to be at least some discussion of preregistration at the debate. This will be interesting because Sophie has been one of the few outspoken critics of this idea. I have of course also written about this before and it’s a shame that I can’t participate directly in this debate. We would make one awesome dynamic duo of science ladies raging against preregistration! However much Sam will let me say about this, and putting aside all the naive debate about the potential benefits and problems of this practice, the thing that has bothered me most about it is that we have no compelling evidence for or against preregistration. What is more, there also seems to be little consensus on how we can even establish whether preregistration works. To me these are fundamental concerns that we need to address now, before this idea really takes off. Isn’t it ironic that, given the topic of this debate, nobody seems to talk seriously about these points? Shouldn’t scientists value evidence above all else? Shouldn’t proponents of preregistration be expected to preregister the design of their revolutionary experiment?

IV. In addition to these rather lofty concerns, I also have a pragmatic question. Is the Crusaders’ behavior responsible? Even assuming that people do not engage in outright questionable research practices (and I remain unconvinced that they are as widespread as the Crusaders say), I think you can almost guarantee that most preregistered research findings will be lackluster compared to those published in the traditional model. In a world where the traditional model dominates, won’t that harm the junior researchers who are coerced into following the rigid model? A quote I heard through the grapevine was “My supervisor wants me to preregister all my experiments and it’s destroying my career”. How do we address this problem?

As I said, Sam might not let me say these things and he doesn’t agree with me on many accounts. However, I hope there will be some debate around these questions. I will try to report back soon although Sam is demanding a lot of time for himself these days… Stay tuned.

Confusions of Grandeur

In response to my previous post, a young (my age judgment may be way off, I’m just going by the creative use of punctuation marks) mind by the name of ‘confused’ left a very insightful comment. I was in the process of writing up my response when I realized that the reply would work far better as another post because it explains my thinking very well, perhaps better than all the verbal diarrhea I have produced over the past few months.

For the sake of context, I will post the whole comment. First, the paragraph from my previous post that this person/demon/interdimensional entity commented on:

Devil’s Neuroscientist: “There is no ‘maybe’ about it. As I have argued in several of my posts, it is inherent to the scientific process. Science may get stuck in local minima and it may look like a random walk before converging on the truth – but given sufficient time and resources science will self-correct.”

And their response:

confused: “I have a huge problem with these statements. Who says science will self-correct?!?! Maybe certain false-positive findings will be left alone and no-one will investigate them any further. At that point you have an incorrect scientific record.

Also, saying “given sufficient time and resources science will self-correct” is a statement that is very easy to use to wipe all problems with current day science under the rug: nothing to see here, move along, move along… We scientists know what’s best, don’t you worry your pretty little head about it…”

I will address this whole comment here point by point.

“I have a huge problem with these statements. Who says science will self-correct?!?!”

I’ve answered this question many times before. Briefly, I’ve likened science to a model-fitting algorithm. Algorithms may take a long time to converge on a sensible solution. In fact, they may even get stuck completely. In that situation, all that can help is to give it a push to a more informed place to search for the solution. This push may come from novel technologies providing new and/or better knowledge, or it may simply come from the mind of a researcher who dares to think outside the box. The history of science is literally full of examples of both.
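To make the analogy concrete, here is a toy sketch in Python (my own little illustration with a made-up loss function – nothing Sam actually runs in his lab). Plain gradient descent started in the wrong place settles into a shallow basin and stays there; a handful of restarts – the “push” I was talking about – finds the deeper one.

```python
def loss(x):
    # two basins: a shallow local minimum near x = -1, a deeper one near x = 2
    return 0.5 * (x + 1) ** 2 * (x - 2) ** 2 - 0.3 * x

def gradient(x, eps=1e-6):
    # numerical derivative of the loss
    return (loss(x + eps) - loss(x - eps)) / (2 * eps)

def descend(x0, lr=0.01, steps=5000):
    # plain gradient descent: follow the local slope downhill
    x = x0
    for _ in range(steps):
        x -= lr * gradient(x)
    return x

stuck = descend(-1.5)                       # settles in the shallow basin and stays there
pushed = min((descend(x0) for x0 in range(-4, 5)), key=loss)   # restarts act as the "push"
print(f"stuck at x = {stuck:.2f} (loss {loss(stuck):.2f}); "
      f"after restarts x = {pushed:.2f} (loss {loss(pushed):.2f})")
```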

This is what “inherent to the process” means. It is the sole function of science to self-correct because the whole point of science is to improve our understanding of the world. Yes, it may take a long time. But as long as the scientific spirit drives inquisitive minds to understand, it will happen eventually – provided we don’t get obliterated by an asteroid impact, a hypernova, or – what would be infinitely worse – by our own stupidity.

“Maybe certain false-positive findings will be left alone and no-one will investigate them any further.”

Undoubtedly this is the case. But is this a problem? First of all, I am not sure why I should care about findings that are not investigated any further. I don’t know about you but to me this sounds like nobody else cares about them either. This may be because everybody feels like they are spurious or perhaps because they simply ain’t very important.

However, let me indulge you for a moment and assume that somebody actually does care about the finding, possibly someone who is not a scientist. In the worst possible case, they could be a politician. By all that is sacred, someone should look into it then and find out what’s going on! But in order to do so, you need to have a good theory, or at least a viable alternative hypothesis, not the null. If you are convinced something isn’t true, show me why. It does not suffice to herald each direct non-replication as evidence that the original finding was a false positive because in reality these kinds of discussions are like this.

“At that point you have an incorrect scientific record.”

Honestly, this statement summarizes just about everything that is wrong with the Crusade for True Science. The problem is not that there may be mistakes in the scientific record but the megalomaniac delusion that there is such a thing as a “correct” scientific record. Science is always wrong. It’s inherent to the process to be wrong and to gradually self-correct.

As I said above, the scientific record is full of false positives because this is how it works. Fortunately, I think the vast majority of false positives in the record are completely benign. They will either be corrected or they will pass into oblivion. The false theories that I worry about are the ones that most sane scientists already reject anyway: creationism, climate change denial, the anti-vaccine movement, Susan Greenfield’s ideas about the modern world, or (to stay with present events) the notion that you can “walk off your Parkinson’s.” Ideas like these are extremely dangerous and they have true potential to steer public policy in a very bad direction.

In contrast, I don’t really care very much whether priming somebody with the concept of a professor makes them perform better at answering trivia questions. I personally doubt it and I suspect simpler explanations (including that it could be completely spurious) but the way to prove that is to show that the original result could not have occurred, not to show that you are incapable of reproducing it. If that sounds a lot more difficult than churning out one failed replication after another, then that’s because it is!

“Also, saying “given sufficient time and resources science will self-correct” is a statement that is very easy to use to wipe all problems with current day science under the rug: nothing to see here, move along, move along… “

Nothing is being swept under any rugs here. For one thing, I remain unconvinced by the so-called evidence that current-day science has a massive problem. The Schoens and Stapels don’t count. There have always been scientific frauds and we really shouldn’t even be talking about the fraudsters. So, ahm, sorry for bringing them up.

The real issue that has all the Crusaders riled up so much is that the current situation apparently generates a far greater proportion of false positives than is necessary. There is a nugget of truth to this notion but I think the anxiety is misplaced. I am all in favor of measures to reduce the prevalence of false positives through better statistical and experimental practices. More importantly, we should reward good science rather than sensational science.

This is why the Crusaders promote preregistration – however, I don’t think this is going to help. It is only ever going to cure the symptom but not the cause of the problem. The underlying cause, the actual sickness that has infected modern science, is the misguided idea that hypothesis-driven research is somehow better than exploratory science. And sadly, this sickness plagues the Crusaders more than anyone. Instead of preregistration, which – despite all the protestations to the contrary – implicitly places greater value on “purely confirmatory research” than on exploratory science, what we should do is reward good exploration. If we did that, instead of insisting on clear hypotheses in grant proposals, “anticipated” results in our introduction sections, and preregistered methods, and if we were also more honest about the fact that scientific findings and hypotheses are usually never really fully true and did a better job communicating this to the public, then current-day science probably wouldn’t have any of these problems.

“We scientists know what’s best, don’t you worry your pretty little head about it…”

Who’s saying this? The whole point I have been arguing is that scientists don’t know what’s best. What I find so exhilarating about being a scientist is that this is a profession, quite possibly the only profession, in which you can be completely honest about the fact that you don’t really know anything. We are not in the business of knowing but of asking better questions.

Please do worry your pretty little head! That’s another great thing about being a scientist. We don’t live in ivory towers. Given the opportunity, anyone can be a scientist. I might take your opinion on quantum mechanics more seriously if you have the education and expertise to back it up, but in the end that is a prior. A spark of genius can come from anywhere.

What should we do?

If you have a doubt about some reported finding, go and ask questions about it. Think about alternative, simpler explanations for it. Design and conduct experiments to test these explanations. Then report your results to the world and discuss the merits and flaws of your studies. Refine your ideas and designs and repeat the process over and over. In the end there will be a body of evidence. It will either convince you that your doubt was right or it won’t. More importantly, it may also be seen by many others and they can form their own opinions. They might come up with their own theories and with experiments to test them.

Doesn’t this sound like a perfect solution to our problems? If only there were a name for this process…

In the words of the great poet and philosopher Bimt Lizkip, failed direct replications in psychology research are just “He Said She Said Bulls**t”

How science works

My previous post discussed the myths surrounding the “replication crisis” in psychology/neuroscience research. As usual, it became way too long and I didn’t even cover several additional points I wanted to mention. I will leave most of these for a later post in which I will speculate about why failed replications, papers about incorrect/questionable procedures, and other actions by the Holy Warriors for Research Truth cause so much bad blood. I will try to be quick in that one or split it up into parts. Before I can get around to this though, let me briefly (and I am really trying this time!) have a short intermission with practical examples of the largely theoretical and philosophical arguments I made in previous posts.

Science is self-correcting

I’ve said it before but it deserves saying again. Science self-corrects, no matter how much the Crusaders want to whine and claim that this is a myth. Self-correction isn’t always very fast. It can take decades or even centuries. When thoroughly incorrect theories take root, it can take some extraordinary effort to uproot them again. However, provided the pursuit of truth and free expression continues unhindered, and as long as research has sufficient resources and opportunities, a correction will eventually happen. In many ways self-correction is like evolution, climate change, or plate tectonics. You rarely see direct evidence for these slow changes by looking at single time points but the trends are clearly visible. Fortunately though, scientific self-correction is typically faster than these things.

Take the discovery last year of gravitational waves supporting the Big Bang theory of the origin of the universe. It was widely reported in the news media as an earth- (well, space-) shattering finding that would greatly advance our understanding of our place in the cosmos. It resulted in news articles, media reports, and endless social media shares of videos and articles about it, and in the Daily Mail immediately making a complete fool of itself, as usual, by revealing just how little it knows about the scientific process. It sparked some discussion about sexism in science journalism because initial reports ignored one of the female scientists who inspired the theory. (I am too lazy to provide any links to these things. If you’re on this blog I assume you’ve mastered the art of googling.)

However, in the months that followed, critical voices began to be heard suggesting that this discovery may have been a fluke. Instead of gravitational waves revealing the fabric of the universe itself, it appears that these measurements were contaminated by signals stemming from cosmic dust. Just the other day, Nature published a story suggesting that this alternative, simpler explanation now looks to be correct. I am not the Devil’s Astrophysicist so I can’t talk about the specifics of this research, but it certainly sounds compelling from my laywoman’s perspective: even without in-depth understanding of the research I can tell that the evidence suggests a simpler explanation for the initial findings, and by Occam’s Razor alone I would be encouraged to accept this one as more probable.

They apparently also have orientation columns in space

I don’t see any problem with this and I am pretty sure neither do most scientists. This is just self-correction in action. If anything about this story has been problematic, it is the fact that the initial discovery was so widely publicized before it was validated. I don’t really know where I stand on this. On the one hand, this has been unfortunate because it could greatly confuse public opinion about this research. The same applies to the now “debunked” finding of arsenic-based life forms reported a few years ago. A spectacular, possibly paradigm-shifting discovery was reported widely only to be disproved at a later point. I can see that this sort of perceived instability in scientific discoveries could undermine the public’s trust in science, with potentially devastating consequences. In a world where climate change deniers, “anti-vaxxers”, and creationists are given an undeservedly strong platform to spread their views (no linking to those either, for that very reason), we need to be very careful how science is communicated. A lack of faith in scientific research could do a lot of harm.

However, in my mind this doesn’t mean we should suppress media reports of new, possibly not-yet-validated discoveries. For one thing, in a free society we can’t forbid people from talking about their research. I also believe that science should be a leading example of democratic ideals and transparency because good science is based on them. Science flourishes not under strong regulation and rigid structures but when authority is questioned.

“[Science] connects us with our origins, and it too has its rituals and its commandments. Its only sacred truth is that there are no sacred truths. All assumptions must be critically examined. Arguments from authority are worthless.”

Carl Sagan – Cosmos, Episode 13 “Who speaks for Earth?”

More importantly, I believe as scientists we actually want to communicate the enthusiasm and joy we experience when we reveal secrets of the world – even if they are untrue. Rather than curtailing when and how we communicate our discoveries to the public, we should do a better job at communicating how science works in practice. By all means, we should improve our press releases to minimize the sweeping generalizations and unfounded speculations by the news media about the implications of a new finding. When communicating sensational results to the public, we should certainly convey our excitement. However, we should also always remind them (and ourselves) that any new discovery is never the last word on an issue but the first.

In my view, helping the public understand that scientific knowledge is ever evolving and that “boffins” didn’t just “prove” things will in fact strengthen the public’s trust in science. The reason I, as a neuroscientist without any deeper understanding of atmospheric physics, believe that man-made climate change is a problem is that an overwhelming majority of climate scientists believe the evidence supports this hypothesis. I know what scientists are like. They disagree all the damn time about the smallest issues. If there is an overwhelming consensus on any point, it is past time we listened!

The Psychoplication Crisis

After having talked about gravitational waves and arsenic bacteria, you can bet some smartass will come out of the woodwork and tell you that these are the “hard” natural sciences and that the same does not apply to “soft” sciences like psychology or cognitive neuroscience. Within psychology/cognitive neuroscience the same snobbery exists between experimental psychologists and social cognition researchers. Perhaps that is somewhat understandable as some of the first perception researchers were in fact physicists (this is the origin of the term psychophysics). However, I think this sort of segregation is delusional, and the self-satisfied belief that your own research area is somehow a “harder” science can be quite dangerous because it makes it all the easier to fool yourself into thinking that your own research is not susceptible to these problems.

I don’t think the problems that are currently discussed in psychology and neuroscience are in any way specific to or even particularly pervasive in our field. Some concepts and theories are simply more established in physics than they are in psychology. We certainly don’t need to do any experiments to prove gravity or even to confirm many of the laws that seem to govern it. The same cannot be said about social psychology, where people still question whether things like unconscious priming of complex behaviors exist in the first place. But sensational findings being hyped up by media reports only to be disproved in spectacular failures to replicate is evidently just as common in physics and microbiology. And even shocking revelations of scientific fraud have happened in those fields (see also here for a less clear-cut case – I’ll probably discuss this more next time).

One way in which the culture in physics research seems to differ from our field is that it is far more common to upload manuscripts to public repositories prior to peer review. This allows a wider, more public discussion of the purported discoveries, which is certainly good for the scientific process. However, it also results in broader media coverage before the research has even passed the most basic peer review stage, and this can in fact exacerbate the problems with unverified findings. The still rare cases of this approach in psychology research suffer from the same problem. Just look at Daryl Bem’s meta-analysis of “precognition” experiments, which is available on such a repository even though it has not been formally published by a peer-reviewed journal. Unlike his earlier findings, which were also widely discussed before they were published officially, this analysis hasn’t been picked up by many news outlets. However, it easily could have been, just as was the case for the earlier examples from physics and microbiology research. Another famous parapsychologist also posted it on his blog as evidence that “the critics need to rethink their position” – a somewhat premature conclusion if you ask me.

In theory, I don’t see a problem with having new, un-reviewed manuscripts publicly available provided that the surrounding peer discussion is also visible. The problem is that this is not always the case. Certainly the news coverage of the peer discussion usually doesn’t match the media circus surrounding the original finding. Taking Bem’s meta-analysis as an example, it obviously has been under closed peer review for at least the better part of a year now. I would love to actually see this peer review discussion as it is developing but – with one exception (to my knowledge) – so far this review has been for the eyes of the peer reviewers only.

The Myths about Replication

I have talked about replication a lot in my previous posts and why I believe it is central to healthy science. Unfortunately, a lot of myths surround replication and how it should be done. The most common assertion you will hear amongst my colleagues about replication is that “we should be doing more of it”. Now at some level I don’t really disagree with this of course. However, I think these kinds of statements betray a misunderstanding of what replication is and how science actually works in practice.

Replication is at the heart of science

Scientists attempt replication all the time. Most experiments, certainly all of the good ones, include replication of previous findings as part of their protocol. This is because it is essential to have a control condition or a “sanity check” on which to build future research. In his famous essay/lecture “Cargo Cult Science”, Richard Feynman decried an apparently widespread lack of understanding of this issue in the psychological sciences. Specifically, he describes how he advised a psychology student:

“One of the students told me she wanted to do an experiment that went something like this – it had been found by others that under certain circumstances, X, rats did something, A. She was curious as to whether, if she changed the circumstances to Y, they would still do A. So her proposal was to do the experiment under circumstances Y and see if they still did A. I explained to her that it was necessary first to repeat in her laboratory the experiment of the other person – to do it under condition X to see if she could also get result A, and then change to Y and see if A changed. Then she would know the real difference was the thing she thought she had under control. She was very delighted with this new idea, and went to her professor. And his reply was, no, you cannot do that, because the experiment has already been done and you would be wasting time.”

The reaction of this professor is certainly foolish. Every experiment we do should build on previous findings and reconfirm previous hypotheses before attempting to address any new questions. Of course, this was written decades ago. I don’t know if things have changed dramatically since then but I can assure my esteemed colleagues amongst the ranks of the Crusaders for True Science that this sort of replication is common in cognitive neuroscience.

Let me give you some examples. Much of my alter ego’s research employs a neuroimaging technique called retinotopic mapping. In these experiments subjects lie inside an MRI scanner whilst watching flickering images presented at various locations on a screen. By comparing which parts of the brain are active when particular positions in the visual field are stimulated with images, experimenters can construct a map of how a person’s field of view is represented in the brain.

We have known for a long time that such retinotopic maps exist in the human brain. It all started with Tatsuji Inouye, a Japanese doctor who studied soldiers with bullet wounds that had destroyed part of their cerebral cortex. He noticed that many patients experienced blindness at specific locations in their visual field. He managed to reconstruct the first retinotopic map by carefully plotting the correspondence between blind spots and the locations of bullet wounds in certain parts of the brain.

Early functional imaging device for retinotopic mapping

With the advent of neuroimaging, especially functional MRI, came the advance that we no longer need to shoot bullets into people’s heads to do retinotopic mapping. Instead we can generate these maps non-invasively within a few minutes of scan time. Moreover, unlike those earlier neuroanatomical studies, we can now map responses in brain areas where bullet wounds (or other damage) would not cause blindness. Even the earliest fMRI studies discovered additional brain areas that are organized retinotopically. More areas are being discovered all the time. Some areas we could expect to find based on electrophysiological experiments in monkeys. Others, however, appear to be unique to the human brain. As with all science, the definition of these areas is not always without controversy. However, at this stage only a fool would doubt the existence of retinotopic maps and that they can be revealed with fMRI.
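For the curious, here is a minimal sketch of how one common flavor of this analysis (phase-encoded or “traveling wave” mapping) works, using simulated data and made-up numbers – an illustration of the principle, not anyone’s actual analysis pipeline. The stimulus sweeps the visual field periodically, and each voxel’s preferred position can be read off as the phase of its response at the sweep frequency.

```python
import numpy as np

n_timepoints, n_cycles = 240, 8      # e.g. 240 scanner volumes, 8 stimulus sweeps
t = np.arange(n_timepoints)
true_phase = 1.3                     # this voxel "prefers" the visual-field position at phase 1.3 rad
voxel = np.cos(2 * np.pi * n_cycles * t / n_timepoints - true_phase)
voxel += np.random.default_rng(3).normal(0, 0.5, n_timepoints)   # measurement noise

# the Fourier component at the stimulus frequency carries the response phase,
# which maps back onto a position in the visual field
component = np.fft.fft(voxel)[n_cycles]
estimated_phase = (-np.angle(component)) % (2 * np.pi)
print(f"true phase: {true_phase:.2f} rad, estimated: {estimated_phase:.2f} rad")
```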

Not only that, but retinotopy is clearly a very stable feature of brain organization. Maps are pretty reliable on repeated testing, even when using a range of different stimuli for mapping. The location of maps is also quite consistent between individuals, so that if two people fixate the center of an analog clock, the number 3 will most likely drive neurons in the depth of the calcarine sulcus in the left cortical hemisphere to fire, while the number 12 will drive neurons in the lingual gyrus atop the lower lip of the calcarine. This consistency is so strong that anatomical patterns can be used to predict the locations and borders of retinotopic brain areas with high reliability.

Of course, a majority of people will probably accept that retinotopy is one of the most replicated findings in neuroimaging. Retinotopic mapping analysis is even fairly free of concerns about activation amplitudes and statistical thresholds. Most imaging studies are not so lucky. They aim to compare the neural activity evoked by different experimental conditions (e.g. images, behavioral tasks, mental states) and then localize those brain regions that show robust differences in responses. Thus the raw activation levels in response to various conditions, and the way they are inferred statistically, are central to the interpretation of such imaging experiments.
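As a toy illustration of what such a condition contrast boils down to (simulated numbers and hypothetical condition names – this is not any particular study’s analysis): compare two conditions voxel by voxel and keep whatever survives your chosen threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects, n_voxels = 20, 1000
faces  = rng.normal(0.0, 1.0, (n_subjects, n_voxels))   # per-subject response estimates
houses = rng.normal(0.0, 1.0, (n_subjects, n_voxels))
faces[:, :50] += 0.8                                     # a patch of truly face-preferring voxels

# voxel-wise paired t-test: faces vs houses across subjects
t, p = stats.ttest_rel(faces, houses, axis=0)

# the "blobs" are whatever survives the chosen threshold
uncorrected = np.sum(p < 0.05)
bonferroni  = np.sum(p < 0.05 / n_voxels)
print(f"{uncorrected} voxels at p < .05 uncorrected, {bonferroni} after Bonferroni correction")
```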

This dependence on activation levels means that many studies, in particular in the early days of neuroimaging, were focused solely on this kind of question: localizing the brain regions responding to particular conditions. This typically results in brain images with lots of beautiful, colorful blobs superimposed on an anatomical brain scan. For this reason this approach is often referred to by the somewhat derogatory term “blobology” and it is implicitly (or sometimes explicitly) likened to phrenology because “it really doesn’t show anything about how the brain works”. I think this view is wrong, although it is certainly correct that localizing brain regions responding to particular conditions on its own cannot really explain how the brain works. This is in itself an interesting discussion but it is outside the scope of this post. Critically for the topic of replication, however, we should of course expect these blob localizations to be reproducible when repeating the experiments in the same as well as in different subjects if we want to claim that these blobs convey any meaningful information about how the brain is organized. So how do such experiments hold up?

Some early experiments showed that different brain regions in human ventral cortex responded preferentially to images of faces and houses. Thus these brain regions were named – quite descriptively – fusiform face area and parahippocampal place area. Similarly, the middle temporal complex responds preferentially to moving relative to static stimuli, the lateral occipital complex responds more to intact, coherent objects or textures than to scrambled, incoherent images, and there have been reports of areas responding preferentially to images of bodies or body parts, and even to letters and words.

The nature of neuroimaging – in terms of both experimental design and analysis, but also simply the inter-individual variability in brain morphology – means that there is some degree of variance in the localization of these regions when comparing the results across a group of subjects or many experiments. In spite of this, the existence of these brain regions and their general anatomical location is by now very well established. There have been great debates regarding what implications this pattern of brain responses has and whether there could be alternative, possibly more trivial, factors causing an area to respond. This, however, is merely the natural scrutiny and discussion that should accompany any scientific claims. In any case, regardless of what the existence of these brain regions may mean, there can be little doubt that these findings are highly replicable.

Experiments that aim to address the actual function of these brain regions are in fact perfect examples of how replication and good science are closely entwined. What such experiments typically do is to first conduct a standard experiment, called a functional localizer, to identify the brain region showing a particular response pattern (say, that it responds preferentially to faces). The subsequent experiments then seek to address a new experimental question, such as whether the face-sensitive response can be explained more trivially by basic attributes of the images. It is a direct example of the type of experiment Feynman suggested in that the experimenter first replicates a previous finding and then tests what factors can influence this finding. These replications are at the very least conceptually based on previously published procedures – although in many cases they are direct replications because functional localizer procedures are often shared between labs.
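Schematically – and with entirely invented data and names, so treat this as a cartoon of the design rather than a recipe – such a study looks something like this: run 1 localizes the region, and run 2 both replicates the original preference on independent data and asks the new question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_voxels = 500
selective = np.zeros(n_voxels)
selective[:40] = 1.0                      # voxels that genuinely prefer faces

def run(strength):
    """Simulated face-minus-house response of every voxel in one scanning run."""
    return selective * strength + rng.normal(0, 1, n_voxels)

# Run 1: the functional localizer defines the region of interest (ROI)
roi = run(1.5) > 1.0                      # crude threshold, purely for illustration

# Run 2: independent data. Testing the new question automatically
# replicates the original face preference along the way.
faces_in_roi   = run(1.5)[roi]            # same contrast as the original study
matched_in_roi = run(0.0)[roi]            # e.g. low-level-matched control images
print("face preference replicates:",
      stats.ttest_1samp(faces_in_roi, 0.0).pvalue < 0.05)
print("preference survives the low-level control:",
      stats.ttest_rel(faces_in_roi, matched_in_roi).pvalue < 0.05)
```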

This is not specific to neuroimaging and cognitive neuroscience. Similar arguments could no doubt be made about other findings in psychology, for example the Stroop effect, and I’m sure it applies to most other research areas. The point is this: none of these findings were replicated because people deliberately set out to test their validity in a “direct replication” effort. There was no reproducibility project for retinotopic mapping, no “Many Labs” experiment to confirm the existence of the fusiform face area. These findings have been replicated simply because researchers included tests of these previous results in their experiments, either as sanity checks or because they were an essential prerequisite for addressing their main research question.

Who should we trust?

In my mind, replication should always occur through this natural process. I am deeply skeptical of concerted efforts to replicate findings simply for the sake of replication. Science should be free of dogma and not motivated by an agenda. All too often the calls for why some results should be replicated seem to stem from a general disbelief in the original finding. Just look at the list of experiments on PsychFileDrawer that people want to see replicated. It is all well and good to be skeptical of previous findings, especially the counter-intuitive, contradictory, underpowered, and/or those that seem “too good to be true”.

But it’s another thing to go from skepticism to actually setting out to disprove somebody else’s experiment. By all means, set out to disprove your own hypotheses. This is something we should all do more of and it guarantees better science. But trying to disprove other people’s findings smells a lot like crusading to me. While I am quite happy to give most of the people on PsychFileDrawer and in the reproducibility movement the benefit of the doubt that they are genuinely interested in the truth, whenever you approach a scientific question with a clear expectation in mind, you are treading on dangerous ground. It may not matter how cautious and meticulous you think you are in running your experiments. In the end you may just accumulate evidence to confirm your preconceived notions, and that does not contribute much to advancing scientific knowledge.

I also have a whole list of research findings of which I remain extremely skeptical. For example, I am wary of many claims about unconscious perceptual processing. I can certainly accept that there are simple perceptual phenomena, such as the tilt illusion, that can occur without conscious awareness of the stimuli because these processes are clearly so automatic that it is impossible not to experience them. In contrast, I find the idea that our brains process complex subliminal information, such as reading a sentence or segmenting a visual scene, pretty hard to swallow. I may of course be wrong but I am not quite ready to accept that conscious thought plays no role whatsoever in our lives, as some researchers seem to imply. In this context, I remain very skeptical of the notion that casually mentioning words that remind people of the elderly (like “Florida”) makes them walk more slowly or that showing a tiny American flag in the corner of the screen influences their voting behavior several months in the future. And like my host Sam (who has written far too much on this topic), I am extremely skeptical of claims of so-called psi effects, that is, precognition, telepathy, or “presentiment”. I feel that such findings most likely have far more trivial explanations and that the authors of such studies are too happy to accept the improbable.

But is my skepticism a good reason for me to replicate these experiments? I don’t think so. It would be unwise to investigate effects that I trust so little. Regardless of how you control your motivation, you can never be truly sure that it isn’t affecting the quality of the experiment in some way. I don’t know what processes underlie precognition. In fact, since I don’t believe precognition exists, it seems difficult to even speculate about the underlying processes. So I don’t know what factors to watch out for. Things that seem totally irrelevant to me may be extremely important. It is very easy to make subtle effects disappear by inadvertently increasing the noise in our measurements. The attitude of the researcher alone could influence how a research subject performs in an experiment. Researchers working on animals or in wet labs are typically well aware of this. Whether it is a sloppy experimental preparation or an uninterested experimenter training an animal on an experimental task, there are countless reasons why a perfectly valid experiment may fail. And if this can happen for actual effects, imagine how much worse it must be for effects that don’t exist in the first place!

As long as the Devil’s Neuroscientist has her fingers in Sam’s mind, he won’t attempt replicating ganzfeld studies or other “psi” experiments no matter how much he wants to see them replicated. See? I’m the sane one of the two of us!

Even though in theory we should judge original findings and replication attempts by the same standard, I don’t think this is really what is happening in practice – because it can’t happen if replication is conducted in this way. Jason Mitchell seems to allude to this also in a much-maligned commentary he published about this topic recently. There is an asymmetry inherent to replication attempts. It’s true, researcher degrees of freedom, questionable research practices, and general publication bias can produce false positives in the literature. But still, short of outright fraud it is much easier to fail to replicate something than it is to produce a convincing false positive. And yet, publish a failed replication of some “controversial” or counter-intuitive finding, and enjoy the immediate approving nods and often rather undisguised back-slapping within the ranks of the Jihadists of Scientific Veracity.

Instead what I would like to see more of is scientists building natural replications into their own experiments. Imagine, for instance, that someone published the discovery of yet another brain area responding selectively to a particular visual stimulus, images of apples, but not others like tools or houses. The authors call this the occipital apple area. Rather than conducting a basic replication study repeating the methods of the original experiment step-by-step, you should instead seek to better understand this finding. At this point it of course remains very possible that this finding is completely spurious and that there simply isn’t such a thing as an occipital apple area. But the best way to reveal this is to test alternative explanations. For example, the activation in this brain region could be related to a more basic attribute of the images, such as the fact that the stimuli were round. Alternatively, it could be that this region responds much more generally to images of food. All of these ideas are straightforward hypotheses that make testable predictions. Crucially though, all of these experiments also require replication of the original result in order to confirm that the anatomical location of the region processing stimulus roundness or general foodstuffs actually corresponds to the occipital apple area reported by the original study.

Here are two possible outcomes of this experiment: in the first example you observe that when using the same methods as the original study you get a robust response to apples in this brain region. However, you also show that a whole range of other non-apple images evoke strong fMRI responses in this brain region, provided the depicted objects are round. Responses do not show any systematic relationship with whether the images are of food. The region also responds to basketballs, faces, and snow globes but not to bananas or chocolate bars. Thus it appears that you have confirmed the first hypothesis, that the occipital apple area is actually an occipital roundness area. This may still not be the whole story but it is fairly clear evidence that this area doesn’t particularly care about apples.

Now compare this to the second situation: here you don’t observe any responses in this brain region to any of the stimuli, including the very same apples from the original experiment. What does this result teach us? Not very much. You’ve failed to replicate the original finding but any failure to replicate could very likely result from any number of factors we don’t as yet understand. As Sam (and thus also I) recently learned in a talk in which Ap Dijksterhuis discussed a recent failure to replicate one of his social priming experiments, psychologists apparently call such factors “unknown moderators”.

Ap Dijksterhuis whilst debating with David Shanks at UCL

I find the explanation of failed replications by unknown moderators somewhat dissatisfying but of course such factors must exist. They are the unexplained variance I discussed in my previous posts. But of course there is a simpler explanation: that the purported effect simply doesn’t exist. The notion of unknown moderators is based on the underlying concept driving all science, that is, the idea that the universe we inhabit is governed by certain rules so that conducting an experiment with suitably tight control of the parameters will produce consistent results. So if you talk about “moderators” you should perform experiments testing the existence of such moderating factors. Unless you have evidence for a moderator, any talk about unknown moderators is just empty waffle.

Critically, however, this works both ways. As long as you can’t provide conclusive evidence as to why you failed to replicate, your wonderful replication experiment tells us sadly little about the truth behind the occipital apple area. Comparing the two hypothetical examples, I would trust the findings of the first example far more than the latter. Even if you and others repeat the apple imaging experiment half a dozen times and your statistics are very robust, it remains difficult to rule out that these repeated failures are due to something seemingly trivial but essential that you overlooked. I believe this is what Jason Mitchell was trying to say when he wrote that “unsuccessful experiments have no meaningful scientific value.”

Mind you, I think Mitchell’s commentary is also wrong about a great many things. Like other brave men and women standing up to the Crusaders (I suppose that makes them the “Heathens of Psychology Research”?) he implies that a lot of replications fail because they are conducted by researchers with little or no expertise in the research they seek to replicate. He also suggests that there are many procedural details that aren’t reported in the methods sections of scientific publications, such as the fact that participants in fMRI experiments are instructed not to move. This prompted Sam’s colleague Micah Allen to create the Twitter hashtag #methodswedontreport. There is a lot of truth to the fact that some methodological details are just assumed to be common knowledge. However, the hashtag quickly deteriorated into a comedy vent because – you know – it’s Twitter.

In any case, while it is true that a certain level of competence is necessary to conduct a valid replication, I don’t think Mitchell’s argument holds water here. To categorically accuse replicators of incompetence simply because they have a different area of expertise is a logical fallacy. I’ve heard these arguments so many times. Whether it is about Bem’s precognition effects, Bargh’s elderly priming, or whatever other big finding people failed to replicate, the argument is often made that such effects are subtle and only occur under certain specific circumstances that only the “expert” seems to know about. My alter ego, Sam, faced similar criticisms when he published a commentary about a parapsychology article. In some people’s eyes you can’t even voice a critical opinion about an experiment, let alone try to replicate it, if you haven’t done such experiments before. Doesn’t anybody perceive something of a catch-22 here?

Let’s be honest here. Scientists aren’t wizards and our labs aren’t ivory towers. The reason we publish our scientific findings is (apart from building a reputation and hopefully reaping millions of grant dollars) that we must communicate them to the world. Perhaps in previous centuries some scientists could sit in seclusion and tinker happily without anyone ever finding out the great truths about the universe they discovered. In this day and age this approach won’t get you very far. And the fact that we actually know about the great scientists of Antiquity and the Renaissance shows that even those scientists disseminated their findings to the wider public. And so they should. Science should benefit all of humanity and at least in modern times it is also often paid for by the public. There may be ills with our “publish or perish” culture but it certainly has this going for it: you can’t just do science on your own simply to satisfy your own curiosity.

The best thing about Birmingham is that they locked up all the psychology researchers in an Ivory Tower #FoxNewsFacts

Part of science communication is to publish detailed descriptions of the methods we use. It is true that some details that should be known to anyone doing such experiments may not be reported. However, if your method section is so sparse in information that it doesn’t permit a proper replication by anyone with a reasonable level of expertise, then it is not good enough! It is true, I’m no particle physicist and I would most likely fail miserably if I tried to replicate some findings from the Large Hadron Collider – but I sure as Hell should be expected to do reasonably well at replicating a finding from my own general field even if I have never done this particular experiment before.

I don’t deny that a lack of expertise with a particular subject may result in some incompetence and some avoidable mistakes – but the burden of proof for this lies not with the replicator but with the person asserting that incompetence is the problem in the first place. By all means, if I am doing something wrong in my replication, tell me what it is and show me that it matters. If you can’t, your unreported methods or unknown moderators are completely worthless.

What can we do?

As I have said many times before, replication is essential. But as I have tried to argue here, I believe we should not replicate for replication’s sake. Replication of previous findings should be part of making new discoveries. And when that fails the onus should be on you to find out why. Perhaps the previous result was a false positive but if so you aren’t going to prove it with one or even a whole series of failed replications. You can however support the case by showing the factors that influence the results and testing alternative explanations. One failed replication of social priming effects caused a tremendous amount of discussion – and considerable ridicule for the original author because of the way in which he responded to his critics. For the Crusaders this whole affair seems to be a perfect example of the problems with science. However, ironically, it is actually one of the best examples of how a failed replication should look. The authors failed to replicate the original finding but they also tested a specific alternative explanation: that the priming effect was not caused by the stimuli to which the subjects were exposed but by the experimenters’ expectations. Now I don’t know if this study is any more true than the original one. I don’t know if these people simply failed to replicate because they were incompetent. They were clearly competent enough to find an effect in another experimental condition so it’s unlikely to just be that.

Contrast this case with the disagreement between Ap Dijksterhuis and David Shanks. On the one hand, you have nine experiments failing to replicate the original result. On the other hand, during the debate I witnessed with Sam’s eyes, you have Dijksterhuis talking about unknown moderators and wondering whether Shanks’ experiments were done inside cubicles (they were, in case you were wondering – another case of #methodswedontreport). Is this the “expertise” and “competence” we need to replicate social priming experiments? The Devil is not convinced. But either way, I don’t think this discussion really tells us anything.

David Shanks as he regards studies claiming subconscious priming effects (No, he doesn’t actually look like that)

So we should make replication part of all of our research. When you are skeptical of a finding, test alternative hypotheses about it. And more generally, always retain healthy skepticism. By definition any new finding will have been replicated less often than old findings that have made their way into the established body of knowledge. So when it comes to newly published results we should always reserve judgment and wait for them to be repeated. That doesn’t mean we can’t get excited about new surprising findings – in fact, being excited is a very good reason for people to want to replicate a result.

There are of course safeguards we can take to maximize the probability that new findings are solid. The authors of any study must employ appropriate statistical procedures and interrogate their data from various angles, using different analysis approaches, and employing a range of control experiments in order to ascertain how robust the results are. While there is a limit on how much of that we can expect any original study to do, there is certainly a minimum of rigorous testing that any study should fulfill. It is the job of peer reviewers, post-publication commenters, and of course also the researchers themselves to think of these tests. The decision of how much evidence suffices for an original finding can be made in correspondence with journal editors. These decisions will sometimes be wrong but that’s life. More importantly, regardless of a study’s truthiness, it should never be regarded as validated until it has enjoyed repeated replication by multiple studies from multiple labs. I know I have said it before but I will keep saying it until the message finally gets through: science is a slow, gradual process. Science can approach truths about the universe but it doesn’t really ever have all the answers. It can come tantalizingly close but there will always be things that elude our understanding. We must have patience.

The discovery of universal truths by scientific research moves at a much slower pace than this creature from the fast lane.

Another thing that could very well improve the state of our field is something that Sam has long argued for and on which I actually agree with him (I told you scientists disagree with one another all the time and this even includes those scientists with multiple personalities). There ought to be a better way to quantify how often and how robustly any given finding has been replicated. I envision a system similar to Google Scholar or PubMed in which we can search for a particular result – say, a brain-behavior correlation, the location of the fusiform face area, the existence of arsenic-based life forms, or a precognition effect. The system then not only finds the original publication but displays a tree structure linking the original finding to all of the replication attempts, whether they were successful, and to what extent they were direct or conceptual replications. A more sophisticated system could allow direct calculation of meta-analytical parameter estimates, for example to narrow down the stereotactic coordinates of the reported brain area or the effect size of a brain-behavior correlation.

Setting up such a system will certainly require a fair amount of meta-information and a database that can encode the complex links between findings. I realize that this represents considerable effort, but once this platform is up and running and has become part of our natural process the additional effort will probably be barely noticeable. It may also be possible to automate many aspects of building this database.
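To make this a little less hand-wavy, here is a bare-bones sketch of what a single node of such a database might look like, and how a simple fixed-effect summary (Fisher’s z, weighted by sample size) could be computed from it. All names and numbers are invented for illustration; a real system would obviously need far richer meta-data and links between findings.

```python
from dataclasses import dataclass, field
from math import atanh, tanh

@dataclass
class Study:
    label: str
    r: float                  # reported brain-behavior correlation
    n: int                    # sample size
    direct: bool = True       # direct or conceptual replication

@dataclass
class Finding:
    original: Study
    replications: list[Study] = field(default_factory=list)

    def meta_r(self) -> float:
        """Fixed-effect summary: Fisher-z average weighted by n - 3."""
        studies = [self.original] + self.replications
        weights = [s.n - 3 for s in studies]
        z = sum(w * atanh(s.r) for s, w in zip(studies, weights)) / sum(weights)
        return tanh(z)

finding = Finding(Study("original report", r=0.62, n=20))
finding.replications += [Study("lab B, direct", r=0.15, n=80),
                         Study("lab C, conceptual", r=0.20, n=60, direct=False)]
print(f"meta-analytic r across {1 + len(finding.replications)} studies:"
      f" {finding.meta_r():.2f}")
```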

Last but not least, we should remember that science should seek to understand how the world works. It should not be about personal vendettas and attachment or opposition to particular theories. I think it would benefit both the defending Heathens and the assaulting Crusaders to consider that more. Science should seek explanations. Trying to replicate a finding simply because it seems suspect or unbelievable to you is not science but more akin to clay pigeon shooting. Instead of furthering our understanding we just remain at square one. But to the Heathens I say this: if someone says that your theory is wrong or incomplete, or that your result fails to replicate, they aren’t attacking you. We are allowed to be wrong occasionally – as I said before, science is always wrong about something. Science should be about evidence, disagreement, debate, and the continuous overturning of previously held ideas. The single best weapon in the fight against the tedious Crusaders is this:

The fiercest critic of your own research should be you.

Christmas break

It is Christmas now and my alter ego, Sam, will be strong during this time so I won’t be able to possess his mind. And to be honest, even demonic scientists need vacations. I will probably get in trouble with the boss for taking a Christmas vacation but so be it.

I won’t be able to check this blog (not often, anyway) and probably won’t approve new commenters during the break. Discussions may continue in 2015!

Of self-correction and selfless errors

I had originally planned to discuss today’s topic at a later point, perhaps as part of my upcoming post about the myths of replication. However, discussions surrounding my previous posts, as well as the ongoing focus on the use of posthoc power analysis in the literature, led me to address this point now.

A central theme of my previous two posts was the notion that science is self-correcting and that replication and skepticism are the best tools at our disposal. Discussions my alter ego Sam has had with colleagues, as well as discussions in the comment section on this blog and elsewhere, reveal that many from the ranks of the Crusaders for True Science call that notion into question. In that context, I would like to thank a very “confused” commenter on my blog for referring me to an article I hadn’t read, which is literally entitled “Why science is not necessarily self-correcting”. I would also like to thank Greg Francis, who commented on his application of statistical tests to detect possible publication bias in the literature. Recently I also became aware of another statistical procedure based on assumptions about statistical power, the Replication Index, which was proposed as an alternative to the Test of Excess Significance used by Francis. I think these people are genuinely motivated by a selfless desire to improve the current state of science. This is a noble goal but I think it is fraught with errors and some potentially quite dangerous misunderstandings.

The Errors of Meta-Science

I will start with the statistical procedures for detecting publication bias and the assertion that most scientific findings are false positives. I call this entire endeavor “meta-science” because this underlines the fundamental problem with this whole discussion. As I pointed out in my previous post, science is always wrong. It operates like a model-fitting procedure that gradually improves the explanatory and predictive value of our attempts to understand a complex universe. The point that people are missing in this entire debate about asserted false positive rates, non-reproducibility, and publication bias is that the methods used to make these assertions are themselves science. Thus these procedures suffer from the same problems as any scientific effort: they seek to approximate the truth but can never actually hope to reach it. And because they use scientific methods to scrutinize the workings of the scientific method, their logic is inherently circular.

Circular inference has recently received a bit of attention within neuroscience. I don’t know if the authors of this paper actually coined the term “voodoo correlations”. Perhaps they merely popularized it. The same logical fallacy has also been called “double-dipping”. However, all of this is really just circular reasoning and somewhat related to “begging the question”. It is more a problem of flawed logic than of science. Essentially, it is what happens when you use the same measurements to test the validity of your predictions as you did for making the predictions in the first place.

This logical fallacy can result in serious errors. However, in the real world it isn’t entirely avoidable and it isn’t always problematic as long as we are aware of its presence. For instance, a point most people are missing is that whenever you report something like t(31)=5.2, p<0.001, or a goodness-of-fit statistic, you are using circular inference. You are reporting an estimate of the effect size (be it a t-statistic, the goodness-of-fit, Cohen’s d, or others) based on the observed data and then drawing some sort of general conclusion from it. The goodness of a curve fit is literally calculated from the accuracy with which the model predicts the observed data. Just observing an effect size, say, a difference in some cognitive measure between males and females, can only tell you that this difference exists in your sample. You can make some probabilistic inferences about how this observed effect may generalize to the larger population, and this is what statistical procedures do – however, in truth you cannot know what an effect means for the general population until you have checked your predictions through empirical observations.

There are ways out of this dilemma, for example cross-validation procedures. I believe these should be encouraged, especially whenever a claim about the predictive value of a hypothesis is made. More generally, replication attempts are of course a way to test predictions from previous results. Again, we should probably encourage more of that, and cross-validation and replication can ideally be combined. Nevertheless, the somewhat circular nature of reporting observed effect sizes isn’t necessarily a major problem provided we keep in mind what an effect size estimate can and cannot tell us.
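The danger of circular selection, and how a simple held-out split exposes it, can be illustrated with a toy simulation. The data below are entirely made up and contain no true effect at all; the numbers of subjects and variables are assumptions chosen for illustration only:

```python
# Toy illustration of circular inference: select the "best" variable on one
# half of the data, then re-estimate its effect on the held-out half.
import numpy as np

rng = np.random.default_rng(3)
n_subjects, n_variables = 40, 100            # e.g. 100 candidate brain measures
brain = rng.normal(size=(n_subjects, n_variables))
behavior = rng.normal(size=n_subjects)       # unrelated to any brain measure

half = n_subjects // 2
r_discovery = np.array([np.corrcoef(brain[:half, i], behavior[:half])[0, 1]
                        for i in range(n_variables)])
best = int(np.argmax(np.abs(r_discovery)))   # circular step: select on the same data

r_heldout = np.corrcoef(brain[half:, best], behavior[half:])[0, 1]
print(f"Best correlation in discovery half: r = {r_discovery[best]:+.2f}")
print(f"Same variable in held-out half:     r = {r_heldout:+.2f}")
```

The selected correlation looks impressive in the data it was selected on and collapses in the independent half – which is exactly the kind of thing cross-validation is designed to reveal.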

The same applies to the tests employed by meta-science. These procedures take an effect size estimate from the scientific literature, calculate the probability that a significant result of this size would have been obtained under the assumed conditions (that is, the statistical power), and then make inferences based on this posthoc probability. The assumptions on which these procedures rest remain entirely untested. Insofar as they make predictions at all, such as whether the effect is likely to be replicated in future experiments, no effort is typically made to test them. Statistical probabilities are not a sufficient replacement for empirical tests. You can show me careful, mathematically coherent arguments as to why some probability should be such and such – if the equation is based on flawed assumptions and/or it doesn’t take into account some confounds, the resulting conclusions may be untenable. This doesn’t necessarily mean that the procedure is worthless. It is simply like all other science. It constructs an explanation for the chaotic world out there that may or may not be adequate. It can never be a perfect explanation and we should not treat it as if it were the unadulterated truth. This is really my main gripe with meta-science: proponents of these procedures treat them as if they were unshakable fact, and the conviction with which some promote these methods borders on religious zeal.
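To show the circular step in its simplest form, here is a sketch of “observed power” for a one-sample t-test. This is not a reimplementation of any particular published procedure, just the bare logic of plugging the observed effect back in as if it were the true one; the t-values and sample size are invented:

```python
# "Observed power": assume the true effect equals the observed one (the
# circular step) and compute the power the experiment would then have had.
import numpy as np
from scipy import stats

def observed_power(t_obs, n, alpha=0.05):
    df = n - 1
    d_obs = t_obs / np.sqrt(n)              # observed Cohen's d
    ncp = d_obs * np.sqrt(n)                # noncentrality if d_obs were the truth
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

# Observed power contains no information beyond the p-value itself:
for t_obs in (2.1, 3.0, 5.2):
    p = 2 * stats.t.sf(t_obs, 31)
    print(f"t(31) = {t_obs:.1f}, p = {p:.4f}, observed power = {observed_power(t_obs, 32):.2f}")
```

Because the “power” here is a deterministic function of the observed statistic, it cannot tell us anything the p-value did not already tell us.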

One example is the assertion that a lot of scientific findings are false positives. This argument is based on the premise that many published experiments are underpowered and that publication bias (which we know exists because researchers actively seek positive results) means that mainly positive findings are reported. In turn this may explain what some have called the “Decline Effect“, that is, initial effect size estimates are inflated and they gradually decrease and approach the true effect size as more and more data are collected.

I don’t deny that lack of power and publication bias exist. However, I also think that the current explanations are insufficient to explain all the data. There are other reasons that may cause effect size estimates to shrink as time goes on and more and more attempts at replication are made. Few of them are ever formally taken into account by meta-science models, partly because they are notoriously difficult to quantify. For instance, there is the question of whether all experiments are comparable. Even with identical procedures carried out by meticulous researchers, the quality of reagents, of experimental subjects, or of the data we measure more generally can differ markedly. I think this may be a particularly bad problem in psychology and cognitive neuroscience research although it probably exists in many areas of science. I call this the Data Quality Decay Function:

SubjectQuality

Take for example the reliability and quality of data that can be expected from research subjects. In the early days after the experiment was conceived we test people drawn from subject pools of reliable research subjects. If it is a psychophysical study on visual perception, chances are that the subjects are authors on the paper or at least colleagues who have considerable experience with doing experiments. The data reaped from such subjects will be clean, low-noise estimates of the true effect size. The cynical might call these kinds of subjects “representative” and possibly even “naive”, provided they didn’t co-author the paper at least.

As you add more and more subjects the recruitment pool will inevitably widen. At first there will be motivated individuals who don’t mind sitting in dark rooms staring madly at a tiny dot while images are flashed up in their peripheral vision. Even though the typical laboratory conditions of experiments are a long way from normal everyday behavior, people like this will be engaged and willing enough to perform reasonably on the experiment, but they may fatigue more easily than trained lab members. Moreover, sooner or later your subject pool will encompass subjects with low motivation. There will be those who only take part in your study because of the money they get paid or (even worse) because they are coerced by course credit requirements. There may even be professional subjects who participate in several experiments within the same day. You can try to control this but it won’t be fool-proof, because in the end you’ll have to take them at their word even if you ask them whether they have participated in other experiments. And to be honest, professional subjects may be more reliable than inexperienced ones so it can be worthwhile to test them. Also, frequently you just don’t have the luxury to turn away subjects. I don’t know about your department but people weren’t exactly kicking in the doors of any lab I have seen just to participate in experiments.

Eventually, you will end up testing random folk off the street. This is what you will want if you are actually interested in generalizing your findings to the human condition. Ideally, you will test the effect in a large, diverse, multicultural, multiethnic, multiracial sample that encompasses the full variance of our species (this very rarely happens). You may even try to relax the strict laboratory conditions of earlier studies. In fact you’ll probably be forced to, because Mongolian nomads or Amazonian tribeswomen, or whoever else your subject population may be, just don’t tend to hang around psychology departments in Western cities. The effect size estimate under these conditions will almost inevitably be smaller than in the original experiments because of the reduced signal-to-noise ratio. Even if the true biological effect is constant across humanity, the variance will be greater.

This last point highlights why it isn’t so straightforward to say “But I want my findings to generalize so the later estimate reflects the truth more accurately”. It really depends on what your research question is. If you want to measure an effect and make general predictions as to what this means for other human beings, then yes, you should test as wide a sample as possible and understand why any meaningful effect size is likely to be small. Say, for instance, we are testing the efficacy of a new drug and characterize its adverse effects. Such experiments should be carried out on a wide-ranging sample to understand how differences between populations or individual background can account for side effects and whether the drug is even effective. You shouldn’t test a drug only on White Finnish men only to find out later that it is wholly useless or even positively dangerous in Black Caribbean women. This is not just a silly example – this sort of sampling bias can be a serious concern.

On the other hand, when you are testing a basic function of the perceptual system in the human brain, testing a broad section of the human species is probably not the wisest course of action. My confidence in psychophysical results produced by experienced observers, even if they are certifiably non-naive and anything-but-blind to the purpose of the experiment (say, because they are the lead author of the paper and coded the experimental protocol), can still be far greater than it would be for the same measurements from individuals recruited from the general population. There are myriad factors influencing the latter that are much more tightly controlled in the former. Apart from issues with fatigue and practice with the experimental setting, naive participants may also simply not know what to look for. If you cannot produce an accurate report of your perceptual experience, you aren’t going to produce an accurate measurement of it.

Now this is one specific example and it obviously does not have to apply to all cases. I am pretty confident that the Data Quality Decay Function exists. It occurs for research subjects but it could also relate to reagents that are being reused, small errors in a protocol that accumulate over time, and many other factors. In many situations the slope of the curve may be so shallow that the decay is effectively non-existent. There are also likely to be other factors that counteract and, in some cases, invert the function. For instance, if the follow-up experiments actually improve the methodology of an experiment the data quality might even be enhanced. This is certainly the hope we have for science in general – but this development may take a very long time.

The point is, we don’t really know very much about anything. We don’t know how data quality, and thus effect size estimates, vary with time, across samples, between different experimenters, and so forth. What we do know is that under the most common assumptions (e.g. Gaussian errors of equal magnitude across groups) the sample sizes we can realistically use are insufficient for reliable effect size estimates. The main implication of the Data Quality Decay Function is that the effect size estimates under standard assumptions are probably smaller than the true effect.
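A toy simulation can make this intuition concrete. The noise levels below are pure assumptions chosen for illustration; the point is only that a constant raw effect looks smaller in standardized units as measurement noise grows across successively broader samples:

```python
# Hypothetical illustration of the Data Quality Decay Function: the raw
# effect is constant, but the standardized effect size shrinks with noise.
import numpy as np

rng = np.random.default_rng(1)
true_effect = 0.5                       # constant raw effect in measurement units
noise_sd = {"trained observers": 0.5,   # assumed noise levels, purely illustrative
            "motivated volunteers": 1.0,
            "general population": 2.0}

for label, sd in noise_sd.items():
    sample = true_effect + rng.normal(0, sd, size=10000)
    d = sample.mean() / sample.std(ddof=1)   # standardized effect size (Cohen's d)
    print(f"{label:>20s}: Cohen's d ≈ {d:.2f}")
```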

While I am quite a stubborn lady, as I said earlier, I am not so stubborn as to think this is the sole explanation. We know publication bias exists and so it is almost inevitable that it affects effect sizes in the literature. I also think that even if some of the procedures used to infer false positive rates and publication bias are based on untested assumptions and on logically flawed posthoc probabilities, they reveal some truths. All meta-science is wrong – but that doesn’t make it wholly worthless. I just believe we should take it with a grain of salt and treat it like all other science. In the long run, meta-science will self-correct.

LocalMinimum

Sometimes when you’re in a local minimum the view is just better

Self-correction is a fact

This brings me to the other point of today’s post, the claim that self-correction in science is a myth. I argue that self-correction is inherent to the scientific process itself. All the arguments against self-correction I have heard are based on another logical fallacy. People may say that the damn long time it took the scientific community to move beyond errors like phrenology or racial theories demonstrates that science does not by itself correct its mistakes. They suggest that because particular tools, e.g. peer review or replication, have failed to detect serious errors or even fraudulent results, science itself does not weed out such issues.

The logical flaw here is that all of these things are effectively point estimates from a slow time series. It is the same misconception that leads people to deny that global temperatures are rising because we have had some particularly cold winters in some specific years in fairly specific corners of the Earth, and the same error that leads creationists to claim that evolution has never been observed directly. It is why previous generations of scientists found it so hard to accept the thought that the surface of the Earth comprises tectonic plates drifting on the slowly deforming rock of the mantle. Fortunately, science has already self-corrected that latter misconception and plate tectonics is now widely accepted well beyond the scientific community. Sadly, evolution and climate change have not arrived at the same level of mainstream acceptance.

It seems somewhat ironic that we as scientists should find it so difficult to understand that science is a gradual, slow process. After all we are all aware of evolutionary, geological, and astronomical time scales. However, in the end scientists are human and thus subject to the same perceptual limits and cognitive illusions as the rest of our species. We may get a bit of an advantage compared to other people who simply never need to think about similar spatial and temporal dimensions. But in the end, our minds aren’t any better equipped to fathom the enormity and age of the cosmos than anybody else’s.

Science is self-correcting because that is what science does. It is the constant drive to seek better answers to the same questions and to derive new questions that can provide even better answers. If the old paradigms are no longer satisfactory, they are abandoned. It can and does happen all the time. Massive paradigm shifts may not be very frequent but that doesn’t mean they don’t happen. As I said last time, science does of course make mistakes and these mistakes can prevail for centuries. To use my model-fitting analogy again, one would say that the algorithm gets stuck in a “local minimum”. It can take a lot of energy to get out of that but given enough time and resources it will happen. It could be a bright spark of genius that overthrows accepted theories. It could be that the explanatory construct of the status quo becomes so overloaded that it collapses like a house of cards. Or sometimes it may simply be a new technology or method that allows us to see things more clearly than before. Sometimes dogmatic, political, religious, or other social pressure can delay progress; for a long time, for example, your hope of being taken seriously as a woman scientist was practically nil. In that case, what it takes to move science forward may be some fundamental change to our whole society.

Either way, bemoaning the fact that replication and skeptical scrutiny haven’t solved all problems, rectified every erroneous assumption, and refuted every false result is utterly pointless. Sure, we can take steps to reduce the number of false positives, but don’t go so far as to make it close to impossible to detect new important results. Don’t make the importance of a finding dependent on it being replicated hundreds of times first. We need replication for results to stand the test of time, but scientists will always try to replicate potentially important findings. If nobody can be bothered to replicate something, it may not be all that useful – at least at the time. Chances are that in 50 or 100 or 1000 years the result will be rediscovered and prove to be critical, and then our descendants will be glad that we published it.

By all means, change the way scientists are evaluated and how grants are awarded. I’ve said it before but I’ll happily repeat it. Immediate impact should not be the only yardstick by which to measure science. Writing grant proposals as catalogs of hypotheses when some of the work is inevitably exploratory in nature seems misguided to me. And I am certainly not opposed to improving our statistical practice, ensuring higher powered experiments, and encouraging strategies for more replication and cross-validation approaches.

However, the best and most important thing we can do to strengthen the self-correcting forces of science is to increase funding for research, to fight dogma wherever it may fester, and to train more critical and creative thinkers.

“In science it often happens that scientists say, ‘You know that’s a really good argument; my position is mistaken,’ and then they would actually change their minds and you never hear that old view from them again. They really do it. It doesn’t happen as often as it should, because scientists are human and change is sometimes painful. But it happens every day. I cannot recall the last time something like that happened in politics or religion.”

Carl Sagan

Why all research findings are false

(Disclaimer: For those who have not seen this blog before, I must again point out that the views expressed here are those of the demonic Devil’s Neuroscientist, not those of the poor hapless Sam Schwarzkopf whose body I am possessing. We may occasionally agree on some things but we disagree on many more. So if you disagree with me feel free to discuss with me on this blog but please leave him alone)

In my previous post I discussed the proposal that all¹ research studies should be preregistered. This is perhaps one of the most contentious ideas being pushed as a remedy for what ails modern science. There are of course others, such as the push for “open science”, that is, demands for free access to all publications, transparent post-publication review, and sharing of all data collected for experiments. This debate has even become entangled with age-old faith wars about statistical schools of thought. Some of these ideas (like preregistration or whether reviews should be anonymous) remain controversial and polarizing, while others (like open access to studies) are so contagious that they have become almost universally accepted, to the point that disagreeing with such well-meaning notions makes you feel like you have the plague. On this blog I will probably discuss each of these ideas at some point. However, today I want to talk about a more general point that I find ultimately more important because this entire debate is just a symptom of a larger misconception:

Science is not sick. It never has been. Science is how we can reveal the secrets of the universe. It is a slow, iterative, arduous process. It makes mistakes but it is self-correcting. That doesn’t mean that the mistakes don’t sometimes stick around for centuries. Sometimes it takes new technologies, discoveries, or theories (all of which are of course themselves part of science) to make progress. Fundamental laws of nature will perhaps keep us from ever discovering certain things, say, what happens when you approach the speed of light, leaving them for theoretical consideration only. But however severe the errors, provided our species doesn’t become extinct through cataclysmic cosmic events or self-inflicted destruction, science has the potential to correct them.

Also science never proves anything. You may read in the popular media about how scientists “discovered” this or that, how they’ve shown certain things, or how certain things we believe turn out to be untrue. But this is just common parlance for describing what scientists actually do: they formulate hypotheses, try to test them by experiments, interpret their observations, and use them to come up with better hypotheses. Actually, and quite relevant to the discussion about preregistration, this process frequently doesn’t start with the formulation of hypotheses but with making chance observations. So a more succinct description of a scientist’s work is this: we observe the world and try to explain it.

Science as model fitting

In essence, science is just a model-fitting algorithm. It starts with noisy, seemingly chaotic observations (the black dots in the figures below) and it attempts to come up with a model that can explain how these observations came about (the solid curves). A good model can then make predictions as to how future observations will turn out. The numbers above the three panels in this figure indicate the goodness-of-fit, that is, how good an explanation the model is for the observed data. Numbers closer to 1 denote better model fits.

CurveFitting

It should be immediately clear that the model in the right panel is a much better description of the relationship between data points on the two axes than the other panels. However, it is also a lot more complex. In many ways, the simple lines in the left or middle panel are much better models because they will allow us to make predictions that are far more likely to be accurate. In contrast, for the model in the right panel, we can’t even say what the curve will look like if we move beyond 30 on the horizontal axis.

One of the key principles in the scientific method is the principle of parsimony, also often called Occam’s Razor. It basically states that whenever you have several possible explanations for something, the simplest one is probably correct (it doesn’t really say it that way but that’s the folk version and it serves us just fine here). Of course we should weigh the simplicity of an explanation against its explanatory or predictive power. The goodness-of-fit of the middle panel is better than that of the left panel, although not by much. At the same time, it isn’t that much more complex than the simple linear relationship shown in the left panel. So we could perhaps accept the middle panel as our best explanation – for now.

The truth though is that we can never be sure what the true underlying explanation is. We can only collect more data and see how well our currently favored models do in predicting them. Sooner or later we will find that one of the models is just doing better than all the others. In the figure below the models fitted to the previous observations are shown as red curves while the black dots are new observations. It should have become quite obvious that the complex model in the right panel is a poor explanation for the data. The goodness-of-fit on these new observations for this model is now much poorer than for the other two. This is because this complex model was actually overfitting the data. It tried to come up with the best possible explanation for every observation instead of weighing explanatory power against simplicity. This is probably kind of what is going on in the heads of conspiracy theorists. It is the attempt to make sense of a chaotic world without taking a step back to think whether there might not be simpler explanations and whether our theory can make testable predictions. However, as extreme as this case may look, scientists are not immune from making such errors either. Scientists are after all human.

CrossValidation

I will end the model-fitting analogy here. Suffice it to say that with sufficient data it should become clear that the curve in the middle panel is the best-fitting of the three options. However, it is actually also wrong. Not only is the function used to model the data not the one that was actually used to generate the observations, but the model also cannot really predict the noise, the random variability spoiling our otherwise beautiful predictions. Even in the best-fitting case the noise prevents us from predicting future observations perfectly. The ideal model would not only need to describe the relationship between data points on the horizontal and vertical axes but it would have to be able to predict the random fluctuation added on top of it. This is unfeasible and presumably impossible without a perfect knowledge of the state of everything in the universe from the nanoscopic to the astronomical scale. If we tried, the result would most likely look like the overfitted example in the right panel. Therefore this unexplainable variance will always remain in any scientific finding.
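For readers who like to tinker, here is a toy reconstruction of the analogy. It does not reproduce the actual figures; the generating function, noise level, and polynomial degrees are assumptions chosen only to show how an overly complex model wins on the original data and loses on new observations:

```python
# Fit simple, moderate, and overly complex polynomials to noisy data and
# compare goodness-of-fit on the original versus on new observations.
import numpy as np

rng = np.random.default_rng(0)

def r_squared(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def make_data(n=20):
    x = np.linspace(0, 30, n)
    y = 0.5 * x + 3 * np.sin(x / 5) + rng.normal(0, 2, n)   # assumed "true" process
    return x, y

x_old, y_old = make_data()
x_new, y_new = make_data()          # later observations from the same process

for degree in (1, 3, 12):           # simple, moderate, overfitted
    model = np.polynomial.Polynomial.fit(x_old, y_old, degree)
    fit_old = r_squared(y_old, model(x_old))
    fit_new = r_squared(y_new, model(x_new))
    print(f"degree {degree:2d}: R² (original data) = {fit_old:.2f}, R² (new data) = {fit_new:.2f}")
```

The high-degree polynomial achieves the best fit to the original observations and a worse fit to the new ones – the overfitting the panels above are meant to convey.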

Horizon

A scientist will keep swimming to find what lies beyond that horizon

Science is always wrong

This analogy highlights why the fear of incorrect conclusions and false positives that has germinated in recent scientific discourse is irrational and misguided. I may have many crises but reproducibility isn’t one of them. Science is always wrong. It is doomed to always chase a deeper truth without any hope of ever reaching it. This may sound bleak but it truly isn’t. Being wrong is inherent to the process. This is what makes science exciting. These ventures into the unknown drive most scientists, which is why many of us actually like the thought of getting up in the morning and going to work, why we stay late in the evening trying to solve problems instead of doing something more immediately meaningful, and why we put up with pitifully low salaries compared to our former classmates who ended up getting “real jobs”. It is also the same daring and curiosity that drove our ancestors to invent tools, discover fire, and cross unforgiving oceans in tiny boats made out of tree trunks. Science is an example of the highest endeavors the human spirit is capable of (it is not the only one but this topic is outside the scope of this blog). If I wanted unwavering certainty that I know the truth of the world, I’d have become a religious leader, not a scientist.

Now one of the self-declared healers of our “ailing” science will doubtless interject that nobody disagrees with me on this, that I am just being philosophical, or playing with semantics. Shouldn’t we guarantee, or so they will argue, that research findings are as accurate and true as they can possibly be? Surely, the fact that many primary studies, in particular those in high profile journals, are notoriously underpowered is cause for concern? Isn’t publication bias, the fact that mostly significant findings are published while null findings are not, the biggest problem for the scientific community? It basically means that we can’t trust a body of evidence because even in the best-case scenario the strength of evidence is probably inflated.

The Devil’s Neuroscientist may be evil and stubborn but she² isn’t entirely ignorant. I am not denying that some of these issues are problematic. But fortunately the scientific method already comes with a natural resistance, if not a perfect immunity, against them: skepticism and replication. Scientists use them all the time. Those people who have not quite managed to wrap their heads around the fact that I am not my alter ego, Sam Schwarzkopf, will say that I sound like a broken record³. While Sam and my humble self don’t see eye to eye on everything, we probably agree on these points as he has repeatedly written about this in recent months. So as a servant of the devil, perhaps I sound like a demonic Beatles record: noitacilper dna msicitpeks.

There are a lot of myths about replication and reproducibility and I will write an in-depth post about that at a future point. Briefly though, let me stress that, evil as I may be, I believe that replication is a cornerstone of scientific research. Replication is the most trustworthy test for any scientific claim. If a result cannot be replicated, perhaps because the experiment was a once-in-an-age opportunity, because it would be too expensive to do twice, or for whatever other reason, then it may be interesting but it is barely more than an anecdote. At the very least we should expect pretty compelling evidence for any claims made about it.

Luckily, for most scientific discoveries this is not the case. We have the liberty and the resources to repeat experiments, with or without systematic changes, to understand the factors that govern them. We should and can replicate our own findings. We can and should replicate other people’s findings. The more we do of this the better. This doesn’t mean we need to go on a big replication rampage like the “Many Labs” projects. Not that I have anything against this sort of thing if people want to spend their time in this way. I think for a lot of results this is probably a waste of time and resources. Rather I believe we should encourage a natural climate of replication and I think it already exists although it can be enhanced. But as I said, I will specifically discuss replication in a future post so I will leave this here.

Instead let me focus on the other defense we have at our disposal. Skepticism is our best weapon against fluke results. You should never take anything you read in a scientific study at face value. If there is one thing every scientist should learn it is this. In writing, scientific results look more convincing and “cleaner” than they are when you’re in the middle of experiments and data analysis. And even for those (rare?) studies with striking data, insurmountable statistics, and the most compelling intellectual arguments you should always ask “Could there be any other explanation for this?” and “What hypothesis does this finding actually disprove?” The latter question underlines a crucial point. While I said that science never proves anything, it does disprove things all the time. This is what we should be doing more of and we should probably start with our own work. Certainly, if a hypothesis isn’t falsifiable it is pretty meaningless to science. Perhaps a more realistic approach is the one advocated by Platt in his essay “Strong Inference“. Instead of testing whether one hypothesis is true we should pit two or more competing hypotheses against each other. In psychology and neuroscience research this is actually not always easy to do. Yet in my mind it is precisely the approach that some of the best studies in our field take. Doing this immunizes you against the infectiousness of dogmatic thinking because you no longer feel the need to prove your little pet theory, and you no longer run control experiments simply to rule out trivial alternatives. But admittedly this is often very difficult because typically one of the hypotheses is probably more exciting…

The point is, we should foster a climate where replication and skepticism are commonplace. We need to teach self-critical thinking and reward it. We should encourage adversarial collaborative replication efforts and the use of multiple hypotheses wherever possible. Above all we need to make people understand that criticism in science is not a bad thing but essential. Perhaps part of this involves training some basic people skills. It should be possible to display healthy, constructive skepticism without being rude and aggressive. Most people have stories to tell of offensive and irritating colleagues and science feuds. However, at least in my alter ego’s experience, most scientific disagreements are actually polite and constructive. Of course there are always exceptions: reviewer 2 we should probably just shoot into outer space.

What we should not do is listen to some delusional proposals about how to evaluate individual researchers, or even larger communities, by the replicability and other assessments of the truthiness of their results. Scientists must accept that we are ourselves mostly wrong about everything. Sometimes the biggest impact, insofar as that can be quantified, is not made by the person who finds the “truest” finding but by whoever lays the groundwork for future researchers. Even a completely erroneous theory can give some bright mind the inspiration for a better one. And even the brightest minds go down the garden path sometimes. Johannes Kepler searched for a beautiful geometry of the motion of celestial bodies that simply doesn’t exist. That doesn’t make his work worthless, as it was instrumental for future researchers. Isaac Newton wasted years of his life dabbling in alchemy. And even on the things he got “right”, describing the laws governing motion and gravity, he was also really kind of wrong because his laws only describe a special case. Does anyone truly believe that these guys didn’t make fundamental contributions to science regardless of what they may have erred on?

We hope our pilot experiments won't all crash and burn

May all your pilot experiments soar over the clouds like this, not crash and burn in misery

Improbability theory

Before I leave you all in peace (until the next post anyway), I want to make some remarks about some of the more concrete warnings about the state of research in our field. A lot of words are oozing out of the orifices in certain corners about the epidemic of underpowered studies and the associated spread of false positives in the scientific literature. Some people put real effort into applying statistical procedures to whole hosts of published results to reveal the existence of publication bias or “questionable research practices”. The logic behind these tests is that the aggregated power over a series of experiments makes it very improbable that statistically significant effects could be found in all of them. Apparently, this test flags up an overwhelming proportion of studies in some journals as questionable.
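For readers unfamiliar with this logic, here is a stripped-down sketch. It is not Francis’s actual procedure and the per-experiment power values are invented; it only shows the core calculation of how improbable an unbroken run of significant results would be:

```python
# Core logic of an "excess significance" check: with independent experiments,
# the chance that every single one is significant is the product of their powers.
import numpy as np

estimated_powers = [0.55, 0.60, 0.50, 0.65, 0.58]   # hypothetical per-experiment powers

p_all_significant = float(np.prod(estimated_powers))
print(f"P(all {len(estimated_powers)} experiments significant) ≈ {p_all_significant:.3f}")
# A value below some threshold (0.1 is sometimes quoted) is taken as a flag
# that the reported set of results looks "too good to be true".
```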

I fail to see the point of this. First of all, what good will come from naming and shaming studies/researchers who apparently engaged in some dubious data massaging, especially when, as we are often told, these problems are wide-spread? One major assertion that is then typically made is that the researchers ran more experiments than they reported in the publication but chose to withhold the non-significant results. While I have no doubt that this does in fact happen occasionally, I believe it is actually pretty rare. Perhaps it is because Sam, whose experiences I share, works in neuroimaging, where it would be pretty damn expensive (both in terms of money and time investment) to run lots of experiments and publish only the significant or interesting ones. Then again, he certainly has heard of published fMRI studies where a whopping number of subjects were excluded for no good reason. So some of that probably does exist. However, he was also trained by his mentors to believe that all properly executed science should be published and this is the philosophy by which he is trying to conduct his own research. So unless he is somehow unusual in this, or behavioral/social psychology research (about which claims of publication bias are made most often) is for some reason much worse than other fields, I don’t think unreported experiments are an enormous problem.

What instead might cause “publication bias” is the tinkering that people sometimes do in order to optimize their experiments and/or maximize the effects they want to measure. This process is typically referred to as “piloting” (not sure why really – what does this have to do with flying a plane?). It is again highly relevant to our previous discussion of preregistration. This is perhaps the point where preregistration of an experimental protocol might have its use: first do lots of tinker-explore-piloting to optimize the ways to address an experimental question, then preregister this optimized protocol to do a real study to answer the question and strictly follow the protocol. Of course, as I argued last time, you could instead just publish the tinkered experiments and then you or someone else can try to replicate them using the previously published protocol. If you want to preregister those efforts, be my guest. I am just not convinced it is necessary or even particularly helpful.

Thus part of the natural scientific process will inevitably lead to what appears like publication bias. I think this is still pretty rare, in neuroimaging studies at least. Another nugget of wisdom about imaging that Sam has learned from his teachers, and which he is trying to impart to his own students, is that in neuroimaging you can’t just constantly fiddle with your experimental paradigm. If you do so you will not only run out of money pretty quickly but also end up with lots of useless data that cannot be combined in any meaningful way. Again, I am sure some of these things happen (maybe some people are just really unscrupulous about combining data that really don’t belong together) but I doubt that this is extremely common.

So perhaps the most likely inflation of effect sizes in a lot of research stems from questionable research practices often called “p-hacking”, for example trying different forms of outlier removal or different analysis pipelines and only reporting the one producing the most significant results. As I discussed previously, preregistration aims to control for this by forcing people to be upfront about which procedures they planned to use all along. However, a simpler alternative is to ask authors to demonstrate the robustness of their findings across a reasonable range of procedural options, as sketched below. This achieves the same thing without requiring the large structural change of implementing a preregistration system.
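A robustness report could be as simple as the following sketch (simulated data and arbitrary outlier thresholds, purely for illustration): run the same test under every defensible analysis choice and report all of the outcomes rather than just the most flattering one.

```python
# Rerun the same comparison under several defensible outlier-removal rules
# and report every outcome instead of the single most significant one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(0.4, 1.0, 40)    # simulated groups with a modest true difference
group_b = rng.normal(0.0, 1.0, 40)

def trim(x, z_cut):
    if z_cut is None:
        return x
    z = (x - x.mean()) / x.std(ddof=1)
    return x[np.abs(z) < z_cut]

for z_cut in (None, 3.0, 2.5, 2.0):
    t, p = stats.ttest_ind(trim(group_a, z_cut), trim(group_b, z_cut))
    label = "no outlier removal" if z_cut is None else f"exclude |z| > {z_cut}"
    print(f"{label:>22s}: t = {t:.2f}, p = {p:.3f}")
```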

However, while I believe some of the claims about inflated effect sizes in the literature are most likely true, I think there is a more nefarious problem with the statistical approach to inferring such biases. It lies in its very nature, namely that it is based on statistics. Statistical tests are about probabilities. They don’t constitute proof. Just like science at large, statistics never prove anything, except perhaps for the rare situations where something is either impossible or certain – which typically renders statistical tests redundant.

There are also some fundamental errors in the rationale behind some of these procedures. To make an inference about the power of an experiment based on the strength of the observed result is to incorrectly assign a probability to an event after it has occurred. The probability of an observed event occurring is 1 – it is completely irrelevant how unlikely it was a priori. Proponents of this approach try to weasel out of this conundrum by assuming that the true effect size is of a similar magnitude as the one observed in the published experiment and using this to compute the power of the experiment. This assumption is untenable because the true effect size is almost certainly not the one that was observed. There is a lot more to be said about this state of affairs but I won’t go into it because others have already summarized many of the arguments much better than I could.

In general I simply wonder how good these statistical procedures actually are at estimating true underlying effects in practice. Simulations are no doubt necessary to evaluate a statistical method because we can work with known ground truths. However, they can only ever be approximations of the real situations encountered in experimental research. While the statistical procedures for publication bias probably seem to make sense in simulations, their validity for real experiments remains completely untested. In essence, they are bad science because they aim to show an effect without a control condition, which is really quite ironic. The very least I would expect to see from these efforts is some proof that the methods actually work for real data. Say we set up a series of 10 experiments for an effect we can be fairly confident actually exists, for example the Stroop effect or the fact that visual search performance for a feature singleton is independent of set size while searching for a conjunction of features is not. Will all or most of these 10 experiments come out significant? And if so, will the “excess significance test” detect publication bias?
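A back-of-the-envelope simulation suggests the first question is far from rhetorical. With an assumed within-subject effect of d = 0.8 and 30 subjects per experiment (both numbers invented for illustration, not measured Stroop parameters), an unbroken run of 10 significant results is entirely plausible:

```python
# Simulate 10 within-subject experiments on a genuine, large effect and count
# how often all of them come out significant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
d_true, n, n_experiments, n_runs = 0.8, 30, 10, 2000

all_significant = 0
for _ in range(n_runs):
    p_values = [stats.ttest_1samp(rng.normal(d_true, 1.0, n), 0).pvalue
                for _ in range(n_experiments)]
    all_significant += all(p < 0.05 for p in p_values)

print(f"P(all {n_experiments} experiments significant) ≈ {all_significant / n_runs:.2f}")
```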

Whatever the outcome of such experiments on these tests, one thing I already know: any procedure that claims to find evidence that over four out of five published studies should not be believed is not itself to be believed. While we can’t really draw firm conclusions from this, the fact that this rate is the same in two different applications of this procedure certainly seems suspicious to me. Either it is not working as advertised or it is detecting something trivial we should already know. In either case, it is superfluous.

I also want to question a more fundamental problem with this line of thinking. Most of these procedures and demonstrations of how horribly underpowered scientific research is seem to make a very sweeping assumption: that all scientists are generally stupid. Researchers are not automatons that blindly stab in the dark in the hope that they will find a “significant” effect. Usually scientists conduct research to test some hypothesis that is more or less reasonable. Even the most exploratory wild goose chases (and I have certainly heard of some) will make sense at some level. Thus the carefully concocted arguments about the terrible false discovery rates in research probably vastly underestimate the probability that hypothesized effects actually exist; there is, after all, “reason to think that half the tests we do in the long run will have genuine effects.”

Naturally, it is hard to put concrete numbers on this. For some avenues of research it will no doubt be lower. Perhaps for many hypotheses tested by high-impact studies the probability may be fairly low, reflecting the high risk and surprise factor of these results. For drug trials the 10% figure may be close to the truth. For certain effects, such as those of precognition, telepathy, or homeopathy, I agree with Sam Schwarzkopf, Alex Holcombe, and David Colquhoun (to name but a few) that the probability that they exist is extremely low. But my guess is that in many fields the probability that hypothesized effects exist ought to be better than a coin toss.

Wine

To cure science the Devil’s Neuroscientist prescribes a generous dose of this potion (produced at farms like this one in New Zealand)

Healthier science

I feel I have sufficiently argued that science isn’t actually sick, so I don’t think we need to rack our brains about possible means to cure it. However, this doesn’t imply we can’t do better. We can certainly aim to keep science healthy or make it even healthier.

So what is to be done? As I have already argued, I believe the most important step we should take is to encourage replication and a polite but critical scrutiny of scientific claims.  I also believe that at the root of most of the purported problems with science these days is the way we evaluate impact and how grants are allocated. Few people would say that the number of high impact publications on a resume tells us very much about how good a scientist a person is. Does anyone? I’m sure nobody truly believes that the number of downloads or views or media reports a study receives tells us anything about its contribution to science.

And yet I think we shouldn’t only value those scientists who conduct dry, incremental research. I don’t know what a good measure of a researcher’s contribution to their field would be. Citations are not perfect but they are probably a good place to start. There probably is no good way other than hearsay and personal experience to really know how careful and skilled a particular scientist is in their work.

What I do know is that the replicability of one’s research and the correctness of one’s hypotheses alone aren’t a good measure. The most influential scientists can also be the ones who make some fundamental errors. And there are some brilliant scientists, whose knowledge is far greater than mine (or Sam’s) will ever be and whose meticulousness and attention-to-detail would put most of us to shame – but they can and do still have theories that will turn out to be incorrect.

If we follow down that dead end the Crusaders for True Science have laid out for us, if we trust only preregistered studies and put on pedestals those who are fortunate (or risk-averse) enough to only do research that ends up being replicated, in short, if we only regard “truth” in science, we will emphasize the wrong thing. Then science will really be sick and frail and it will die a slow, agonizing death.

¹ Proponents of preregistration keep reminding us that “nobody” suggests that preregistration should be mandatory or that it should be for all studies. I want to ask these people: what do you think will happen if preregistration becomes commonplace? How would you regard non-registered studies? What kinds of studies should not be preregistered?

² The Devil’s Neuroscientist recently discovered she is a woman but unlike other extra-dimensional entities the Devil’s Neuroscientist is not “whatever it wants to be.”

³ Does anyone still know what a record is? Or perhaps in this day and age they know again?

The Pipedream of Preregistration

(Disclaimer: As this blog is still new, I should reiterate that the opinions presented here are those of the Devil’s Neuroscientist, which do not necessarily overlap with those of my alter ego, Sam Schwarzkopf)

In recent years we have often heard that science is sick. Especially my own field, cognitive neuroscience and psychology, is apparently plagued by questionable research practices and publication bias. Our community abounds with claims that most scientific results are “false” due to lack of statistical power. We are told that “p-hacking” strategies are commonly used to explore the vast parameter space of experiments and analyses in order to squeeze the last drop of statistical significance out of the data. And hushed (and sometimes quite loud) whispers in the hallways of our institutions, in journal club sessions, and at informal chats at conferences tell of many a high impact study that has repeatedly failed to be replicated, while these failed replications vanish into the bottom of the proverbial file drawer.

Many brave souls have taken up the banner of fighting against this horrible state of affairs. There has been a whole spate of replication attempts of high impact research findings, the open access movement aims to make it easier to publish failed replications, and many proposals have been put forth to change the way we make statistical inferences from our data. These are all large topics in themselves and I will probably tackle them in later posts on this blog.

For my first post though I instead want to focus on the preregistration of experimental protocols. This is the proposal that all basic science projects should be preregistered publicly with an outline of the scientific question and the experimental procedures, including the analysis steps. The rationale behind this idea is that questionable research practices, or even just fairly innocent flexibility in procedures (“researcher degrees of freedom”) that could skew results and inflate false positives, will be more easily controlled. The preregistration idea has been making the rounds during the past few years and it is beginning to be implemented, both in the form of open repositories and at some journals. In addition to improving the validity of published research, preregistration is also meant as an assurance that failed replications are published, because acceptance – and publication – of a study does not hinge on how strong and clean the results are but only on whether the protocol was sound and whether it was followed.

These all sound like very noble goals and there is precedent for such preregistration systems in clinical trials. So what is wrong with this notion? Why does this proposal make the Devil’s Neuroscientist anxious?

Well, I believe it is horribly misguided, that it cannot possibly work, and that – in the best-case scenario – it will make no difference to the ills of science and its community. I think that, well-intentioned as the preregistration idea may be, it actually results from people’s ever-shortening 21st-century attention spans: they can’t accept that science is a gradual and iterative process that takes decades, sometimes centuries, to converge on a solution.

Basic science isn’t clinical research

There is a world of difference between the aims of basic scientific exploration and clinical trials. I can get behind the idea that clinical tests, say of new drugs or treatments, ought to be conservative and minimize false positives in the results. Flexibility in the way data are collected and analyzed, how outliers are treated, how side effects are assessed, and so on, can seriously hamper the underlying goal: finding a treatment that actually works well.

I can even go so far as to accept that a similarly strict standard ought to be applied to preclinical research, say animal drug tests. The Devil’s Neuroscientist may work for the Evil One but she is not without ethics. Any research that is meant to test the validity of an approach that can serve the greater good should probably be held to a strict standard.

However, this does not apply to basic research. Science is the quest to explain how the universe works. Exploration and tinkering are at the heart of this endeavor. In fact, I want to see more of this, not less. In my experience (or rather my alter ego’s experience – the Devil’s Neuroscientist is a mischievous demon possessing Sam’s mind and at the time of writing this she is only a day old – but they share the same memories) it is one of the major learning experiences most graduate students and postdocs go through to analyze their data to death.

By tweaking all the little parameters, turning on every dial, and looking at a problem from numerous angles we can get a handle on how robust and generalizable our findings truly are. In every student’s life sooner or later there comes a point where this behavior leads them to the conclusion that their “results are spurious and don’t really show anything of interest whatsoever.” I know, because I have been to that place (or at least my alter ego has).

This is not a bad thing. On the contrary, I believe it is actually essential for good science. Truly solid results will survive even the worst data massaging, to borrow a phrase Sam’s PhD supervisor used to say. It is crucial that researchers really know their data inside out. And it is important to understand the many ways an effect can be made to disappear, and conversely the ways data massaging can lead to “significant” effects that aren’t really there.

Now this last point underlines that data massaging can indeed be used to create false or inflated research findings, and this is what people mean when they talk about researcher degrees of freedom or questionable research practices. You can keep collecting data, peeking at your significance level at every step, and then stop when you have a significant finding (this is known as “optional stopping” or “data peeking”). This approach will quite drastically inflate the false positives in a body of evidence and yet such practice may be common. And there may be (much) worse things out there, like the horror story someone (and I have reason to believe them) told me of a lab where the standard operating mode was to run a permutation analysis by iteratively excluding data points to find the most significant result. I neither know who these people were nor where this lab is. I also don’t know if this practice went on with or without the knowledge of the principal investigator. It is certainly not merely a “questionable” research practice but has crossed the line into outright fraudulence. As the person who told me of this pointed out, if someone is clever enough to do this, it seems likely that they also know that it is wrong. The only difference between doing this and actually making up your data from thin air and eating all the experimental stimuli (candy) to cover up the evidence is that it actually uses real data – but it might as well not, for all the validity we can expect from it.

But the key thing to remember here is that this is deep inside the realm of the unethical, certainly on some level of scientific hell (and no, even though I work for the Devil this doesn’t mean I wish to see anybody in scientific hell). Preregistration isn’t going to stop fraud. Cheaters gonna cheat. Yes, the better preregistration systems actually require the inclusion of a “lab log” and possibly all the acquired data as part of the completed study. But does anyone believe that this is really going to work to stop a fraudster? Someone who regularly makes use of a computer algorithm to produce the most significant result isn’t going to bat an eyelid at dropping a few data points from their lab log. What is a lab log anyway? Of course we keep records of our experiments, but unless we introduce some (probably infeasible) Orwellian scheme in which every single piece of data is recorded in a transparent, public way (and there have been such proposals), there is very little to stop a fraudster from forging the documentation for their finished study. And you know what, even in that Big Brother world of science a fraudster would find a way to commit fraud.

Most proponents of preregistration know this and are wont to point out that preregistration isn’t meant to stop outright fraud but to curb questionable research practices – the data massaging that isn’t really fraudulent but still inflates false positives. These practices may stem from the pressure to publish high-impact studies or may even simply be due to ignorance. I certainly believe that data peeking falls into the latter category, or at least did before it was widely discussed. I think it is also common because it is intuitive: we want to know whether a result is meaningful and to collect sufficient data to be sure of it.

The best remedy against such things is to educate people about what is and isn’t acceptable. It also underlines the importance of deriving better analysis methods that are not susceptible to data peeking. There have been many calls to abandon null hypothesis significance testing altogether. That discussion may be the topic of another post by the Devil’s Neuroscientist in the future, as there are a lot of myths surrounding it. For now, I certainly agree that we can do better and that there are ways – which may or may not be Bayesian – to use a principled stopping criterion to improve the validity of scientific findings.
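To illustrate the general idea – and only the idea; nothing here is a specific recommendation, and the thresholds, sample limits, and the crude BIC approximation to the Bayes factor are just convenient assumptions – here is a rough sketch of a sequential design in which data collection stops once the evidence favours either hypothesis by a preset margin:

```python
# Rough sketch of a sequential stopping rule based on an approximate Bayes factor
# for a one-sample test (mean = 0 vs. a free mean), using the crude BIC
# approximation. Thresholds, sample limits, and step size are arbitrary choices.
import numpy as np

def bf01_bic(x):
    """Approximate BF01 (evidence for the null) from the BIC difference."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    rss_null = np.sum(x ** 2)                  # mean fixed at 0, only variance estimated
    rss_alt = np.sum((x - x.mean()) ** 2)      # free mean and variance
    bic_null = n * np.log(rss_null / n) + 1 * np.log(n)
    bic_alt = n * np.log(rss_alt / n) + 2 * np.log(n)
    return np.exp((bic_alt - bic_null) / 2)    # BF01 ~ p(data|H0) / p(data|H1)

def sequential_test(sampler, n_min=20, n_max=200, step=10, threshold=6.0):
    """Keep sampling until the evidence favours either hypothesis by `threshold`."""
    data = list(sampler(n_min))
    while True:
        bf01 = bf01_bic(data)
        if bf01 >= threshold:
            return "support for H0", len(data), bf01
        if bf01 <= 1 / threshold:
            return "support for H1", len(data), bf01
        if len(data) >= n_max:
            return "inconclusive", len(data), bf01
        data.extend(sampler(step))

rng = np.random.default_rng(1)
print(sequential_test(lambda n: rng.normal(0.0, 1.0, n)))   # the null is true
print(sequential_test(lambda n: rng.normal(0.5, 1.0, n)))   # a real effect exists
```

Unlike p-value peeking, a symmetric rule like this can also stop in favour of the null rather than only in favour of an effect, which is much of its appeal.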

Rather than mainly sticking to a preregistered script, I think we should encourage researchers to explore the robustness of their data by publishing additional analyses. This is what supplementary materials can be good for. If you remove outliers in your main analysis, show what the result looks like without this step, or at least include the data. If you have several different approaches, show them all. Readers can make up their own minds about whether a result is meaningful, and the more information they have, the sounder their judgment should be.
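As a toy illustration of the kind of robustness check I mean – the data, the correlation analysis, and the 3-standard-deviation exclusion rule below are all arbitrary, purely illustrative choices – one could simply report the same analysis with and without the exclusion step:

```python
# Toy robustness check: report the same correlation with and without an
# (arbitrary, illustrative) outlier-exclusion rule, so readers can judge how
# much the conclusion depends on that choice.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=40)
y = 0.3 * x + rng.normal(size=40)
y[0] += 6.0                                   # plant one extreme value to make the point

def report(label, xv, yv):
    r, p = stats.pearsonr(xv, yv)
    print(f"{label}: r = {r:.2f}, p = {p:.3f}, n = {len(xv)}")

report("all data points", x, y)

# Illustrative rule: drop observations more than 3 SD from the mean of y.
keep = np.abs(y - y.mean()) < 3 * y.std()
report("outliers (>3 SD of y) excluded", x[keep], y[keep])
```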

Most importantly, people should replicate findings and publish those replications. This is a larger problem: in the past it was difficult to publish replications. This situation has changed a lot, however, and replication attempts are now fairly common even in high-profile journals. No finding should ever be regarded as solid until it has stood the test of time through repeated and independent replication. Preregistration isn’t going to help with that. Making it more rewarding and interesting to publish replications will. There are probably still issues to be resolved regarding replication, although I don’t think the situation is as dire as it is often made out to be (and this will probably be the topic of another post in the future).

Like cats, scientists are naturally curious and always keen to explore the world

Replications don’t require preregistration

The question of replication brings me to the next point. My alter ego has argued in the past that preregistration may be particularly suited for replication attempts. On the surface this seems logical: for a replication we surely want to stick as closely as possible to the original protocol, so it is good to have that defined a priori.

While this is true, as I clawed my way out of the darkest reaches of my alter ego’s mind it dawned on me that for a replication the experimental protocol has already been published in the original study. All one needs to do is follow that protocol as closely as possible. Sure, in many publications the level of detail is insufficient for a replication, which is in part due to the ridiculously low word limits of many journals.

However, the solution to this problem is not preregistration, because preregistration doesn’t guarantee that the replication protocol is a close match to the original. Rather, we must improve the level of detail of our methods sections. They are, after all, meant to permit replication. Fortunately, many journals that were particularly guilty of this problem have taken steps to change it. I don’t even mind if methods are largely published in online supplementary materials. A proper evaluation of a study requires close inspection of the methods, but as long as they are easy to access I prefer detailed online methods to sparse methods sections inside published papers.

Proponents of preregistration also point out that preregistering an experiment with a journal helps the publication of replication attempts because it ensures that the study will be published regardless of the outcome. This is perhaps true, but at present not many journals actually offer preregistration. It also forces the authors’ hand as to where to publish, which will be a turn-off for many people.

Imagine that your study fails to replicate a widely publicized, sensational result. Surely this will be far more interesting to the larger scientific community than if your findings confirm the previous study. Both outcomes are actually of equal importance, and in truth two studies investigating the same effect don’t tell us all that much more about the actual effect than one study would. However, the authors may want to choose a high-profile journal for one outcome but not for the other (and the same applies to non-replication experiments). Similarly, the editors of a high-impact journal will be more interested in one kind of result than another. My guess is that PNAS was far keener to publish the failed replication of this study than it would have been had the replication confirmed the previous results.

While we still have the journal-based publication system, or until we find another way to decide where preregistered studies end up being published, preregistering a study with a journal forces the authors into a relationship with that journal. I predict that this is not going to appeal to many people.

We replicated the experiment two hours later, and found no evidence of this “sun in the sky” (BF01=10^100^100). We conclude that the original finding was spurious.

Ensuring the quality of preregistered protocols

Of course, we don’t need to preregister protocols with a journal; we could instead have a central repository where such protocols are uploaded. In fact, there is already such a place in the Open Science Framework, and, always keen to foster mainstream acceptance, a trial registry has also been set up for parapsychology. This approach would leave authors free to choose where the final study is published, but it comes at a cost: at present, at least, these repositories do not formally review the scientific quality of the proposals. At a journal, by contrast, the editors will invite expert reviewers to assess the merits of the proposed research and suggest changes to be implemented before data collection even begins.

Theoretically this is also possible at a centralized repository, but it is currently not done. It would also place a major burden on the peer review system. There is already an enormous number of research manuscripts out there waiting to be reviewed by someone (just ask the editors at the Frontiers journals). Preregistered protocols would probably inflate that workload massively, because it is substantially easier to draft a simple design document for an experiment than to write up a fully-fledged study with results, analysis, and interpretation. Incidentally, this is yet another way in which clinical trials differ from basic research: in a clinical trial you presumably already have a treatment or drug whose efficacy you want to assess, which at least somewhat limits the number of trials. In basic research all bets are off – you can have as many ideas for experiments as your imagination permits.

So what we will be left with is lots of preregistered experimental protocols that are either reviewed shoddily or not at all. Sam recently reviewed an EEG study claiming to have found neural correlates of telepathy. All the reviews are public, so everyone can read them. The authors of this study actually preregistered their experimental protocol at the Open Science Framework. The protocol was an almost verbatim copy of an earlier pilot study the authors had done, plus some minor changes added in the hope of improving the paradigm. However, the methods were described so sparsely and obscurely that assessing the research, let alone actually attempting a replication, was nigh on impossible. There were also some fundamental flaws in the analysis approach which indicated that the results in their entirety, including those from the pilot experiment, were purely artifactual. In other words, the protocol might as well not have been preregistered.

More recently, another study used preregistration (again without formal review) to conduct a replication attempt of a series of structural brain-behavior experiments. You can find a balanced and informative summary of the findings at this blog, and at the bottom of that page there is an extensive discussion, including comments by my alter ego. What these authors did was preregister the experimental protocol by uploading it to the webpage of one of the authors. They also sent the protocol to the authors of the original studies they wanted to replicate to seek their feedback. A minimal response (or none at all) was then taken as tacit agreement that the protocol was appropriate.

The scientific issues of this discussion are outside the scope of this post. Briefly, it turns out that, at least for some of the experiments in this study, the methods are only a modest match to those of the original studies. This should be fairly clear from reading the original methods sections. Whether or not the original authors “agreed” with the replication protocols (and it remains opaque what exactly that means), there is already a clear departure from the predefined script at the very outset.

It is of course true that a finding should be generalizable to be of importance, and robustness to minor variations in the approach should be part of that. For example, Daryl Bem recently argued that the failure to replicate his precognition results was due to the replicators not using his own stimulus software. This is not a defensible argument because, to my knowledge, the replication followed the methods outlined in the original study fairly closely. Again, this is what methods sections are for. If the effect can really only be revealed using the original stimulus software, this at the very least suggests that it doesn’t generalize. Before drawing any further conclusions it is therefore imperative to understand where the difference lies. It could be that the original software is somehow better at revealing the effect, but it could also have a hidden flaw that produces an artifact.

The same isn’t necessarily true in the case of these brain-behavior correlations. It could be, but at present we have no way of knowing. The methods of the original studies as published in the literature weren’t adhered to, so it is incorrect to even call this a direct replication. Some of the discrepancies could very well be the reason why the effect disappears in the replication – or, conversely, they could have introduced a spurious effect in the original studies.

This is where we come back to preregistration. One of the original authors was also a reviewer of this replication study of these brain-behavior correlations. His comments are included on that blog, and he elaborates further on his reviews in the discussion on that page. He says he proposed additional analyses to the replicators that are a closer match to the original methods, but that they refused to conduct them because such analyses would be exploratory and thus weren’t part of the preregistered protocol. This is odd, however, because several additional exploratory analyses are included (and clearly labeled as such) in the replication study. Moreover, the original author says he ran his proposed analysis on the replication data himself and that it in fact confirms the original findings. Indeed, another independent successful replication of one of these findings has been published but was not taken into account by this replication. As such it seems odd that only some of the exploratory methods are included in this replication, and someone more cynical than the Devil’s Neuroscientist (who is already pretty damn cynical) might call that cherry-picking.

What this example illustrates is that it is not very straightforward to evaluate and, ideally, improve a preregistered protocol. First of all, sending out the protocol to (some of) the original authors is not the same as obtaining solid agreement that the methods are appropriate. Moreover, refusing to take suggestions on board at a later stage, once data collection or analysis has already commenced, hinders good science. Take again my earlier example of the EEG study Sam reviewed. In his evaluation, the preregistered protocol was fundamentally flawed, resulting in completely spurious findings. The authors revised their manuscript and performed the different analyses Sam suggested, which essentially confirmed that there was no evidence of telepathy (although the authors never quite got to the point of conceding this).

Now, unlike the Devil’s Neuroscientist, Sam is merely a fallible human, and any expert can make mistakes. So you should probably take his opinion with a grain of salt. However, I believe that in this case his view of that experiment is entirely correct (but then again, I’m biased). What this means is that under a strictly adhered-to preregistration system the authors would have to perform the preregistered procedures even though they are completely inadequate. Any improvements, no matter how essential, would have to be presented as “additional exploratory procedures”.

This is perhaps a fairly extreme case, but it is not unrealistic. There may be many situations in which a collaborator or a reviewer (assuming the preregistered protocol is public) suggests an improvement to the procedures after data collection has started. In fact, if a reviewer makes such a suggestion, I would regard it as unproblematic to alter the design post-hoc. After all, preregistration is supposed to stop people from massaging their data, not from making independently advised improvements to the methods. Certainly, I would rather see well-designed, meticulously executed studies that retain a level of flexibility than a preregistered protocol that is deeply flawed.

The thin red line

Before I conclude this (rather long) first post on my blog, I want to discuss another aspect of preregistration that I predict will be its largest problem. As many proponents of preregistration never tire of stressing, preregistration does not preclude exploratory analyses or even entire exploratory studies. However, as I discussed in the previous sections, there are complications with this. My prediction is that almost all studies will contain a large amount of exploration. In fact, I am confident that the best studies will contain the most exploration because – as I wrote earlier – thorough exploration of the data is natural, useful, and should be encouraged.

For some studies it may be possible to predict many of the different angles from which to analyze the data. It may also be acceptable to correct small mistakes in the preregistered design post-hoc and clearly label these changes. By and large, however, I believe what will happen is that we will have a large number of preregistered protocols, but the final publications will in fact contain a lot of additional exploration – and some of the scientifically most interesting information will usually be found there.

How often have you carried out an experiment only to find that your beautiful hypotheses aren’t confirmed, that the clear predictions you made do not pan out? These are usually the most interesting results because they entice you to dig deeper, to explore your data, and generate new, better, more exciting hypotheses. Of course, none of this is prevented by preregistration but in the end the preregistered science is likely to be the least interesting part of the literature.

But there is another alternative. Perhaps preregistration will be enforced more strictly. Perhaps only a very modest amount of exploration will be permitted after all. Maybe preregistered protocols will have to contain every detailed step of the methods, including the metal screening procedure for MRI experiments; exact measurements of the ambient light level, temperature, and humidity in the behavioral testing room; and the exact words each experimenter will say to each participant before, during, and after the actual experiment, without any room for improvisation (or, dare I say, natural human interaction). It may be that only preregistered studies that closely follow their protocols will be regarded as good science.

This alternative strikes me as a nightmare scenario. Not only will this stifle creativity and slow the already gradual progress of science down to a glacial pace, it will also rob science of the sense of wonder that attracted many of us to this underpaid job in the first place.

The road to hell is paved with good intentions – Wait. Does this mean I should be in favor of it?