(Disclaimer: As this blog is still new, I should reiterate that the opinions presented here are those of the Devil’s Neuroscientist, which do not necessarily overlap with those of my alter ego, Sam Schwarzkopf)
In recent years we have often heard that science is sick. My own field, cognitive neuroscience and psychology, is apparently especially plagued by questionable research practices and publication bias. Our community abounds with claims that most scientific results are “false” due to lack of statistical power. We are told that “p-hacking” strategies are commonly used to explore the vast parameter space of experiments and analyses in order to squeeze the last drop of statistical significance out of the data. And hushed (and sometimes quite loud) whispers in the hallways of our institutions, in journal club sessions, and at informal chats at conferences tell of many a high-impact study that has repeatedly failed to replicate, while these failed replications vanish in the bottom of the proverbial file drawer.
Many brave souls have taken up the banner of fighting against this horrible state of affairs. There has been a whole spate of replication attempts of high-impact research findings, the open access movement aims to make it easier to publish failed replications, and many proposals have been put forth to change the way we make statistical inferences from our data. These are all large topics in themselves and I will probably tackle them in later posts on this blog.
For my first post though I instead want to focus on the preregistration of experimental protocols. This is the proposal that all basic science projects should be preregistered publicly with an outline of the scientific question and the experimental procedures, including the analysis steps. The rationale behind this idea is that questionable research practices, or even just fairly innocent flexibility in procedures (“researcher degrees of freedom”) that could skew results and inflate false positives, will be more easily controlled. The preregistration idea has been making the rounds during the past few years and it is beginning to be implemented both in the forms of open repositories and some journals. In addition to fixing the validity of published research, preregistration is also meant as an assurance that failed replications are published because acceptance – and publication – of a study does not hinge on how strong and clean the results are but only on whether the protocol was sound and whether it was followed.
These all sound like very noble goals and there is precedent for such preregistration systems in clinical trials. So what is wrong with this notion? Why does this proposal make the Devil’s Neuroscientist anxious?
Well, I believe it is horribly misguided, that it cannot possibly work, and that – in the best-case scenario – it will make no difference to science-society’s ills. I think that well-intentioned as the preregistration idea may be, it actually results from people’s ever shortening 21st century attention spans because they can’t accept that science is a gradual and iterative process that takes decades, sometimes centuries, to converge on a solution.
Basic science isn’t clinical research
There is a world of difference between the aims of basic scientific exploration and clinical trials. I can get behind the idea that clinical tests, say of new drugs or treatments, ought to be conservative, minimizing the false positives in the results. Flexibility in the way data are collected and analyzed, how outliers are treated, how side effects are assessed, and so on, can seriously hamper the underlying goal: finding a treatment that actually works well.
I can even go so far as to accept that a similarly strict standard ought to be applied to preclinical research, say animal drug tests. The Devil’s Neuroscientist may work for the Evil One but she is not without ethics. Any research that is meant to test the validity of an approach that can serve the greater good should probably be held to a strict standard.
However, this does not apply to basic research. Science is the quest to explain how the universe works. Exploration and tinkering are at the heart of this endeavor. In fact, I want to see more of this, not less. In my experience (or rather my alter ego’s experience – the Devil’s Neuroscientist is a mischievous demon possessing Sam’s mind and at the time of writing this she is only a day old – but they share the same memories), analyzing your data to death is one of the major learning experiences most graduate students and postdocs go through.
By tweaking all the little parameters, turning on every dial, and looking at a problem from numerous angles we can get a handle on how robust and generalizable our findings truly are. In every student’s life sooner or later there comes a point where this behavior leads them to the conclusion that their “results are spurious and don’t really show anything of interest whatsoever.” I know, because I have been to that place (or at least my alter ego has).
This is not a bad thing. On the contrary, I believe it is actually essential for good science. Truly solid results will survive even the worst data massaging, to borrow a phrase Sam’s PhD supervisor used to say. It is crucial that researchers really know their data inside out. And it is important to understand the many ways an effect can be made to disappear, and conversely the ways data massaging can lead to “significant” effects that aren’t really there.
Now this last point underlines that data massaging can indeed be used to create false or inflated research findings, and this is what people mean when they talk about researcher degrees of freedom or questionable research practices. You can keep collecting data, peeking at your significance level at every step, and then stop when you have a significant finding (this is known as “optional stopping” or “data peeking”). This approach will quite drastically inflate the false positives in a body of evidence, and yet such practice may be common. And there may be (much) worse things out there, like the horror story someone (and I have reason to believe them) told me of a lab where the standard operating mode was to run a permutation analysis by iteratively excluding data points to find the most significant result. I neither know who these people were nor where this lab is. I also don’t know if this practice went on with or without the knowledge of the principal investigator. It is certainly not merely a “questionable” research practice but has crossed the line into outright fraudulence. As the person who told me of this pointed out, if someone is clever enough to do this, it seems likely that they also know that it is wrong. The only difference between doing this and actually making up your data from thin air and eating all the experimental stimuli (candy) to cover up the evidence is that it actually uses real data – but it might as well not, for all the validity we can expect from it.
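To make the inflation from optional stopping concrete, here is a minimal simulation (a sketch with hypothetical numbers, pure standard-library Python): the null hypothesis is true in every simulated study, yet testing for significance after every batch of participants and stopping at the first “significant” result inflates the nominal 5% false positive rate several-fold.

```python
import math
import random

def significant(xs):
    """Two-sided z-test of mean = 0, known unit variance, alpha = .05."""
    z = sum(xs) / math.sqrt(len(xs))
    return abs(z) > 1.96

def false_positive_rate(peek, n_sims=2000, n_max=100, step=10, seed=1):
    """Simulate studies where the null is true (all data are pure noise).
    With peek=True we test after every batch of `step` participants and
    stop at the first significant result; with peek=False we test only
    once, at the final sample size."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        xs = []
        rejected = False
        while len(xs) < n_max:
            xs.extend(rng.gauss(0.0, 1.0) for _ in range(step))
            if peek and significant(xs):
                rejected = True
                break
        hits += rejected if peek else significant(xs)
    return hits / n_sims

print(f"fixed-N false positive rate: {false_positive_rate(peek=False):.3f}")
print(f"peeking false positive rate: {false_positive_rate(peek=True):.3f}")
```

The fixed-N rate hovers around the nominal .05, while the peeking rate climbs well above it, even though every individual test was a perfectly ordinary one.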
But the key thing to remember here is that this is deep inside the realm of the unethical, certainly on some level of scientific hell (and no, even though I work for the Devil this doesn’t mean I wish to see anybody in scientific hell). Preregistration isn’t going to stop fraud. Cheaters gonna cheat. Yes, the better preregistration systems actually require the inclusion of a “lab log” and possibly all the acquired data as part of the completed study. But does anyone believe that this is really going to work to stop a fraudster? Someone who regularly makes use of a computer algorithm to produce the most significant result isn’t going to bat an eyelid at dropping a few data points from their lab log. What is a lab log anyway? Of course we keep records of our experiments, but unless we introduce some (probably infeasible) Orwellian scheme in which every single piece of data is recorded in a transparent, public way (and there have been such proposals), there is very little to stop a fraudster from forging the documentation for their finished study. And you know what, even in that Big Brother world of science a fraudster would find a way to commit fraud.
Most proponents of preregistration know this and are wont to point out that preregistration isn’t to stop outright fraud but questionable research practices – the data massaging that isn’t really fraudulent but still inflates false positives. These practices may stem from the pressure to publish high impact studies or may even simply be due to ignorance. I certainly believe that data peeking falls into the latter category or at least did before it was widely discussed. I think it is also common because it is intuitive. We want to know if a result is meaningful and collect sufficient data to be sure of it.
The best remedy against such things is to educate people about what is and isn’t acceptable. It also underlines the importance of deriving better analysis methods that are not susceptible to data peeking. There have been many calls to abandon null hypothesis significance testing altogether. This discussion may be the topic of another post by the Devil’s Neuroscientist in the future as there are a lot of myths about this point. However, at this point I certainly agree that we can do better and that there are ways – which may or may not be Bayesian – to use a stopping criterion to improve the validity of scientific findings.
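As one illustration of such a stopping criterion (a sketch under simplifying assumptions, not a prescription: unit-variance data and a normal prior on the effect size are assumed here for tractability), one can monitor a sequential Bayes factor after every batch and stop once the evidence clearly favors either hypothesis. Unlike p-value peeking, the probability of stopping with strong but misleading evidence for an effect when the null is true remains bounded, however often you look.

```python
import math
import random

def bf10(xbar, n, tau2=1.0):
    """Bayes factor for H1: mu ~ N(0, tau2) against H0: mu = 0,
    assuming data points are N(mu, 1) with known unit variance.
    Computed from the sampling density of the sufficient statistic xbar."""
    def norm_pdf(x, var):
        return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    return norm_pdf(xbar, tau2 + 1.0 / n) / norm_pdf(xbar, 1.0 / n)

def misleading_evidence_rate(n_sims=2000, n_max=100, step=10,
                             threshold=10.0, seed=2):
    """Fraction of null simulations that stop with strong evidence *for*
    an effect when we check the Bayes factor after every batch."""
    rng = random.Random(seed)
    misleading = 0
    for _ in range(n_sims):
        xs = []
        while len(xs) < n_max:
            xs.extend(rng.gauss(0.0, 1.0) for _ in range(step))
            bf = bf10(sum(xs) / len(xs), len(xs))
            if bf >= threshold:        # strong (but wrong) evidence for H1
                misleading += 1
                break
            if bf <= 1.0 / threshold:  # strong evidence for the null: stop
                break
    return misleading / n_sims

print(f"misleading-evidence rate under the null: {misleading_evidence_rate():.3f}")
```

Because the Bayes factor is a likelihood ratio, the chance of it ever exceeding a threshold of 10 under the null is at most 1 in 10, and in practice it stays well below that even with constant monitoring.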
Rather than mainly sticking to a preregistered script, I think we should encourage researchers to explore the robustness of their data by publishing more additional analyses. This is what supplementary materials can be good for. If you remove outliers in your main analysis, show what the result looks like without this step or at least include the data. If you have several different approaches, show them all. The reader can make up their own mind about whether a result is meaningful and their judgment should be all the more correct the more information there is.
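As a trivial sketch of what such transparency could look like in practice (the data, cutoff, and function name here are all hypothetical), one can simply report a summary statistic both with and without the exclusion step, so readers see exactly how much the outlier removal matters:

```python
import random
import statistics

def mean_with_and_without_outliers(data, z_cut=2.5):
    """Report the mean both before and after removing outliers
    (here defined as points beyond z_cut standard deviations),
    so readers can judge how much the exclusion step matters."""
    m = statistics.mean(data)
    sd = statistics.stdev(data)
    kept = [x for x in data if abs(x - m) <= z_cut * sd]
    return {
        "mean_all": m,
        "mean_trimmed": statistics.mean(kept),
        "n_all": len(data),
        "n_removed": len(data) - len(kept),
    }

# hypothetical reaction-time sample with a couple of extreme values
rng = random.Random(42)
rts = [rng.gauss(500, 50) for _ in range(40)] + [1200, 1500]
report = mean_with_and_without_outliers(rts)
print(report)
```

If the trimmed and untrimmed means tell the same story, so much the better; if they diverge, that divergence is itself information the reader deserves to have.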
Most importantly, people should replicate findings and publish those replications. This is a larger problem and in the past it had been difficult to publish replications. This situation has changed a lot however and replication attempts are now fairly common even in high profile journals. No finding should ever be regarded as solid until it has stood the test of time after repeated and independent replication. Preregistration isn’t going to help with that. Making it more rewarding and interesting to publish replications will. There are probably still issues to be resolved regarding replication although I don’t think the situation is as dire as it is often made out to be (and this will probably be the topic of another post in the future).
Replications don’t require preregistration
The question of replication brings me to the next point. My alter ego has argued in the past that preregistration may be particularly suited for replication attempts. On the surface this seems logical because for a replication surely we want to stick as closely as possible to the original protocol, so it is good to have that defined a priori.
While this is true, as I clawed my way out of the darkest reaches of my alter ego’s mind, it dawned on me that for replications the original experimental protocol is already published in the original study. All one needs to do is to follow the original protocol as closely as possible. Sure, in many publications the level of detail may be insufficient for a replication, which is in part due to the ridiculously low word limits of many journals.
However, the solution for this problem is not preregistration because preregistration doesn’t guarantee that the replication protocol is a close match to the original. Rather we must improve the level of detail of our methods sections. They are after all meant to permit replication. Fortunately, many journals that were particularly guilty of this problem have taken steps to change this. I don’t even mind if methods are largely published in online supplementary materials. A proper evaluation of a study requires close inspection of the methods but as long as they are easy to access I prefer detailed online methods to sparse methods sections inside published papers.
Proponents of preregistration also point out that having an experiment preregistered with a journal helps the publication of replication attempts because it ensures that the study will be published regardless of the outcome. This is perhaps true but at present there are not many journals that actually permit preregistration. It also forces the authors’ hands as to where to publish, which will be a turn off for many people.
Imagine that your finding fails to replicate a widely publicized, sensational result. Surely this will be far more interesting to the larger scientific community than if your findings confirm the previous study. Both are actually of equal importance, and in truth having two studies investigating the same effect isn’t actually telling us much more about the actual effect than one study would. However, the authors may want to choose a high-profile journal for one outcome but not for another (and the same applies to non-replication experiments). Similarly, the editors of a high-impact journal will be more interested in one kind of result than another. My guess is PNAS was far keener to publish the failed replication of this study than it would have been if the replication had confirmed the previous results.
While we still have the journal-based publication system or until we find another way for journals to decide where preregistered studies are published, preregistering studies with a journal forces the authors into a relationship with that journal. I predict that this is not going to appeal to many people.
Ensuring the quality of preregistered protocols
Of course, we don’t need to preregister protocols with a journal but we could have a central repository where such protocols are uploaded. In fact, there is already such a place in the Open Science Framework, and, always keen to foster mainstream acceptance, a trial registry has also been set up for parapsychology. This approach would free the authors with regard to where the final study is published but this comes at a cost: at least at present these repositories do not formally review the scientific quality of the proposals. At least at a journal the editors will invite expert reviewers to assess the merits of the proposed research and suggest possible changes to be implemented before the data collection even begins.
Theoretically this is also possible at a centralized repository, but this is currently not done. It would also place a major burden on the peer review system. There already are an enormous number of research manuscripts out there that want to be reviewed by someone (just ask the editors at the Frontiers journals). Preregistration would probably inflate that workload massively because it is substantially easier to draft a simple design document for an experiment than it is to write up a fully-fledged study with results, analysis, and interpretation. Incidentally, this is yet another way in which clinical trials differ from basic research: in clinical trials you presumably already have a treatment or drug whose efficacy you want to assess. This limits the number of trials somewhat at least. In basic research all bets are off – you can have as many ideas for experiments as your imagination permits.
So what we will be left with is lots of preregistered experimental protocols that are either reviewed shoddily or not at all. Sam recently reviewed an EEG study claiming to have found neural correlates of telepathy. All the reviews are public so everyone can read them. The authors of this study actually preregistered their experimental protocol at the Open Science Framework. The protocol was an almost verbatim copy of an earlier pilot study the authors had done, plus some minor changes that were added in the hope to improve the paradigm. However, the level of detail in the methods was so sparse and obscure that it made assessment of the research, let alone an actual replication attempt, nigh impossible. There were also some fundamental flaws in the analysis approach that indicated that the entire results, including those from the pilot experiment, were purely artifactual. In other words, the protocol might as well not have been preregistered.
More recently, another study used preregistration (again without formal review) to conduct a replication attempt of a series of structural brain-behavior experiments. You can find a balanced and informative summary of the findings at this blog and at the bottom you will find an extensive discussion, including comments by my alter ego. What these authors did was to preregister the experimental protocol by uploading it to the webpage of one of the authors. They also sent the protocol to the authors of the original studies they wanted to replicate to seek their feedback. Minimal (or lack of) response was then assumed to be tacit agreement that the protocol was appropriate.
The scientific issues of this discussion are outside the scope of this post. Briefly, it turns out that at least for some of the experiments in this study the methods are only a modest match with those of the original studies. This should be fairly clear from reading the original methods sections. Whether or not the original authors “agreed” with the replication protocols (and it remains opaque just what that means exactly), there is already a clear departure from the actually predefined script at the very outset.
It is of course true that a finding should be generalizable to be of importance and robustness to certain minor variations in the approach should be part of that. For example, it was recently argued by Daryl Bem that failure to replicate his precognition results was due to the fact that the replicators did not use his own stimulus software for the replication. This is not a defensible argument because to my knowledge the replication followed the methods outlined in the original study fairly closely. Again, this is what methods sections are for. Therefore, if the effect can really only be revealed by using the original stimulus software this at the very least suggests that it doesn’t generalize. Before drawing any further conclusions it is therefore imperative to understand where the difference lies. It could certainly be that the original software is somehow better at revealing the effect, but it could also mean that it has a hidden flaw resulting in an artifact.
The same isn’t necessarily true in the case of these brain-behavior correlations. It could be true, but at present we have no way of knowing it. The methods of the original study as published in the literature weren’t adhered to, so it is incorrect to even call this a direct replication. Some of the discrepancies could very well be the reason why the effect disappears in the replication or conversely could also introduce a spurious effect in the original studies.
This is where we come back to preregistration. One of the original authors was actually also a reviewer of this replication study of these brain-behavior correlations. His comments are also included on that blog and he further elaborates on his reviews in the discussion on that page. He suggests that he proposed additional analyses to the replicators that are actually a closer match to the original methods but they refused to conduct these analyses because they are exploratory and thus weren’t part of the preregistered protocol. However, this is odd because several additional exploratory analyses are included (and clearly labeled as such) in the replication study. Moreover, the original author suggests he conducted the analysis he suggested on the replication data and that this in fact confirms the original findings. Indeed, another independent successful replication of one of these findings was published but not taken into account by this replication. As such it seems odd that only some of the exploratory methods are included in this replication and some more cynical than the Devil’s Neuroscientist (who is already pretty damn cynical) might call that cherry picking.
What this example illustrates is that it is actually not very straightforward to evaluate and ideally improve a preregistered protocol. First of all, sending out the protocol to (some of) the original authors is not the same as obtaining solid agreement that the methods are appropriate. Moreover, not taking suggestions on board at a later stage, when data collection/analysis has already commenced, hinders good science. Take again my earlier example of the EEG study Sam reviewed. In his evaluation the preregistered protocol was fundamentally flawed, resulting in completely spurious findings. The authors revised their manuscript and performed different analyses Sam suggested, which essentially confirmed that there was no evidence of telepathy (although the authors never quite got to the point of conceding this).
Now unlike the Devil’s Neuroscientist, Sam is merely a fallible human and any expert can make mistakes. So you should probably take his opinion with a grain of salt. However, I believe in this case his view of that experiment is entirely correct (but then again I’m biased). What this means is that under a properly adhered-to preregistration system the authors would have to perform the preregistered procedures even though they are completely inadequate. Any improvements, no matter how essential, would have to be presented as “additional exploratory procedures”.
This is perhaps a fairly extreme case but not unrealistic. There may be many situations where a collaborator or a reviewer (assuming the preregistered protocol is public) will suggest an improvement over the procedures after data collection has started. In fact, if a reviewer makes this suggestion then I would regard altering the design post-hoc as unproblematic. After all, preregistration is supposed to stop people from data massaging, not from making independently advised improvements to the methods. Certainly, I would rather see well-designed, meticulously executed studies that have a level of flexibility than a preregistered protocol that is deeply flawed.
The thin red line
Before I conclude this (rather long) first post to my blog, I want to discuss another aspect of preregistration that I predict will be its largest problem. As many proponents of preregistration never tire of stressing, preregistration does not preclude exploratory analyses or even whole exploratory studies. However, as I discussed in the previous sections, there are actually complications with this. My prediction is that almost all studies will contain a large amount of exploration. In fact, I am confident that the best studies will contain the most exploration because – as I wrote earlier – thorough exploration of the data is actually natural, it is useful, and it should be encouraged.
For some studies it may be possible to predict many of the different angles from which to analyze the data. It may also be acceptable to modify small mistakes in the preregistered design post-hoc and clearly label these changes. However, by and large what I believe will happen is that we will have a large number of preregistered protocols but that the final publications will in fact contain a lot of additional exploration, and that some of the scientifically most interesting information will usually be in there.
How often have you carried out an experiment only to find that your beautiful hypotheses aren’t confirmed, that the clear predictions you made do not pan out? These are usually the most interesting results because they entice you to dig deeper, to explore your data, and generate new, better, more exciting hypotheses. Of course, none of this is prevented by preregistration but in the end the preregistered science is likely to be the least interesting part of the literature.
But there is also another alternative. Perhaps preregistration will be enforced more strictly. Perhaps only a very modest amount of exploration will be permitted after all. Maybe preregistered protocols will have to contain every detailed step of your methods, including the metal screening procedure for MRI experiments, exact measurements of the ambient light level, temperature, and humidity in the behavioral testing room, and the exact words each experimenter will say to each participant before, during, and after the actual experiment, without any room for improvisation (or, dare I say, natural human interaction). It may be that only preregistered studies that closely follow their protocol will be regarded as good science.
This alternative strikes me as a nightmare scenario. Not only will this stifle creativity and slow the already gradual progress of science down to a glacial pace, it will also rob science of the sense of wonder that attracted many of us to this underpaid job in the first place.