Of self-correction and selfless errors

I had originally planned to discuss today’s topic at a later point, perhaps as part of my upcoming post about the myths of replication. However, discussions surrounding my previous posts, as well as the ongoing focus on posthoc power analysis in the literature, led me to address this point now.

A central theme of my previous two posts was the notion that science is self-correcting and that replication and skepticism are the best tools at our disposal. Discussions my alter ego Sam has had with colleagues, as well as discussions in the comment section on this blog and elsewhere, reveal that many from the ranks of the Crusaders for True Science call that notion into question. In that context, I would like to thank a very “confused” commenter on my blog for referring me to an article I hadn’t read, which is literally entitled “Why science is not necessarily self-correcting?”. I would also like to thank Greg Francis, who commented on his application of statistical tests to detect possible publication bias in the literature. Recently I also became aware of another statistical procedure based on assumptions about statistical power, called the Replication Index, which was proposed as an alternative to the Test of Excess Significance used by Francis. I think these people are genuinely motivated by a selfless desire to improve the current state of science. This is a noble goal but I think their endeavor is fraught with errors and some potentially quite dangerous misunderstandings.

The Errors of Meta-Science

I will start with the statistical procedures for detecting publication bias and the assertion that most scientific findings are false positives. I call this entire endeavor “meta-science” because the name underlines the fundamental problem with this whole discussion. As I pointed out in my previous post, science is always wrong. It operates like a model-fitting procedure to gradually improve the explanatory and predictive value of our attempts to understand a complex universe. The point that people are missing in this entire debate about asserted false positive rates, non-reproducibility, and publication bias is that the methods used to make these assertions are themselves science. Thus these procedures suffer from the same problems as any scientific effort: they seek to approximate the truth but can never actually hope to reach it. Because they use scientific methods to understand the workings of the scientific method, the logic of the entire approach is circular.

Circular inference has recently received a bit of attention within neuroscience. I don’t know if the authors of this paper actually coined the term “voodoo correlations”. Perhaps they merely popularized it. The same logical fallacy has also been called “double-dipping”. However, all of this is really just circular reasoning and somewhat related to “begging the question”. It is more a problem with flawed logic than with science. Essentially, it is what happens when you use the same measurements to test the validity of your predictions as you did for making the predictions in the first place.

This logical fallacy can result in serious errors. However, in the real world it isn’t entirely avoidable and it isn’t always problematic as long as we are aware of its presence. For instance, a point most people are missing is that whenever they report something like t(31)=5.2, p<0.001, or a goodness-of-fit statistic, etc, they are using circular inference. They report an estimate of the effect size (be it a t-statistic, the goodness-of-fit, Cohen’s d or others) based on the observed data and then draw some sort of general conclusion from it. The goodness of a curve fit is literally calculated from the accuracy with which the model predicts the observed data. Just observing an effect size, say, a difference in some cognitive measure between males and females, can only tell you that this difference exists in your sample. You can make some probabilistic inferences about how this observed effect may generalize to the larger population, and this is what statistical procedures do – however, in truth you cannot know what an effect means for the general population until you have checked your predictions through empirical observations.

There are ways to get out of this dilemma, for example through cross-validation procedures. I believe this should be encouraged, especially whenever a claim about the predictive value of a hypothesis is made. More generally, replication attempts are of course a way to test predictions from previous results. Again, we should probably encourage more of that, and ideally cross-validation and replication can be combined. Nevertheless, the somewhat circular nature of reporting observed effect sizes isn’t necessarily a major problem provided we keep in mind what an effect size estimate can tell us and what it can’t.
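To make the cross-validation idea concrete, here is a minimal sketch in Python (all numbers are invented and this is not any particular study’s analysis): estimate a group difference in a “discovery” half of the data and then check whether it holds up in a held-out half.

```python
# Minimal sketch (invented data): estimate an effect in a "discovery" half
# and check the prediction in a held-out "validation" half.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
males = rng.normal(0.3, 1.0, size=80)    # hypothetical cognitive scores
females = rng.normal(0.0, 1.0, size=80)  # true difference = 0.3 SD

m_disc, m_val = males[:40], males[40:]
f_disc, f_val = females[:40], females[40:]

def cohens_d(a, b):
    """Standardised mean difference using the pooled SD."""
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled

print("discovery d: ", cohens_d(m_disc, f_disc))
print("validation d:", cohens_d(m_val, f_val))
print("validation test:", stats.ttest_ind(m_val, f_val))
```

The discovery-half estimate is free to capitalise on chance; the validation half is not, which is exactly what makes it a (modest) test of the prediction.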

The same applies to the tests employed by meta-science. These procedures take an effect size estimate from the scientific literature, calculate the probability that a significant effect of this size would have been observed under the study’s conditions (the statistical power), and then make inferences based on this posthoc probability. The assumptions on which these procedures rest remain entirely untested. Insofar as they make predictions at all, such as whether an effect is likely to be replicated in future experiments, no effort is typically made to test them. Statistical probabilities are not a sufficient replacement for empirical tests. You can show me careful, mathematically coherent arguments as to why some probability should be such and such – if the equation is based on flawed assumptions and/or it doesn’t take into account some confounds, the resulting conclusions may be untenable. This doesn’t necessarily mean that the procedure is worthless. It is simply like all other science. It constructs an explanation for the chaotic world out there that may or may not be adequate. It can never be a perfect explanation and we should not treat it as if it were the unadulterated truth. This is really my main gripe with meta-science: proponents of these procedures treat them as if they were unshakable fact, and the conviction with which some promote these methods borders on religious zeal.

One example is the assertion that a lot of scientific findings are false positives. This argument is based on the premise that many published experiments are underpowered and that publication bias (which we know exists because researchers actively seek positive results) means that mainly positive findings are reported. In turn this may explain what some have called the “Decline Effect”, that is, initial effect size estimates are inflated and they gradually decrease and approach the true effect size as more and more data are collected.
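It is easy to see where this premise comes from. Here is a minimal simulation sketch (invented effect and sample sizes, not any published analysis) of how publishing mainly significant results inflates effect size estimates – later, unselected estimates then “decline” back towards the truth.

```python
# Minimal sketch (invented numbers): many underpowered studies of a small true
# effect; the "published" (significant) subset overestimates the effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_d, n, n_studies = 0.2, 20, 5000     # small true effect, 20 subjects per group

observed_d, significant = [], []
for _ in range(n_studies):
    a = rng.normal(true_d, 1.0, n)       # experimental group
    b = rng.normal(0.0, 1.0, n)          # control group
    t, p = stats.ttest_ind(a, b)
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    observed_d.append((a.mean() - b.mean()) / pooled)
    significant.append(p < 0.05)

observed_d = np.array(observed_d)
significant = np.array(significant)
print("true d:", true_d)
print("mean d, all studies:         ", observed_d.mean())
print("mean d, 'published' (p<.05): ", observed_d[significant].mean())
```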

I don’t deny that lack of power and publication bias probably exist. However, I also think that current explanations are insufficient for explaining all the data. There are other reasons that may cause a reduction of effect size estimates as time goes on and more and more attempts at replication are made. Few of them are ever formally taken into account by meta-science models, partly because they are notoriously difficult to quantify. For instance, there is the question of whether all experiments are comparable. Even with identical procedures carried out by meticulous researchers, the quality of reagents, of experimental subjects, or generally the data we measure can differ markedly. I think this may be a particularly bad problem in psychology and cognitive neuroscience research although it probably exists in many areas of science. I call this the Data Quality Decay Function:

[Image: SubjectQuality]

Take for example the reliability and quality of data that can be expected from research subjects. In the early days after the experiment was conceived we test people drawn from subject pools of reliable research subjects. If it is a psychophysical study on visual perception, chances are that the subjects are authors on the paper or at least colleagues who have considerable experience with doing experiments. The data reaped from such subjects will be clean, low noise estimates of the true effect size. The cynical might call these kinds of subjects “representative” and possibly even “naive”, provided they didn’t co-author the paper at least.

As you add more and more subjects the recruitment pool will inevitably widen. At first there will be motivated individuals who don’t mind sitting in dark rooms staring madly at a tiny dot while images are flashed up in their peripheral vision. Even though the typical laboratory conditions of experiments are a long way from normal everyday behavior, people like this will be engaged and willing enough to perform reasonably on the experiment but they may fatigue more easily than trained lab members. Moreover, sooner or later your subject pool will encompass subjects with low motivation. There will be those who only take part in your study because of the money they get paid or (even worse) because they are coerced by course credit requirements. There may even be professional subjects who participate in several experiments within the same day. You can try to control this but it won’t be fool-proof, because in the end you’ll have to take them at their word even if you ask them whether they have participated in other experiments. And to be honest, professional subjects may be more reliable than inexperienced ones so it can be worthwhile to test them. Also, frequently you just don’t have the luxury to turn away subjects. I don’t know about your department but people weren’t exactly kicking in the doors of any lab I have seen just to participate in experiments.

Eventually, you will end up testing random folk off the street. This is what you will want if you are actually interested in generalizing your findings to the human condition. Ideally, you will test the effect in a large, diverse, multicultural, multiethnic, multiracial sample that encompasses the full variance of our species (this very rarely happens). You may even try to relax the strict laboratory conditions of earlier studies. In fact you’ll probably be forced to because Mongolian nomads or Amazonian tribeswomen, or whoever else your subject population may be, just don’t tend to hang around psychology departments in Western cities. The effect size estimate under these conditions will almost inevitably be smaller than the estimates from the original experiments because of the reduced signal-to-noise ratio. Even if the true biological effect is constant across humanity, the variance will be greater.
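A minimal sketch of this point (all numbers invented): hold the raw effect constant, let the measurement noise grow as the sample becomes more heterogeneous, and watch the standardised effect size – and with it the power of a fixed-size experiment – shrink.

```python
# Minimal sketch (invented numbers) of the Data Quality Decay idea: same raw
# effect, increasing noise, shrinking standardised effect size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
raw_effect, n = 5.0, 30                   # e.g. a 5-unit difference, 30 per group

for noise_sd in (5, 10, 20, 40):          # trained observers -> random folk off the street
    a = rng.normal(raw_effect, noise_sd, n)
    b = rng.normal(0.0, noise_sd, n)
    d = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    t, p = stats.ttest_ind(a, b)
    print(f"noise SD {noise_sd:>2}: Cohen's d = {d:5.2f}, p = {p:.3f}")
```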

This last point highlights why it isn’t so straightforward to say “But I want my findings to generalize so the later estimate reflects the truth more accurately”. It really depends on what your research question is. If you want to measure an effect and make general predictions as to what this means for other human beings, then yes, you should test as wide a sample as possible and understand why any meaningful effect size is likely to be small. Say, for instance, we are testing the efficacy of a new drug and characterize its adverse effects. Such experiments should be carried out on a wide-ranging sample to understand how differences between populations or individual background can account for side effects and whether the drug is even effective. You shouldn’t test a drug only on White Finnish men only to find out later that it is wholly useless or even positively dangerous in Black Caribbean women. This is not just a silly example – this sort of sampling bias can be a serious concern.

On the other hand, when you are testing a basic function of the perceptual system in the human brain, testing a broad section of the human species is probably not the wisest course of action. My confidence in psychophysical results produced by experienced observers, even if they are certifiably non-naive and anything-but-blind to the purpose of the experiment (say, because they are the lead author of the paper and coded the experimental protocol), can still be far greater than it would be for the same measurements from individuals recruited from the general population. There are myriad factors influencing the latter that are much more tightly controlled in the former. Apart from issues with fatigue and practice with the experimental setting, they also may simply not really know what to look for. If you cannot produce an accurate report of your perceptual experience, you aren’t going to produce an accurate measurement of it.

Now this is one specific example and it obviously does not have to apply to all cases. I am pretty confident that the Data Quality Decay Function exists. It occurs for research subjects but it could also relate to reagents that are being reused, small errors in a protocol that accumulate over time, and many other factors. In many situations the slope of the curve may be so shallow that the decay is effectively non-existent. There are also likely to be other factors that may counteract and, in some cases, invert the function. For instance, if the follow-up experiments actually improve the methodology of an experiment the data quality might even be enhanced. This is certainly the hope we have for science in general – but this development may take a very long time.

The point is, we don’t really know very much about anything. We don’t know how data quality, and thus effect size estimates, vary with time, across samples, between different experimenters, and so forth. What we do know is that under the most common assumptions (e.g. Gaussian errors of equal magnitude across groups) the sample sizes we can realistically use are insufficient for reliable effect size estimates. The main implication of the Data Quality Decay Function is that the effect size estimates under standard assumptions are probably smaller than the true effect.

While I am quite a stubborn lady, as I said earlier, I am not so stubborn as to think this is the sole explanation. We know publication bias exists and so it is almost inevitable that it affects effect sizes in the literature. I also think that even if some of the procedures used to infer false positive rates and publication bias are based on untested assumptions and on logically flawed posthoc probabilities, they reveal some truths. All meta-science is wrong – but that doesn’t make it wholly worthless. I just believe we should take it with a grain of salt and treat it like all other science. In the long run, meta-science will self-correct.

[Image: Sometimes when you’re in a local minimum the view is just better]

Self-correction is a fact

This brings me to the other point of today’s post, the claim that self-correction in science is a myth. I argue that self-correction is inherent to the scientific process itself. All the arguments against self-correction I have heard are based on another logical fallacy. People may say that the damn long time it took the scientific community to move beyond errors like phrenology or racial theories demonstrates that science does not by itself correct its mistakes. They suggest that because particular tools, e.g. peer review or replication, have failed to detect serious errors or even fraudulent results, science itself does not weed out such issues.

The logical flaw here is that all of these things are effectively point estimates from a slow time series. It is the same misconception that leads people to deny that global temperatures are rising because we have had some particularly cold winters in some specific years in fairly specific corners of the Earth, or the error that leads creationists to claim that evolution has never been observed directly. It is why previous generations of scientists found it so hard to accept the thought that the surface of the Earth comprises tectonic plates drifting over a ductile mantle. Fortunately, science has already self-corrected that latter misconception and seismology and plate tectonics are widely accepted well beyond the scientific community. Sadly, evolution and climate change have not arrived at the same level of mainstream acceptance.

It seems somewhat ironic that we as scientists should find it so difficult to understand that science is a gradual, slow process. After all we are all aware of evolutionary, geological, and astronomical time scales. However, in the end scientists are human and thus subject to the same perceptual limits and cognitive illusions as the rest of our species. We may get a bit of an advantage compared to other people who simply never need to think about similar spatial and temporal dimensions. But in the end, our minds aren’t any better equipped to fathom the enormity and age of the cosmos than anybody else’s.

Science is self-correcting because that is what science does. It is the constant drive to seek better answers to the same questions and to derive new questions that can provide even better answers. If the old paradigms are no longer satisfactory, they are abandoned. It can and it does happen all the time. Massive paradigm shifts may not be very frequent but that doesn’t mean they don’t happen. As I said last time, science does of course make mistakes and these mistakes can prevail for centuries. To use my model-fitting analogy again, one would say that the algorithm gets stuck in a “local minimum”. It can take a lot of energy to get out of that but given enough time and resources it will happen. It could be a bright spark of genius that overthrows accepted theories. It could be that the explanatory construct of the status quo becomes so overloaded that it collapses like a house of cards. Or sometimes it may simply be a new technology or method that allows us to see things more clearly than before. Sometimes dogmatic, political, religious, or other social pressure can delay progress, for example, for a long time your hope of being taken seriously as a woman scientist was practically nil. In that case, what it takes to move science forward may be some fundamental change to our whole society.

Either way, bemoaning the fact that replication and skeptical scrutiny haven’t solved all problems and managed to rectify every erroneous assumption and refute every false result is utterly pointless. Sure, we can take steps to ensure that the number of false positives is reduced but don’t go so far as to make it close to impossible to detect new important results. Don’t make the importance of a finding dependent on it being replicated hundreds of times first. We need replication for results to stand the test of time but scientists will always try to replicate potentially important findings. If nobody can be bothered to replicate something, it may not be all that useful – at least at the time. Chances are that in 50 or 100 or 1000 years the result will be rediscovered and prove to be critical and then our descendants will be glad that we published it.

By all means, change the way scientists are evaluated and how grants are awarded. I’ve said it before but I’ll happily repeat it. Immediate impact should not be the only yardstick by which to measure science. Writing grant proposals as catalogs of hypotheses when some of the work is inevitably exploratory in nature seems misguided to me. And I am certainly not opposed to improving our statistical practice, ensuring higher powered experiments, and encouraging strategies for more replication and cross-validation approaches.

However, the best and most important thing we can do to strengthen the self-correcting forces of science is to increase funding for research, to fight dogma wherever it may fester, and to train more critical and creative thinkers.

“In science it often happens that scientists say, ‘You know that’s a really good argument; my position is mistaken,’ and then they would actually change their minds and you never hear that old view from them again. They really do it. It doesn’t happen as often as it should, because scientists are human and change is sometimes painful. But it happens every day. I cannot recall the last time something like that happened in politics or religion.”

Carl Sagan

Why all research findings are false

(Disclaimer: For those who have not seen this blog before, I must again point out that the views expressed here are those of the demonic Devil’s Neuroscientist, not those of the poor hapless Sam Schwarzkopf whose body I am possessing. We may occasionally agree on some things but we disagree on many more. So if you disagree with me feel free to discuss with me on this blog but please leave him alone)

In my previous post I discussed the proposal that all¹ research studies should be preregistered. This is perhaps one of the most contentious ideas being pushed as a remedy for what ails modern science. There are of course others, such as the push for “open science”, that is, demands for free access to all publications, transparent post-publication review, and sharing of all data collected for experiments. This debate has even become entangled with age-old faith wars about statistical schools of thought. Some of these ideas (like preregistration or whether reviews should be anonymous) remain controversial and polarizing, while others (like open access to studies) are so contagious that they have become almost universally accepted, to the point that disagreeing with such well-meaning notions makes you feel like you have the plague. On this blog I will probably discuss each of these ideas at some point. However, today I want to talk about a more general point that I find ultimately more important, because this entire debate is just a symptom of a larger misconception:

Science is not sick. It never has been. Science is how we can reveal the secrets of the universe. It is a slow, iterative, arduous process. It makes mistakes but it is self-correcting. That doesn’t mean that the mistakes don’t sometimes stick around for centuries. Sometimes it takes new technologies, discoveries, or theories (all of which are of course themselves part of science) to make progress. Fundamental laws of nature will perhaps keep us from ever discovering certain things, say, what happens when you approach the speed of light, leaving them for theoretical consideration only. But however severe the errors, provided our species doesn’t become extinct through cataclysmic cosmic events or self-inflicted destruction, science has the potential to correct them.

Also science never proves anything. You may read in the popular media about how scientists “discovered” this or that, how they’ve shown certain things, or how certain things we believe turn out to be untrue. But this is just common parlance for describing what scientists actually do: they formulate hypotheses, try to test them by experiments, interpret their observations, and use them to come up with better hypotheses. Actually, and quite relevant to the discussion about preregistration, this process frequently doesn’t start with the formulation of hypotheses but with making chance observations. So a more succinct description of a scientist’s work is this: we observe the world and try to explain it.

Science as model fitting

In essence, science is just a model-fitting algorithm. It starts with noisy, seemingly chaotic observations (the black dots in the figures below) and it attempts to come up with a model that can explain how these observations came about (the solid curves). A good model can then make predictions as to how future observations will turn out. The numbers above the three panels in this figure indicate the goodness-of-fit, that is, how good an explanation the model is for the observed data. Numbers closer to 1 denote better model fits.

[Image: CurveFitting]

It should be immediately clear that the model in the right panel is a much better description of the relationship between data points on the two axes than the models in the other panels. However, it is also a lot more complex. In many ways, the simpler fits in the left and middle panels are much better models because they will allow us to make predictions that are far more likely to be accurate. In contrast, for the model in the right panel, we can’t even say what the curve will look like if we move beyond 30 on the horizontal axis.

One of the key principles in the scientific method is the principle of parsimony, also often called Occam’s Razor. It basically states that whenever you have several possible explanations for something, the simplest one is probably correct (it doesn’t really say it that way but that’s the folk version and it serves us just fine here). Of course we should weigh the simplicity of an explanation against its explanatory or predictive power. The goodness-of-fit of the middle panel is better than that of the left panel, although not by much. Nevertheless, it isn’t that much more complex than the simple linear relationship shown in the left panel. So we could perhaps accept the middle panel as our best explanation – for now.

The truth though is that we can never be sure what the true underlying explanation is. We can only collect more data and see how well our currently favored models do in predicting them. Sooner or later we will find that one of the models is just doing better than all the others. In the figure below the models fitted to the previous observations are shown as red curves while the black dots are new observations. It should have become quite obvious that the complex model in the right panel is a poor explanation for the data. The goodness-of-fit on these new observations for this model is now much poorer than for the other two. This is because this complex model was actually overfitting the data. It tried to come up with the best possible explanation for every observation instead of weighing explanatory power against simplicity. This is probably kind of what is going on in the heads of conspiracy theorists. It is the attempt to make sense of a chaotic world without taking a step back to think whether there might not be simpler explanations and whether our theory can make testable predictions. However, as extreme as this case may look, scientists are not immune from making such errors either. Scientists are after all human.

[Image: CrossValidation]

I will end the model fitting analogy here. Suffice it to say that with sufficient data it should become clear that the curve in the middle panel is the best-fitting of the three options. However, it is actually also wrong. Not only is the function used to model the data not the one that was actually used to generate the observations, but the model also cannot really predict the noise, the random variability spoiling our otherwise beautiful predictions. Even in the best-fitting case the noise prevents us from predicting future observations perfectly. The ideal model would not only need to describe the relationship between data points on the horizontal and vertical axes but it would have to be able to predict that random fluctuation added on top of it. This is unfeasible and presumably impossible without a perfect knowledge of the state of everything in the universe from the nanoscopic to the astronomical scale. If we tried this, the result would most likely look like the overfitted example in the right panel. Therefore this unexplainable variance will always remain in any scientific finding.
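For readers who want to play with the analogy themselves, here is a minimal sketch in Python (the generating function, noise level, and polynomial degrees are invented and are not those used for the figures): fit a line, a quadratic, and an absurdly flexible polynomial to one set of noisy observations, then score all three on a fresh set.

```python
# Minimal sketch (invented data) of fitting vs. overfitting: complex models can
# fit the original observations beautifully and still fail on new observations.
import numpy as np

rng = np.random.default_rng(0)

def r_squared(y, yhat):
    """Simple goodness-of-fit: 1 = perfect, lower (even negative) = worse."""
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

def simulate(n=20):
    x = np.sort(rng.uniform(0, 30, n))
    y = 0.5 * x + 2 * np.sin(x / 3) + rng.normal(0, 2, n)   # hidden "truth" plus noise
    return x, y

x_old, y_old = simulate()    # the original observations
x_new, y_new = simulate()    # new observations collected later

for degree in (1, 2, 10):    # simple, moderately complex, absurdly complex
    coefs = np.polyfit(x_old, y_old, degree)
    fit_old = r_squared(y_old, np.polyval(coefs, x_old))
    fit_new = r_squared(y_new, np.polyval(coefs, x_new))
    print(f"degree {degree:>2}: fit to old data = {fit_old:.2f}, fit to new data = {fit_new:.2f}")
```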

[Image: A scientist will keep swimming to find what lies beyond that horizon]

Science is always wrong

This analogy highlights why the fear of incorrect conclusions and false positives that has germinated in recent scientific discourse is irrational and misguided. I may have many crises but reproducibility isn’t one of them. Science is always wrong. It is doomed to always chase a deeper truth without any hope of ever reaching it. This may sound bleak but it truly isn’t. Being wrong is inherent to the process. This is what makes science exciting. These ventures into the unknown drive most scientists, which is why many of us actually like the thought of getting up in the morning and going to work, why we stay late in the evening trying to solve problems instead of doing something more immediately meaningful, and why we put up with pitifully low salaries compared to our former classmates who ended up getting “real jobs”. It is also the same daring and curiosity that drove our ancestors to invent tools, discover fire, and to cross unforgiving oceans in tiny boats made out of tree trunks. Science is an example of the highest endeavors the human spirit is capable of (it is not the only one but this topic is outside the scope of this blog). If I wanted unwavering certainty that I know the truth of the world, I’d have become a religious leader, not a scientist.

Now one of the self-declared healers of our “ailing” science will doubtless interject that nobody disagrees with me on this, that I am just being philosophical, or playing with semantics. Shouldn’t we guarantee, or so they will argue, that research findings are as accurate and true as they can possibly be? Surely, the fact that many primary studies, in particular those in high profile journals, are notoriously underpowered is cause for concern? Isn’t publication bias, the fact that mostly significant findings are published while null findings are not, the biggest problem for the scientific community? It basically means that we can’t trust a body of evidence because even in the best-case scenario the strength of evidence is probably inflated.

The Devil’s Neuroscientist may be evil and stubborn but she² isn’t entirely ignorant. I am not denying that some of the issues are problematic. But fortunately the scientific method already comes with a natural resistance, if not a perfect immunity, against these issues: skepticism and replication. Scientists use them all the time. Those who have not quite managed to wrap their heads around the fact that I am not my alter ego, Sam Schwarzkopf, will say that I sound like a broken record³. While Sam and my humble self don’t see eye to eye on everything we probably agree on these points as he has repeatedly written about this in recent months. So as a servant of the devil, perhaps I sound like a demonic Beatles record: noitacilper dna msicitpeks.

There are a lot of myths about replication and reproducibility and I will write an in-depth post about that at a future point. Briefly though let me stress that, evil as I may be, I believe that replication is a cornerstone of scientific research. Replication is the most trustworthy test for any scientific claim. If a result is irreplicable, perhaps because the experiment was just a once-in-an-age opportunity, because it would be too expensive to do twice, or for whatever other reason, then it may be interesting but it is barely more than an anecdote. At the very least we should expect pretty compelling evidence for any claims made about it.

Luckily, for most scientific discoveries this is not the case. We have the liberty and the resources to repeat experiments, with or without systematic changes, to understand the factors that govern them. We should and can replicate our own findings. We can and should replicate other people’s findings. The more we do of this the better. This doesn’t mean we need to go on a big replication rampage like the “Many Labs” projects. Not that I have anything against this sort of thing if people want to spend their time in this way. I think for a lot of results this is probably a waste of time and resources. Rather I believe we should encourage a natural climate of replication and I think it already exists although it can be enhanced. But as I said, I will specifically discuss replication in a future post so I will leave this here.

Instead let me focus on the other defense we have at our disposal. Skepticism is our best weapon against fluke results. You should never take anything you read in a scientific study at face value. If there is one thing every scientist should learn it is this. In writing, scientific results look more convincing and “cleaner” than they are when you’re in the middle of experiments and data analysis. And even for those (rare?) studies with striking data, insurmountable statistics, and the most compelling intellectual arguments you should always ask “Could there be any other explanation for this?” and “What hypothesis does this finding actually disprove?” The latter question underlines a crucial point. While I said that science never proves anything, it does disprove things all the time. This is what we should be doing more of and we should probably start with our own work. Certainly, if a hypothesis isn’t falsifiable it is pretty meaningless to science. Perhaps a more realistic approach was advocated by Platt in his essay “Strong Inference”. Instead of testing whether one hypothesis is true we should pit two or more competing hypotheses against each other. In psychology and neuroscience research this is actually not always easy to do. Yet in my mind it is precisely the approach that some of the best studies in our field take. Doing this immunizes you from the infectiousness of dogmatic thinking because you no longer feel the need to prove your little pet theory and you don’t run control experiments simply to rule out trivial alternatives. But admittedly this is often very difficult because typically one of the hypotheses is probably more exciting…

The point is, we should foster a climate where replication and skepticism are commonplace. We need to teach self-critical thinking and reward it. We should encourage adversarial collaborative replication efforts and the use of multiple hypotheses wherever possible. Above all we need to make people understand that criticism in science is not a bad thing but essential. Perhaps part of this involves training some basic people skills. It should be possible to display healthy, constructive skepticism without being rude and aggressive. Most people have stories to tell of offensive and irritating colleagues and science feuds. However, at least in my alter ego’s experience, most scientific disagreements are actually polite and constructive. Of course there are always exceptions: reviewer 2 we should probably just shoot into outer space.

What we should not do is listen to some delusional proposals about how to evaluate individual researchers, or even larger communities, by the replicability and other assessments of the truthiness of their results. Scientists must accept that we are ourselves mostly wrong about everything. Sometimes the biggest impact, insofar as that can be quantified, is not made by the person who finds the “truest” finding but by whoever lays the groundwork for future researchers. Even a completely erroneous theory can give some bright mind the inspiration for a better one. And even the brightest minds go down the garden path sometimes. Johannes Kepler searched for a beautiful geometry of the motion of celestial bodies that simply doesn’t exist. That doesn’t make it worthless as his work was instrumental for future researchers. Isaac Newton wasted years of his life dabbling in alchemy. And even on the things he got “right”, describing the laws governing motion and gravity, he was also really kind of wrong because his laws only describe a special case. Does anyone truly believe that these guys didn’t make fundamental contributions to science regardless of what they may have erred on?

[Image: May all your pilot experiments soar over the clouds like this, not crash and burn in misery]

Improbability theory

Before I leave you all in peace (until the next post anyway), I want to make some remarks about some of the more concrete warnings about the state of research in our field. A lot of words are oozing out of the orifices in certain corners about the epidemic of underpowered studies and the associated spread of false positives in the scientific literature. Some people put real effort into applying statistical procedures to whole hosts of published results to reveal the existence of publication bias or “questionable research practices”. The logic behind these tests is that the aggregate power over a series of experiments is often so low that it is very improbable that every experiment in the series would yield a statistically significant effect. Apparently, this test flags up an overwhelming proportion of studies in some journals as questionable.

I fail to see the point of this. First of all, what good will come from naming and shaming studies/researchers who apparently engaged in some dubious data massaging, especially when, as we are often told, these problems are widespread? One major assertion that is then typically made is that the researchers ran more experiments than they reported in the publication but that they chose to withhold the non-significant results. While I have no doubt that this does in fact happen occasionally, I believe it is actually pretty rare. Perhaps it is because Sam, whose experiences I share, works in neuroimaging where it would be pretty damn expensive (both in terms of money and time investment) to run lots of experiments and only publish the significant or interesting ones. Then again, he certainly has heard of published fMRI studies where a whopping number of subjects were excluded for no good reason. So some of that probably does exist. However, he was also trained by his mentors to believe that all properly executed science should be published and this is the philosophy by which he is trying to conduct his own research. So unless he is somehow rare in this, or behavioral/social psychology research (about which claims of publication bias are made most often) is for some reason much worse than other fields, I don’t think unreported experiments are an enormous problem.

What instead might cause “publication bias” is the tinkering that people sometimes do in order to optimize their experiments and/or maximize the effects they want to measure. This process is typically referred to as “piloting” (not sure why really – what does this have to do with flying a plane?). It is again highly relevant to our previous discussion of preregistration. This is perhaps the point where preregistration of an experimental protocol might have its use: First do lots of tinker-explore-piloting to optimize the ways to address an experimental question. Then preregister this optimized protocol to do a real study to answer the question but strictly follow the protocol. Of course, as I argued last time, instead you could just publish the tinkered experiments and then you or someone else can try to replicate using the previously published protocol. If you want to preregister those efforts, be my guest. I am just not convinced it is necessary or even particularly helpful.

Thus part of the natural scientific process will inevitably lead to what appears like publication bias. I think this is still pretty rare in neuroimaging studies at least. Another nugget of wisdom about imaging that Sam has learned from his teachers, and which he is trying to impart to his own students, is that in neuroimaging you can’t just constantly fiddle with your experimental paradigm. If you do so you will not only run out of money pretty quickly but also end up with lots of useless data that cannot be combined in any meaningful way. Again, I am sure some of these things happen (maybe some people are just really unscrupulous about combining data that really don’t belong together) but I doubt that this is extremely common.

So perhaps the most likely inflation of effect sizes in a lot of research stems from questionable research practices often called “p-hacking”, for example trying different forms of outlier removal or different analysis pipelines and only reporting the one producing the most significant results. As I discussed previously, preregistration aims to control for this by forcing people to be upfront about which procedures they planned to use all along. However, a simpler alternative is to ask authors to demonstrate the robustness of their findings across a reasonable range of procedural options. This achieves the same thing without requiring the large structural change of implementing a preregistration system.
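As a minimal sketch of what such a robustness report could look like (invented data and arbitrary cut-offs, not a recommendation of these particular values), one could simply run the same test under every defensible outlier rule and report them all:

```python
# Minimal sketch (invented data): report the same test under several
# reasonable outlier-removal rules instead of only the most flattering one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(0.4, 1.0, 30)
b = rng.normal(0.0, 1.0, 30)

def trim(x, z_cutoff):
    """Drop values more than z_cutoff SDs from the group mean (None = keep all)."""
    if z_cutoff is None:
        return x
    z = (x - x.mean()) / x.std(ddof=1)
    return x[np.abs(z) < z_cutoff]

for cutoff in (None, 3.0, 2.5, 2.0):
    t, p = stats.ttest_ind(trim(a, cutoff), trim(b, cutoff))
    label = "no trimming" if cutoff is None else f"|z| < {cutoff}"
    print(f"{label:<12}: t = {t:5.2f}, p = {p:.3f}")
```

If the conclusion only survives under one of these pipelines, readers should know that.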

However, while I believe some of the claims about inflated effect sizes in the literature are most likely true, I think there is a more nefarious problem with the statistical approach to inferring such biases. It lies in its very nature, namely that it is based on statistics. Statistical tests are about probabilities. They don’t constitute proof. Just like science at large, statistics never prove anything, except perhaps for the rare situations where something is either impossible or certain – which typically renders statistical tests redundant.

There are also some fundamental errors in the rationale behind some of these procedures. To make an inference about the power of an experiment based on the strength of the observed result is to incorrectly assign a probability to an event after it has occurred. The probability of an observed event occurring is 1 – it is completely irrelevant how unlikely it was a priori. Proponents of this approach try to weasel out of this conundrum by assuming that the true effect size is of a similar magnitude as the one observed in the published experiment and using this to compute the assumed power of the experiment. This assumption is untenable because the true effect size is almost certainly not the one that was observed. There is a lot more to be said about this state of affairs but I won’t go into it because others have already summarized many of the arguments much better than I could.
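To see the logic being criticised in the flesh, here is a generic “observed power” calculation (a sketch of the general idea, not the specific published procedures; the sample size and effect sizes are invented): plug the observed effect size back in as if it were the true effect and ask how likely a significant result was.

```python
# Minimal sketch of "observed" (posthoc) power for a two-sample t-test,
# computed as if the observed effect size d_obs were the true effect size.
import numpy as np
from scipy import stats

def observed_power_two_sample(d_obs, n_per_group, alpha=0.05):
    df = 2 * n_per_group - 2
    nc = d_obs * np.sqrt(n_per_group / 2)        # noncentrality if d_obs were true
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return (1 - stats.nct.cdf(t_crit, df, nc)) + stats.nct.cdf(-t_crit, df, nc)

# "Observed power" is just a transformation of the test statistic: a result
# sitting exactly at the significance threshold works out to roughly 50%.
for d_obs in (0.52, 0.65, 0.80):                 # hypothetical observed effect sizes
    print(d_obs, round(observed_power_two_sample(d_obs, n_per_group=30), 2))
```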

In general I simply wonder how good statistical procedures actually are at estimating true underlying effects in practice. Simulations are no doubt necessary to evaluate a statistical method because we can work with known ground truths. However, they can only ever be approximations to real situations encountered in experimental research. While the statistical procedures for publication bias probably seem to make sense in simulations, their true experimental validity actually remains completely untested. In essence, they are just bad science because they aim to show an effect without a control condition, which is really quite ironic. The very least I would expect to see from these efforts is some proof that these methods actually work for real data. Say we set up a series of 10 experiments for an effect we can be fairly confident actually exists, for example the Stroop effect or the fact that visual search performance for a feature singleton is independent of set size while searching for a conjunction of features is not. Will all or most of these 10 experiments come out significant? And if so, will the “excess significance test” detect publication bias?
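Here is a minimal sketch of the kind of check I have in mind (a toy simulation with invented effect and sample sizes, and only a simplified stand-in for the actual excess significance machinery): simulate a series of ten experiments on an effect that definitely exists and count how often all ten come out significant.

```python
# Minimal sketch (invented numbers): even for a real, decently powered effect,
# a series of 10 experiments will rarely ALL reach p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_d, n, n_experiments, n_series = 0.8, 25, 10, 2000   # robust effect, modest samples

all_significant = 0
for _ in range(n_series):
    sig = 0
    for _ in range(n_experiments):
        a = rng.normal(true_d, 1.0, n)
        b = rng.normal(0.0, 1.0, n)
        sig += stats.ttest_ind(a, b).pvalue < 0.05
    all_significant += (sig == n_experiments)

print("proportion of 10-experiment series that are all significant:",
      all_significant / n_series)
```

Whether that outcome should count as evidence of bias in a published 10-for-10 series, or merely as a reminder of how noisy individual experiments are, is precisely what such an empirical test would need to settle.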

Whatever the outcome of such experiments on these tests, one thing I already know: any procedure that claims to find evidence that over four of five published studies should not be believed is not to be believed. While we can’t really draw firm conclusions from this, the fact that this rate is the same in two different applications of this procedure certainly seems suspicious to me. Either it is not working as advertised or it is detecting something trivial we should already know. In any case, it is completely superfluous.

I also want to question a more fundamental problem with this line of thinking. Most of these procedures and demonstrations of how horribly underpowered scientific research is seem to make a very sweeping assumption: that all scientists are generally stupid. Researchers are not automatons that blindly stab in the dark in the hope that they will find a “significant” effect. Usually scientists conduct research to test some hypothesis that is more or less reasonable. Even the most exploratory wild goose chases (and I have certainly heard of some) will make sense at some level. Thus the carefully concocted arguments about the terrible false discovery rates in research probably vastly underestimate the probability that hypothesized effects actually exist, and there is after all “reason to think that half the tests we do in the long run will have genuine effects.”

Naturally, it is hard to put concrete numbers on this. For some avenues of research it will no doubt be lower. Perhaps for many hypotheses tested by high-impact studies the probability may be fairly low, reflecting the high risk and surprise factor of these results. For drug trials the 10% figure may be close to the truth. For certain effects, such as those of precognition or telepathy or homeopathy, I agree with Sam Schwarzkopf, Alex Holcombe, and David Colquhoun (to name but a few) that the probability that they exist is extremely low. But my guess is that in many fields the probability ought to be better than a coin toss that hypothesized effects exist.

[Image: To cure science the Devil’s Neuroscientist prescribes a generous dose of this potion (produced at farms like this one in New Zealand)]

Healthier science

I feel I have sufficiently argued that science isn’t actually sick, so I don’t think we need to rack our brains about possible means to cure it. However, this doesn’t imply we can’t do better. We can certainly aim to keep science healthy or make it even healthier.

So what is to be done? As I have already argued, I believe the most important step we should take is to encourage replication and a polite but critical scrutiny of scientific claims. I also believe that at the root of most of the purported problems with science these days is the way we evaluate impact and how grants are allocated. Few people would say that the number of high impact publications on a resume tells us very much about how good a scientist a person is. Does anyone? I’m sure nobody truly believes that the number of downloads or views or media reports a study receives tells us anything about its contribution to science.

And yet I think we shouldn’t only value those scientists who conduct dry, incremental research. I don’t know what a good measure of a researcher’s contribution to their field would be. Citations are not perfect but they are probably a good place to start. There probably is no good way other than hearsay and personal experience to really know how careful and skilled a particular scientist is in their work.

What I do know is that the replicability of one’s research and the correctness of one’s hypotheses alone aren’t a good measure. The most influential scientists can also be the ones who make some fundamental errors. And there are some brilliant scientists, whose knowledge is far greater than mine (or Sam’s) will ever be and whose meticulousness and attention-to-detail would put most of us to shame – but they can and do still have theories that will turn out to be incorrect.

If we follow down that dead end the Crusaders for True Science have laid out for us, if we trust only preregistered studies and put those who are fortunate (or risk-averse) enough to only do research that ends up being replicated on pedestals, in short, if we only regard “truth” in science, we will emphasize the wrong thing. Then science will really be sick and frail and it will die a slow, agonizing death.

¹ Proponents of preregistration keep reminding us that “nobody” suggests that preregistration should be mandatory or that it should be for all studies. These people I want to ask, what do you think will happen if preregistration becomes commonplace? How would you regard non-registered studies? What kinds of studies should not be preregistered?

² The Devil’s Neuroscientist recently discovered she is a woman but unlike other extra-dimensional entities the Devil’s Neuroscientist is not “whatever it wants to be.”

³ Does anyone still know what a record is? Or perhaps in this day and age they know again?

The Pipedream of Preregistration

(Disclaimer: As this blog is still new, I should reiterate that the opinions presented here are those of the Devil’s Neuroscientist, which do not necessarily overlap with those of my alter ego, Sam Schwarzkopf)

In recent years we have often heard that science is sick. Especially my own field, cognitive neuroscience and psychology, is apparently plagued by questionable research practices and publication bias. Our community abounds with claims that most scientific results are “false” due to lack of statistical power. We are told that “p-hacking” strategies are commonly used to explore the vast parameter space of experiments and analyses in order to squeeze the last drop of statistical significance out of the data. And hushed (and sometimes quite loud) whispers in the hallways of our institutions, in journal club sessions, and at informal chats at conferences tell of many a high impact study that has repeatedly failed to be replicated, while these failed replications vanish into the bottom of the proverbial file drawer.

Many brave souls have taken up the banner of fighting against this horrible state of affairs. There has been a whole spate of replication attempts of high impact research findings, the open access movement aims to make it easier to publish failed replications, and many proposals have been put forth to change the way we make statistical inferences from our data. These are all large topics in themselves and I will probably tackle them in later posts on this blog.

For my first post though I instead want to focus on the preregistration of experimental protocols. This is the proposal that all basic science projects should be preregistered publicly with an outline of the scientific question and the experimental procedures, including the analysis steps. The rationale behind this idea is that questionable research practices, or even just fairly innocent flexibility in procedures (“researcher degrees of freedom”) that could skew results and inflate false positives, will be more easily controlled. The preregistration idea has been making the rounds during the past few years and it is beginning to be implemented, both in the form of open repositories and at some journals. In addition to improving the validity of published research, preregistration is also meant as an assurance that failed replications are published, because acceptance – and publication – of a study does not hinge on how strong and clean the results are but only on whether the protocol was sound and whether it was followed.

These all sound like very noble goals and there is precedent for such preregistration systems in clinical trials. So what is wrong with this notion? Why does this proposal make the Devil’s Neuroscientist anxious?

Well, I believe it is horribly misguided, that it cannot possibly work, and that – in the best-case scenario – it will make no difference to the ills of the scientific community. I think that, well-intentioned as the preregistration idea may be, it actually results from people’s ever-shortening 21st century attention spans: they can’t accept that science is a gradual and iterative process that takes decades, sometimes centuries, to converge on a solution.

Basic science isn’t clinical research

There is a world of difference between the aims of basic scientific exploration and clinical trials. I can get behind the idea that clinical tests, say of new drugs or treatments, ought to be conservative and minimize the false positives in the results. Flexibility in the way data are collected and analyzed, how outliers are treated, how side effects are assessed, and so on, can seriously hamper the underlying goal: finding a treatment that actually works well.

I can even go so far as to accept that a similarly strict standard ought to be applied to preclinical research, say animal drug tests. The Devil’s Neuroscientist may work for the Evil One but she is not without ethics. Any research that is meant to test the validity of an approach that can serve the greater good should probably be held to a strict standard.

However, this does not apply to basic research. Science is the quest to explain how the universe works. Exploration and tinkering is at the heart of this endeavor. In fact, I want to see more of this, not less. In my experience (or rather my alter ego’s experience – the Devil’s Neuroscientist is a mischievous demon possessing Sam’s mind and at the time of writing this she is only a day old – but they share the same memories) it is one of the major learning experiences most graduate students and postdocs go through to analyze their data to death.

By tweaking all the little parameters, turning on every dial, and looking at a problem from numerous angles we can get a handle on how robust and generalizable our findings truly are. In every student’s life sooner or later there comes a point where this behavior leads them to the conclusion that their “results are spurious and don’t really show anything of interest whatsoever.” I know, because I have been to that place (or at least my alter ego has).

This is not a bad thing. On the contrary I believe it is actually essential for good science. Truly solid results will survive even the worst data massaging, to borrow a phrase Sam’s PhD supervisor used to say. It is crucial that researchers really know their data inside out. And it is important to understand the many ways an effect can be made to disappear, and conversely the ways data massaging can lead to “significant” effects that aren’t really there.

Now this last point underlines that data massaging can indeed be used to create false or inflated research findings, and this is what people mean when they talk about researcher degrees of freedom or questionable research practices. You can keep collecting data, peeking at your significance level at every step, and then stop when you have a significant finding (this is known as “optional stopping” or “data peeking”). This approach will quite drastically inflate the false positives in a body of evidence and yet such practice may be common. And there may be (much) worse things out there, like the horror story someone (and I have reason to believe them) told me of a lab where the standard operating mode was to run a permutation analysis by iteratively excluding data points to find the most significant result. I neither know who these people were nor where this lab is. I also don’t know if this practice went on with or without the knowledge of the principal investigator. It is certainly not merely a “questionable” research practice but it has crossed the line into outright fraudulence. As the person who told me of this pointed out, if someone is clever enough to do this, it seems likely that they also know that this is wrong. The only difference between doing this and actually making up your data from thin air and eating all the experimental stimuli (candy) to cover up the evidence is that it actually uses real data – but it might as well not, for all the validity we can expect from it.

But the key thing to remember here is that this is deep inside the realm of the unethical, certainly on some level of scientific hell (and no, even though I work for the Devil this doesn’t mean I wish to see anybody in scientific hell). Preregistration isn’t going to stop fraud. Cheaters gonna cheat. Yes, the better preregistration systems actually require the inclusion of a “lab log” and possibly all the acquired data as part of the completed study. But does anyone believe that this is really going to work to stop a fraudster? Someone who regularly makes use of a computer algorithm to produce the most significant result isn’t going to bat an eyelid at dropping a few data points from their lab log. What is a lab log anyway? Of course we keep records of our experiments, but unless we introduce some (probably infeasible) Orwellian scheme in which every single piece of data is recorded in a transparent, public way (and there have been such proposals), there is very little to stop a fraudster from forging the documentation for their finished study. And you know what, even in that Big Brother world of science a fraudster would find a way to commit fraud.

Most proponents of preregistration know this and are wont to point out that preregistration isn’t meant to stop outright fraud but questionable research practices – the data massaging that isn’t really fraudulent but still inflates false positives. These practices may stem from the pressure to publish high impact studies or may even simply be due to ignorance. I certainly believe that data peeking falls into the latter category, or at least did before it was widely discussed. I think it is also common because it is intuitive. We want to know if a result is meaningful and collect sufficient data to be sure of it.
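For anyone who doubts how much damage this innocent-looking practice can do, here is a minimal simulation sketch (arbitrary sample sizes and step size): keep adding subjects and re-testing until p < .05 or you run out of patience, with no true effect at all.

```python
# Minimal sketch (invented numbers): optional stopping / data peeking inflates
# the false positive rate well above the nominal 5%, even with zero true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def one_peeking_study(n_start=10, n_max=60, step=5, alpha=0.05):
    a = list(rng.normal(0, 1, n_start))   # two groups, true effect = 0
    b = list(rng.normal(0, 1, n_start))
    while True:
        if stats.ttest_ind(a, b).pvalue < alpha:
            return True                   # "significant" -> stop and write it up
        if len(a) >= n_max:
            return False                  # give up
        a.extend(rng.normal(0, 1, step))  # add a few more subjects and peek again
        b.extend(rng.normal(0, 1, step))

n_sim = 2000
false_positives = sum(one_peeking_study() for _ in range(n_sim))
print("false positive rate with peeking:", false_positives / n_sim)
```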

The best remedy against such things is to educate people about what is and isn’t acceptable. It also underlines the importance of deriving better analysis methods that are not susceptible to data peeking. There have been many calls to abandon null hypothesis significance testing altogether. That discussion may be the topic of another post by the Devil’s Neuroscientist in the future, as there are a lot of myths about this point. However, at this point I certainly agree that we can do better and that there are ways – which may or may not be Bayesian – to use a principled stopping criterion to improve the validity of scientific findings.
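
Purely to illustrate the general idea, here is a toy sketch of such a stopping rule based on a Bayes factor. I deliberately use the simplest case I can think of, testing whether a coin is fair against a uniform prior on its bias; the batch size and evidence threshold are arbitrary numbers of my choosing, and this is my own illustration rather than any procedure actually proposed in this debate.

```python
# Toy sketch: a sequential Bayesian stopping rule for a binomial rate.
# H0: theta = 0.5 (fair coin) vs H1: theta ~ Uniform(0, 1).
# Unlike p-value peeking, this rule can stop in favor of EITHER hypothesis.
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(7)

def bf01(heads, n):
    """Bayes factor for H0 over H1 given `heads` successes in `n` flips."""
    log_m0 = n * np.log(0.5)                   # marginal likelihood under the point null
    log_m1 = betaln(heads + 1, n - heads + 1)  # marginal likelihood under the uniform prior
    return np.exp(log_m0 - log_m1)

def run_until_decisive(true_theta, batch=10, threshold=10, n_max=1000):
    """Collect data in batches; stop once the evidence is decisive either way."""
    heads, n, bf = 0, 0, 1.0
    while n < n_max:
        heads += int((rng.random(batch) < true_theta).sum())
        n += batch
        bf = bf01(heads, n)
        if bf > threshold or bf < 1 / threshold:
            break
    return n, bf

print(run_until_decisive(0.5))  # usually stops with evidence FOR the fair coin
print(run_until_decisive(0.7))  # usually stops with evidence AGAINST it
```

The point is simply that with an evidence-based criterion the decision to stop is part of the plan, and the procedure can conclude in favor of the null as well as against it, which is at least part of what makes such schemes attractive here.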

Rather than mainly sticking to a preregistered script, I think we should encourage researchers to explore the robustness of their data by publishing additional analyses. This is what supplementary materials can be good for. If you remove outliers in your main analysis, show what the result looks like without this step, or at least include the data. If you have several different analysis approaches, show them all. The reader can make up their own mind about whether a result is meaningful, and their judgment should only become more accurate the more information there is.
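
As a trivial illustration, such a supplementary robustness check can be as simple as reporting the same test with and without the exclusion step. The data, the z-score cutoff, and the group labels below are invented purely for the sake of the example.

```python
# Minimal sketch: report the same comparison with and without outlier exclusion.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group_a = rng.normal(0.0, 1.0, 40)
group_b = np.append(rng.normal(0.4, 1.0, 38), [4.5, -4.2])  # two extreme values tacked on

def exclude_outliers(x, z_cutoff=2.5):
    """Drop values more than z_cutoff standard deviations from the sample mean."""
    z = (x - x.mean()) / x.std(ddof=1)
    return x[np.abs(z) < z_cutoff]

for label, a, b in [("all data", group_a, group_b),
                    ("outliers removed", exclude_outliers(group_a), exclude_outliers(group_b))]:
    t, p = stats.ttest_ind(a, b)
    print(f"{label}: t = {t:.2f}, p = {p:.4f}, n = {len(a)} vs {len(b)}")
```

If the conclusion only holds in one of the two rows, the reader deserves to know that.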

Most importantly, people should replicate findings and publish those replications. This is a larger problem, and in the past it was difficult to publish replications. This situation has changed a lot, however, and replication attempts are now fairly common even in high-profile journals. No finding should ever be regarded as solid until it has stood the test of time through repeated and independent replication. Preregistration isn’t going to help with that. Making it more rewarding and interesting to publish replications will. There are probably still issues to be resolved regarding replication, although I don’t think the situation is as dire as it is often made out to be (and this will probably be the topic of another post in the future).

Like cats, scientists are naturally curious and always keen to explore the world

Replications don’t require preregistration

The question of replication brings me to the next point. My alter ego has argued in the past that preregistration may be particularly suited to replication attempts. On the surface this seems logical: for a replication we surely want to stick as closely as possible to the original protocol, so it is good to have that defined a priori.

While this is true, as I clawed my way out of the darkest reaches of my alter ego’s mind it dawned on me that for replications the original experimental protocol is already published in the original study. All one needs to do is follow that protocol as closely as possible. Sure, in many publications the level of detail may be insufficient for a replication, which is in part due to the ridiculously low word limits of many journals.

However, the solution to this problem is not preregistration, because preregistration doesn’t guarantee that the replication protocol is a close match to the original. Rather, we must improve the level of detail of our methods sections. They are, after all, meant to permit replication. Fortunately, many journals that were particularly guilty of this problem have taken steps to change it. I don’t even mind if methods are largely published in online supplementary materials. A proper evaluation of a study requires close inspection of the methods, but as long as they are easy to access I prefer detailed online methods to sparse methods sections inside published papers.

Proponents of preregistration also point out that preregistering an experiment with a journal helps the publication of replication attempts because it ensures that the study will be published regardless of the outcome. This is perhaps true, but at present there are not many journals that actually offer preregistration. It also forces the authors’ hands as to where to publish, which will be a turn-off for many people.

Imagine that your study fails to replicate a widely publicized, sensational result. Surely this will be far more interesting to the larger scientific community than if your findings confirm the previous study. Both outcomes are actually of equal importance, and in truth two studies investigating the same effect don’t tell us much more about the actual effect than one study would. However, the authors may want to choose a high-profile journal for one outcome but not for the other (and the same applies to non-replication experiments). Similarly, the editors of a high-impact journal will be more interested in one kind of result than another. My guess is that PNAS was far keener to publish the failed replication of this study than it would have been if the replication had confirmed the previous results.

While we still have the journal-based publication system, or until we find another way to decide where preregistered studies are published, preregistering a study with a journal forces the authors into a relationship with that journal. I predict that this is not going to appeal to many people.

We replicated the experiment two hours later, and found no evidence of this “sun in the sky” (BF01=10^100^100). We conclude that the original finding was spurious.

Ensuring the quality of preregistered protocols

Of course, we don’t need to preregister protocols with a journal; we could instead have a central repository where such protocols are uploaded. In fact, there is already such a place in the Open Science Framework, and, always keen to foster mainstream acceptance, parapsychology has set up a trial registry of its own. This approach would free the authors with regard to where the final study is published, but it comes at a cost: at least at present, these repositories do not formally review the scientific quality of the proposals. At a journal, by contrast, the editors will invite expert reviewers to assess the merits of the proposed research and suggest changes to be implemented before data collection even begins.

Theoretically this is also possible at a centralized repository, but it is currently not done. It would also place a major burden on the peer review system. There is already an enormous number of research manuscripts out there waiting to be reviewed by someone (just ask the editors at the Frontiers journals). Reviewing protocols would probably inflate that workload massively, because it is substantially easier to draft a simple design document for an experiment than to write up a fully fledged study with results, analysis, and interpretation. Incidentally, this is yet another way in which clinical trials differ from basic research: in clinical trials you presumably already have a treatment or drug whose efficacy you want to assess, which limits the number of trials at least somewhat. In basic research all bets are off – you can have as many ideas for experiments as your imagination permits.

So what we will be left with is lots of preregistered experimental protocols that are either reviewed shoddily or not at all. Sam recently reviewed an EEG study claiming to have found neural correlates of telepathy. All the reviews are public, so everyone can read them. The authors of this study actually preregistered their experimental protocol at the Open Science Framework. The protocol was an almost verbatim copy of an earlier pilot study the authors had done, plus some minor changes added in the hope of improving the paradigm. However, the level of detail in the methods was so sparse and obscure that it made assessment of the research, let alone an actual replication attempt, nigh impossible. There were also fundamental flaws in the analysis approach indicating that the results in their entirety, including those from the pilot experiment, were purely artifactual. In other words, the protocol might as well not have been preregistered.

More recently, another study used preregistration (again without formal review) for a replication attempt of a series of structural brain-behavior correlation studies. You can find a balanced and informative summary of the findings at this blog, and at the bottom you will find an extensive discussion, including comments by my alter ego. What these authors did was preregister the experimental protocol by uploading it to the webpage of one of the authors. They also sent the protocol to the authors of the original studies they wanted to replicate to seek their feedback. A minimal response, or none at all, was then taken as tacit agreement that the protocol was appropriate.

The scientific issues of this discussion are outside the scope of this post. Briefly, it turns out that, at least for some of the experiments in this study, the methods are only a modest match to those of the original studies. This should be fairly clear from reading the original methods sections. Whether or not the original authors “agreed” with the replication protocols (and it remains opaque just what that means exactly), there is already a clear departure from the predefined script at the very outset.

It is of course true that a finding should be generalizable to be of importance, and robustness to certain minor variations in the approach should be part of that. For example, Daryl Bem recently argued that the failure to replicate his precognition results was due to the fact that the replicators did not use his own stimulus software. This is not a defensible argument because, to my knowledge, the replication followed the methods outlined in the original study fairly closely. Again, this is what methods sections are for. If the effect can really only be revealed by using the original stimulus software, this at the very least suggests that it doesn’t generalize. Before drawing any further conclusions it is therefore imperative to understand where the difference lies. It could certainly be that the original software is somehow better at revealing the effect, but it could also mean that it has a hidden flaw resulting in an artifact.

The same isn’t necessarily true in the case of these brain-behavior correlations. It could be, but at present we have no way of knowing. The methods of the original studies as published in the literature weren’t adhered to, so it is incorrect to even call this a direct replication. Some of the discrepancies could very well be the reason why the effect disappears in the replication, or they could conversely have introduced a spurious effect in the original studies.

This is where we come back to preregistration. One of the original authors was also a reviewer of this replication study of these brain-behavior correlations. His comments are also included on that blog, and he further elaborates on his reviews in the discussion on that page. He says he proposed additional analyses to the replicators that are actually a closer match to the original methods, but that they refused to conduct these analyses because they are exploratory and thus weren’t part of the preregistered protocol. However, this is odd, because several additional exploratory analyses are included (and clearly labeled as such) in the replication study. Moreover, the original author reports that he ran his suggested analysis on the replication data and that it in fact confirms the original findings. Indeed, another independent successful replication of one of these findings was published but not taken into account by this replication. As such it seems odd that only some of the exploratory methods are included in this replication, and someone more cynical than the Devil’s Neuroscientist (who is already pretty damn cynical) might call that cherry picking.

What this example illustrates is that it is actually not very straightforward to evaluate, let alone improve, a preregistered protocol. First of all, sending out the protocol to (some of) the original authors is not the same as obtaining solid agreement that the methods are appropriate. Moreover, not taking suggestions on board at a later stage, when data collection or analysis has already commenced, hinders good science. Take again my earlier example of the EEG study Sam reviewed. In his assessment the preregistered protocol was fundamentally flawed, resulting in completely spurious findings. The authors revised their manuscript and performed the different analyses Sam suggested, which essentially confirmed that there was no evidence of telepathy (although the authors never quite got to the point of conceding this).

Now, unlike the Devil’s Neuroscientist, Sam is merely a fallible human, and any expert can make mistakes. So you should probably take his opinion with a grain of salt. However, I believe that in this case his view of that experiment is entirely correct (but then again, I’m biased). What this means is that under a strictly adhered-to preregistration system the authors would have to perform the preregistered procedures even though they are completely inadequate. Any improvements, no matter how essential, would have to be presented as “additional exploratory procedures”.

This is perhaps a fairly extreme case, but it is not unrealistic. There may be many situations where a collaborator or a reviewer (assuming the preregistered protocol is public) suggests an improvement over the procedures after data collection has started. In fact, if a reviewer makes this suggestion, I would regard it as unproblematic to alter the design post hoc. After all, preregistration is supposed to stop people from data massaging, not from making independently advised improvements to the methods. Certainly, I would rather see well designed, meticulously executed studies that retain a level of flexibility than a preregistered protocol that is deeply flawed.

The thin red line

Before I conclude this (rather long) first post on my blog, I want to discuss another aspect of preregistration that I predict will be its largest problem. As many proponents of preregistration never tire of stressing, preregistration does not preclude exploratory analyses or even whole exploratory studies. However, as I discussed in the previous sections, there are complications with this. My prediction is that almost all studies will contain a large amount of exploration. In fact, I am confident that the best studies will contain the most exploration because, as I wrote earlier, thorough exploration of the data is natural, useful, and something to be encouraged.

For some studies it may be possible to predict many of the different angles from which to analyze the data. It may also be acceptable to fix small mistakes in the preregistered design post hoc and clearly label these changes. By and large, however, I believe that we will end up with a large number of preregistered protocols whose final publications contain a great deal of additional exploration, and that some of the most scientifically interesting information will usually be found there.

How often have you carried out an experiment only to find that your beautiful hypotheses aren’t confirmed, that the clear predictions you made do not pan out? These are usually the most interesting results because they entice you to dig deeper, to explore your data, and to generate new, better, more exciting hypotheses. Of course, none of this is prevented by preregistration, but in the end the preregistered science is likely to be the least interesting part of the literature.

But there is also another alternative. Perhaps we will enforce preregistration more strictly. Perhaps only a very modest amount of exploration will be permitted after all. Maybe preregistered protocols will have to contain every detailed step of the methods, including the metal screening procedure for MRI experiments, exact measurements of the ambient light level, temperature, and humidity in the behavioral testing room, and the exact words each experimenter will say to each participant before, during, and after the actual experiment, without any room for improvisation (or, dare I say, natural human interaction). It may be that only preregistered studies that closely follow their protocol will be regarded as good science.

This alternative strikes me as a nightmare scenario. Not only will this stifle creativity and slow the already gradual progress of science down to a glacial pace, it will also rob science of the sense of wonder that attracted many of us to this underpaid job in the first place.

The road to hell is paved with good intentions – Wait. Does this mean I should be in favor of it?

Opening Statement

This marks the beginning of my blog defending the indefensible in the field of neuroscience, both in terms of actual scientific discourse and of discussions about the future of our field. You can read my mission statement here. In general, I will argue against the prevailing view, not always because I don’t believe in it but in order to reveal holes in the reasoning and logical fallacies inherent to these ideas. I will do this out of a general joy of disagreeing with the majority, but also because I believe that to be a scientist is to be a skeptic. Therefore, instead of joining the echo chamber of public opinion and patting ourselves on the back over how great our thoughts are, I believe our ideas (and ideals) need to stand up to higher scrutiny.

My first post will probably deal with the proposal to preregister experimental protocols in basic research. Following that, I will scrutinize the scrutiny with which certain topics in (neuro-)science have been met, whilst others are largely ignored. Stay tuned…

This picture has nothing whatsoever to do with this blog