Christmas break

It is Christmas now and my alter ego, Sam, will be strong during this time so I won’t be able to possess his mind. And to be honest even demonic scientists need vacations. I will probably get in trouble with the boss about taking a Christmas vacation but so be it.

I won’t be able to check this blog (not often, anyway) and probably won’t approve new commenters during the break. Discussions may continue in 2015!

Of self-correction and selfless errors

I had originally planned to discuss today’s topic at a later point, perhaps as part of my upcoming post about the myths of replication. However, discussions surrounding my previous posts, as well as the ongoing focus on the use of post-hoc power analysis in the literature, led me to address this point now.

A central theme of my previous two posts was the notion that science is self-correcting and that replication and skepticism are the best tools at our disposal. Discussions my alter ego Sam has had with colleagues, as well as discussions in the comment section on this blog and elsewhere, reveal that many from the ranks of the Crusaders for True Science call that notion into question. In that context, I would like to thank a very “confused” commenter on my blog for referring me to an article I hadn’t read, which is literally entitled “Why science is not necessarily self-correcting?”. I would also like to thank Greg Francis, who commented on his application of statistical tests to detect possible publication bias in the literature. Recently I also became aware of another statistical procedure based on assumptions about statistical power, called the Replication Index, which was proposed as an alternative to the Test of Excess Significance used by Francis. I think these people are genuinely motivated by a selfless desire to improve the current state of science. This is a noble goal but I think it is fraught with errors and some potentially quite dangerous misunderstandings.

The Errors of Meta-Science

I will start with the statistical procedures for detecting publication bias and the assertion that most scientific findings are false positives. I call this entire endeavor “meta-science” because the name underlines the fundamental problem with this whole discussion. As I pointed out in my previous post, science is always wrong. It operates like a model-fitting procedure that gradually improves the explanatory and predictive value of our attempts to understand a complex universe. The point that people are missing in this entire debate about asserted false positive rates, non-reproducibility, and publication bias is that the methods used to make these assertions are themselves science. Thus these procedures suffer from the same problems as any scientific effort: they seek to approximate the truth but can never actually hope to reach it. Because it uses scientific methods to evaluate the workings of the scientific method, the entire logic of this approach is circular.

Circular inference has recently received a bit of attention within neuroscience. I don’t know if the authors of this paper actually coined the term “voodoo correlations”. Perhaps they merely popularized it. The same logical fallacy has also been called “double-dipping”. However, all this is really just circular reasoning and somewhat related to “begging the question”. It is more a problem of flawed logic than of science. Essentially, it is what happens when you use the same measurements to test the validity of your predictions as you did for making the predictions in the first place.

This logical fallacy can result in serious errors. However, in the real world it isn’t entirely avoidable and it isn’t always problematic as long as we are aware of its presence. For instance, a point most people are missing is that whenever they report something like t(31)=5.2, p<0.001, or a goodness-of-fit statistic, they are using circular inference. You are reporting an estimate of the effect size (be it a t-statistic, a goodness-of-fit, Cohen’s d, or another measure) based on the observed data and then drawing some sort of general conclusion from it. The goodness of a curve fit is literally calculated from the accuracy with which the model predicts the observed data. Just observing an effect size, say, a difference in some cognitive measure between males and females, can only tell you that this difference exists in your sample. You can make some probabilistic inferences about how this observed effect may generalize to the larger population, and this is what statistical procedures do – however, in truth you cannot know what an effect means for the general population until you have checked your predictions through empirical observations.

There are ways to get out of this dilemma, for example through cross-validation procedures. I believe this should be encouraged, especially whenever a claim about the predictive value of a hypothesis is made. More generally, replication attempts are of course a way to test predictions from previous results. Again, we should probably encourage more of that and cross-validation and replication can ideally be combined. Nevertheless, the somewhat circular nature of reporting observed effect sizes isn’t necessarily a major problem provided we keep in mind what an effect size estimate can tell us and what it can’t.
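To make the cross-validation point a little more concrete, here is a minimal sketch (my own toy example in Python, with made-up numbers, not anything taken from an actual study): the effect is estimated in one half of a hypothetical sample and the resulting prediction is then tested in the held-out half, which mimics within a single data set what an independent replication would do.

```python
# Toy example: estimate an effect in an "exploration" half of the data,
# then test the prediction in a held-out "validation" half.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(0.5, 1.0, 80)   # hypothetical group with a true mean difference of 0.5
group_b = rng.normal(0.0, 1.0, 80)   # hypothetical control group

def cohens_d(a, b):
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

# Exploration half: estimate the effect size that motivates the "prediction"
d_exploration = cohens_d(group_a[:40], group_b[:40])

# Validation half: does the predicted group difference hold in unseen data?
t, p = stats.ttest_ind(group_a[40:], group_b[40:])
d_validation = cohens_d(group_a[40:], group_b[40:])

print(f"exploration d = {d_exploration:.2f}")
print(f"validation  d = {d_validation:.2f}, t-test p = {p:.3f}")
```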

The same applies to the tests employed by meta-science. These procedures take an effect size estimate from the scientific literature, calculate the probability that this effect could have been detected under the conditions of the experiment (the statistical power), and then make inferences on this post-hoc probability. The assumptions on which these procedures are based remain entirely untested. Insofar as they make predictions at all, such as whether an effect is likely to be replicated in future experiments, no effort is typically made to test them. Statistical probabilities are not a sufficient replacement for empirical tests. You can show me careful, mathematically coherent arguments as to why some probability should be such and such – if the equation is based on flawed assumptions and/or it doesn’t take into account some confounds, the resulting conclusions may be untenable. This doesn’t necessarily mean that the procedure is worthless. It is simply like all other science. It constructs an explanation for the chaotic world out there that may or may not be adequate. It can never be a perfect explanation and we should not treat it as if it were the unadulterated truth. This is really my main gripe with meta-science: proponents of these procedures treat them as if they were unshakable fact, and the conviction with which some promote these methods borders on religious zeal.

One example is the assertion that a lot of scientific findings are false positives. This argument is based on the premise that many published experiments are underpowered and that publication bias (which we know exists because researchers actively seek positive results) means that mainly positive findings are reported. In turn this may explain what some have called the “Decline Effect”, that is, initial effect size estimates are inflated and they gradually decrease and approach the true effect size as more and more data are collected.
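The inflation part of that argument is easy to demonstrate with a toy simulation (mine, with arbitrary numbers, not a re-analysis of any actual literature): if underpowered studies of a small true effect are filtered by statistical significance, as publication bias would do, the average “published” effect size is bound to be exaggerated.

```python
# Simulate many underpowered studies of a small true effect and compare the
# average effect size across all studies with the average across only the
# "publishable" (significant) ones.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_d, n = 0.3, 20                      # assumed true effect and per-group sample size
observed_d, significant = [], []
for _ in range(5000):
    a = rng.normal(true_d, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    _, p = stats.ttest_ind(a, b)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    observed_d.append((a.mean() - b.mean()) / pooled_sd)
    significant.append(p < 0.05)

observed_d = np.array(observed_d)
significant = np.array(significant)
print(f"true effect size:                 d = {true_d}")
print(f"average across all studies:       d = {observed_d.mean():.2f}")
print(f"average across significant ones:  d = {observed_d[significant].mean():.2f}")
```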

I don’t deny that lack of power and publication bias probably exist. However, I also think that the current explanations are insufficient to account for all the data. There are other reasons that may cause a reduction of effect size estimates as time goes on and more and more attempts at replication are made. Few of them are ever formally taken into account by meta-science models, partly because they are notoriously difficult to quantify. For instance, there is the question of whether all experiments are comparable. Even with identical procedures carried out by meticulous researchers, the quality of the reagents, of the experimental subjects, or more generally of the data we measure can differ markedly. I think this may be a particularly bad problem in psychology and cognitive neuroscience research, although it probably exists in many areas of science. I call this the Data Quality Decay Function:

[Figure: SubjectQuality]

Take for example the reliability and quality of data that can be expected from research subjects. In the early days after an experiment is conceived, we test people drawn from pools of reliable research subjects. If it is a psychophysical study on visual perception, chances are that the subjects are authors on the paper or at least colleagues who have considerable experience with doing experiments. The data reaped from such subjects will be clean, low-noise estimates of the true effect size. The cynical might call these kinds of subjects “representative” and possibly even “naive”, provided they didn’t co-author the paper at least.

As you add more and more subjects, the recruitment pool will inevitably widen. At first there will be motivated individuals who don’t mind sitting in dark rooms staring madly at a tiny dot while images are flashed up in their peripheral vision. Even though the typical laboratory conditions of experiments are a long way from normal everyday behavior, people like this will be engaged and willing enough to perform reasonably on the experiment, but they may fatigue more easily than trained lab members. Moreover, sooner or later your subject pool will encompass subjects with low motivation. There will be those who only take part in your study because of the money they get paid or (even worse) because they are coerced by course credit requirements. There may even be professional subjects who participate in several experiments within the same day. You can try to control for this but it won’t be fool-proof, because in the end you’ll have to take them at their word when you ask whether they have participated in other experiments. And to be honest, professional subjects may be more reliable than inexperienced ones, so it can be worthwhile to test them. Also, frequently you just don’t have the luxury to turn away subjects. I don’t know about your department, but people weren’t exactly kicking in the doors of any lab I have seen just to participate in experiments.

Eventually, you will end up testing random folk off the street. This is what you will want if you are actually interested in generalizing your findings to the human condition. Ideally, you will test the effect in a large, diverse, multicultural, multiethnic, multiracial sample that encompasses the full variance of our species (this very rarely happens). You may even try to relax the strict laboratory conditions of earlier studies. In fact you’ll probably be forced to, because Mongolian nomads or Amazonian tribeswomen, or whoever else your subject population may be, just don’t tend to hang around psychology departments in Western cities. The effect size estimates under these conditions will almost inevitably be smaller than those in the original experiments because of the reduced signal-to-noise ratio. Even if the true biological effect is constant across humanity, the variance will be greater.

This last point highlights why it isn’t so straightforward to say “But I want my findings to generalize, so the later estimate reflects the truth more accurately”. It really depends on what your research question is. If you want to measure an effect and make general predictions as to what this means for other human beings, then yes, you should test as wide a sample as possible and understand why any meaningful effect size is likely to be small. Say, for instance, we are testing the efficacy of a new drug and characterizing its adverse effects. Such experiments should be carried out on a wide-ranging sample to understand how differences between populations or individual background can account for side effects and whether the drug is even effective. You shouldn’t test a drug only on White Finnish men, only to find out later that it is wholly useless or even positively dangerous in Black Caribbean women. This is not just a silly example – this sort of sampling bias can be a serious concern.

On the other hand, when you are testing a basic function of the perceptual system in the human brain, testing a broad section of the human species is probably not the wisest course of action. My confidence in psychophysical results produced by experienced observers, even if they are certifiably non-naive and anything-but-blind to the purpose of the experiment (say, because they are the lead author of the paper and coded the experimental protocol), can still be far greater than it would be for the same measurements from individuals recruited from the general population. There are myriad factors influencing the latter that are much more tightly controlled in the former. Apart from issues with fatigue and practice with the experimental setting, they also may simply not really know what to look for. If you cannot produce an accurate report of your perceptual experience, you aren’t going to produce an accurate measurement of it.

Now this is one specific example and it obviously does not have to apply to all cases. I am pretty confident that the Data Quality Decay Function exists. It occurs for research subjects, but it could also relate to reagents that are being reused, small errors in a protocol that accumulate over time, and many other factors. In many situations the slope of the curve may be so shallow that the decay is effectively non-existent. There are also likely to be other factors that counteract and, in some cases, invert the function. For instance, if the follow-up experiments actually improve the methodology of an experiment, the data quality might even be enhanced. This is certainly the hope we have for science in general – but this development may take a very long time.

The point is, we don’t really know very much about anything. We don’t know how data quality, and thus effect size estimates, vary with time, across samples, between different experimenters, and so forth. What we do know is that under the most common assumptions (e.g. Gaussian errors of equal magnitude across groups) the sample sizes we can realistically use are insufficient for reliable effect size estimates. The main implication of the Data Quality Decay Function is that the effect size estimates under standard assumptions are probably smaller than the true effect.
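To illustrate what I mean, here is a crude simulation of the Data Quality Decay Function (my own sketch, with entirely made-up noise levels): the raw effect stays constant while measurement noise grows as the subject pool widens, and the standardized effect size estimate shrinks accordingly.

```python
# Crude sketch: a constant raw effect measured with increasing noise in
# successive "waves" of data collection yields shrinking Cohen's d estimates.
import numpy as np

rng = np.random.default_rng(42)
raw_effect = 1.0                      # constant "true" difference in raw units
noise_levels = [1.0, 1.5, 2.0, 3.0]   # hypothetical noise for expert, motivated,
                                      # coerced, and random-folk-off-the-street samples
for wave, sd in enumerate(noise_levels, start=1):
    a = rng.normal(raw_effect, sd, 50)
    b = rng.normal(0.0, sd, 50)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = (a.mean() - b.mean()) / pooled_sd
    print(f"wave {wave}: noise SD = {sd:.1f}, estimated Cohen's d = {d:.2f}")
```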

While I am quite a stubborn lady, as I said earlier, I am not so stubborn as to think this is the sole explanation. We know publication bias exists, so it is almost inevitable that it affects effect sizes in the literature. I also think that even if some of the procedures used to infer false positive rates and publication bias are based on untested assumptions and on logically flawed post-hoc probabilities, they reveal some truths. All meta-science is wrong – but that doesn’t make it wholly worthless. I just believe we should take it with a grain of salt and treat it like all other science. In the long run, meta-science will self-correct.

[Image: Sometimes when you’re in a local minimum the view is just better]

Self-correction is a fact

This brings me to the other point of today’s post, the claim that self-correction in science is a myth. I argue that self-correction is inherent to the scientific process itself. All the arguments against self-correction I have heard are based on another logical fallacy. People may say that the damn long time it took the scientific community to move beyond errors like phrenology or racial theories demonstrates that science does not by itself correct its mistakes. They suggest that because particular tools, e.g. peer review or replication, have failed to detect serious errors or even fraudulent results, science itself does not weed out such issues.

The logical flaw here is that all of these things are effectively point estimates from a slow time series. It is the same misconception that leads people to deny that global temperatures are rising because we have had some particularly cold winters in some specific years in fairly specific corners of the Earth, or the error that leads creationists to claim that evolution has never been observed directly. It is why previous generations of scientists found it so hard to accept the thought that the surface of the Earth comprises tectonic plates drifting on the slowly flowing mantle. Fortunately, science has already self-corrected that latter misconception, and seismology and plate tectonics are now accepted well beyond the scientific community. Sadly, evolution and climate change have not arrived at the same level of mainstream acceptance.

It seems somewhat ironic that we as scientists should find it so difficult to understand that science is a gradual, slow process. After all we are all aware of evolutionary, geological, and astronomical time scales. However, in the end scientists are human and thus subject to the same perceptual limits and cognitive illusions as the rest of our species. We may get a bit of an advantage compared to other people who simply never need to think about similar spatial and temporal dimensions. But in the end, our minds aren’t any better equipped to fathom the enormity and age of the cosmos than anybody else’s.

Science is self-correcting because that is what science does. It is the constant drive to seek better answers to the same questions and to derive new questions that can provide even better answers. If the old paradigms are no longer satisfactory, they are abandoned. It can and does happen all the time. Massive paradigm shifts may not be very frequent but that doesn’t mean they don’t happen. As I said last time, science does of course make mistakes and these mistakes can prevail for centuries. Using my model-fitting analogy again, one would say that the algorithm gets stuck in a “local minimum”. It can take a lot of energy to get out of that, but given enough time and resources it will happen. It could be a bright spark of genius that overthrows accepted theories. It could be that the explanatory construct of the status quo becomes so overloaded that it collapses like a house of cards. Or sometimes it may simply be a new technology or method that allows us to see things more clearly than before. Sometimes dogmatic, political, religious, or other social pressure can delay progress; for example, for a long time your hope of being taken seriously as a woman scientist was practically nil. In that case, what it takes to move science forward may be some fundamental change to our whole society.

Either way, bemoaning the fact that replication and skeptical scrutiny haven’t solved all problems, rectified every erroneous assumption, and refuted every false result is utterly pointless. Sure, we can take steps to reduce the number of false positives, but we shouldn’t go so far as to make it close to impossible to detect new important results. Don’t make the importance of a finding dependent on it being replicated hundreds of times first. We need replication for results to stand the test of time, but scientists will always try to replicate potentially important findings. If nobody can be bothered to replicate something, it may not be all that useful – at least at the time. Chances are that in 50 or 100 or 1000 years the result will be rediscovered and prove to be critical, and then our descendants will be glad that we published it.

By all means, change the way scientists are evaluated and how grants are awarded. I’ve said it before but I’ll happily repeat it. Immediate impact should not be the only yardstick by which to measure science. Writing grant proposals as catalogs of hypotheses when some of the work is inevitably exploratory in nature seems misguided to me. And I am certainly not opposed to improving our statistical practice, ensuring higher powered experiments, and encouraging strategies for more replication and cross-validation approaches.

However, the best and most important thing we can do to strengthen the self-correcting forces of science is to increase funding for research, to fight dogma wherever it may fester, and to train more critical and creative thinkers.

“In science it often happens that scientists say, ‘You know that’s a really good argument; my position is mistaken,’ and then they would actually change their minds and you never hear that old view from them again. They really do it. It doesn’t happen as often as it should, because scientists are human and change is sometimes painful. But it happens every day. I cannot recall the last time something like that happened in politics or religion.”

Carl Sagan

Why all research findings are false

(Disclaimer: For those who have not seen this blog before, I must again point out that the views expressed here are those of the demonic Devil’s Neuroscientist, not those of the poor hapless Sam Schwarzkopf whose body I am possessing. We may occasionally agree on some things but we disagree on many more. So if you disagree with me feel free to discuss with me on this blog but please leave him alone)

In my previous post I discussed the proposal that all¹ research studies should be preregistered. This is perhaps one of the most tumultuous ideas being pushed as a remedy for what ails modern science. There are of course others, such as the push for “open science”, that is, demands for free access to all publications, transparent post-publication review, and sharing of all data collected for experiments. This debate has even become entangled with age-old faith wars about statistical schools of thought. Some of these ideas (like preregistration or whether reviews should be anonymous) remain controversial and polarizing, while others (like open access to studies) are so contagious that they have become almost universally accepted, to the point that disagreeing with such well-meaning notions makes you feel like you have the plague. On this blog I will probably discuss each of these ideas at some point. However, today I want to talk about a more general point that I find ultimately more important, because this entire debate is just a symptom of a larger misconception:

Science is not sick. It never has been. Science is how we can reveal the secrets of the universe. It is a slow, iterative, arduous process. It makes mistakes but it is self-correcting. That doesn’t mean that the mistakes don’t sometimes stick around for centuries. Sometimes it takes new technologies, discoveries, or theories (all of which are of course themselves part of science) to make progress. Fundamental laws of nature will perhaps keep us from ever discovering certain things, say, what happens when you approach the speed of light, leaving them for theoretical consideration only. But however severe the errors, provided our species doesn’t become extinct through cataclysmic cosmic events or self-inflicted destruction, science has the potential to correct them.

Also science never proves anything. You may read in the popular media about how scientists “discovered” this or that, how they’ve shown certain things, or how certain things we believe turn out to be untrue. But this is just common parlance for describing what scientists actually do: they formulate hypotheses, try to test them by experiments, interpret their observations, and use them to come up with better hypotheses. Actually, and quite relevant to the discussion about preregistration, this process frequently doesn’t start with the formulation of hypotheses but with making chance observations. So a more succinct description of a scientist’s work is this: we observe the world and try to explain it.

Science as model fitting

In essence, science is just a model-fitting algorithm. It starts with noisy, seemingly chaotic observations (the black dots in the figures below) and it attempts to come up with a model that can explain how these observations came about (the solid curves). A good model can then make predictions as to how future observations will turn out. The numbers above the three panels in this figure indicate the goodness-of-fit, that is, how good an explanation the model is for the observed data. Numbers closer to 1 denote better model fits.

[Figure: CurveFitting]

It should be immediately clear that the model in the right panel is a much better description of the relationship between data points on the two axes than the other panels. However, it is also a lot more complex. In many ways, the simple lines in the left or middle panel are much better models because they will allow us to make predictions that are far more likely to be accurate. In contrast, for the model in the right panel, we can’t even say what the curve will look like if we move beyond 30 on the horizontal axis.

One of the key principles in the scientific method is the principle of parsimony, also often called Occam’s Razor. It basically states that whenever you have several possible explanations for something, the simplest one is probably correct (it doesn’t really say it that way, but that’s the folk version and it serves us just fine here). Of course, we should weigh the simplicity of an explanation against its explanatory or predictive power. The goodness-of-fit of the middle panel is better than that of the left panel, although not by much. At the same time, the middle model isn’t that much more complex than the simple linear relationship shown in the left panel. So we could perhaps accept the middle panel as our best explanation – for now.

The truth though is that we can never be sure what the true underlying explanation is. We can only collect more data and see how well our currently favored models do in predicting them. Sooner or later we will find that one of the models is just doing better than all the others. In the figure below the models fitted to the previous observations are shown as red curves while the black dots are new observations. It should have become quite obvious that the complex model in the right panel is a poor explanation for the data. The goodness-of-fit on these new observations for this model is now much poorer than for the other two. This is because this complex model was actually overfitting the data. It tried to come up with the best possible explanation for every observation instead of weighing explanatory power against simplicity. This is probably kind of what is going on in the heads of conspiracy theorists. It is the attempt to make sense of a chaotic world without taking a step back to think whether there might not be simpler explanations and whether our theory can make testable predictions. However, as extreme as this case may look, scientists are not immune from making such errors either. Scientists are after all human.

[Figure: CrossValidation]

I will end the model-fitting analogy here. Suffice it to say that with sufficient data it should become clear that the curve in the middle panel is the best-fitting of the three options. However, it is actually also wrong. Not only is the function used to model the data not the one that was actually used to generate the observations, but the model also cannot really predict the noise, the random variability spoiling our otherwise beautiful predictions. Even in the best-fitting case the noise prevents us from predicting future observations perfectly. The ideal model would not only need to describe the relationship between data points on the horizontal and vertical axes but would also have to predict the random fluctuation added on top of it. This is unfeasible and presumably impossible without perfect knowledge of the state of everything in the universe from the nanoscopic to the astronomical scale. If we tried it, the result would most likely look like the overfitted example in the right panel. Therefore this unexplainable variance will always remain in any scientific finding.
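For readers who want to play with this analogy themselves, here is a rough analogue of the three panels (my own sketch, not the code used to generate the figures above): a line, a moderately curved model, and an over-flexible polynomial are fitted to one noisy data set and then evaluated on a fresh one.

```python
# Fit models of increasing flexibility to "old" observations and check how
# well each generalises to "new" observations drawn from the same process.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def generate(n):
    x = np.sort(rng.uniform(0, 30, n))
    y = 0.1 * x + 0.02 * x**2 + rng.normal(0, 4, n)   # assumed "true" process plus noise
    return x, y

def r_squared(y, y_hat):
    return 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)

x_old, y_old = generate(20)   # the original observations
x_new, y_new = generate(20)   # observations collected later

for degree in (1, 2, 12):     # line, moderate curve, over-flexible model
    model = Polynomial.fit(x_old, y_old, degree)
    print(f"degree {degree:2d}: fit to old data = {r_squared(y_old, model(x_old)):.2f}, "
          f"fit to new data = {r_squared(y_new, model(x_new)):.2f}")
```

The flexible model wins on the old data and loses on the new data, which is the whole point of the cross-validation figure.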

[Image: A scientist will keep swimming to find what lies beyond that horizon]

Science is always wrong

This analogy highlights why the fear of incorrect conclusions and false positives that has germinated in recent scientific discourse is irrational and misguided. I may have many crises but reproducibility isn’t one of them. Science is always wrong. It is doomed to always chase a deeper truth without any hope of ever reaching it. This may sound bleak but it truly isn’t. Being wrong is inherent to the process. This is what makes science exciting. These ventures into the unknown drive most scientists, which is why many of us actually like the thought of getting up in the morning and going to work, why we stay late in the evening trying to solve problems instead of doing something more immediately meaningful, and why we put up with pitifully low salaries compared to our former classmates who ended up getting “real jobs”. It is also the same daring and curiosity that drove our ancestors to invent tools, discover fire, and cross unforgiving oceans in tiny boats made out of tree trunks. Science is an example of the highest endeavors the human spirit is capable of (it is not the only one but this topic is outside the scope of this blog). If I wanted unwavering certainty that I know the truth of the world, I’d have become a religious leader, not a scientist.

Now one of the self-declared healers of our “ailing” science will doubtless interject that nobody disagrees with me on this, that I am just being philosophical, or playing with semantics. Shouldn’t we guarantee, or so they will argue, that research findings are as accurate and true as they can possibly be? Surely, the fact that many primary studies, in particular those in high profile journals, are notoriously underpowered is cause for concern? Isn’t publication bias, the fact that mostly significant findings are published while null findings are not, the biggest problem for the scientific community? It basically means that we can’t trust a body of evidence because even in the best-case scenario the strength of evidence is probably inflated.

The Devil’s Neuroscientist may be evil and stubborn but she² isn’t entirely ignorant. I am not denying that some of these issues are problematic. But fortunately the scientific method already comes with a natural resistance, if not a perfect immunity, against them: skepticism and replication. Scientists use them all the time. Those people who have not quite managed to wrap their heads around the fact that I am not my alter ego, Sam Schwarzkopf, will say that I am sounding like a broken record³. While Sam and my humble self don’t see eye to eye on everything, we probably agree on these points, as he has repeatedly written about this in recent months. So as a servant of the devil, perhaps I sound like a demonic Beatles record: noitacilper dna msicitpeks.

There are a lot of myths about replication and reproducibility and I will write an in-depth post about that at a future point. Briefly though, let me stress that, evil as I may be, I believe that replication is a cornerstone of scientific research. Replication is the most trustworthy test for any scientific claim. If a result is irreplicable, perhaps because the experiment was a once-in-an-age opportunity, because it would be too expensive to do twice, or for whatever other reason, then it may be interesting but it is barely more than an anecdote. At the very least we should expect pretty compelling evidence for any claims made about it.

Luckily, for most scientific discoveries this is not the case. We have the liberty and the resources to repeat experiments, with or without systematic changes, to understand the factors that govern them. We should and can replicate our own findings. We can and should replicate other people’s findings. The more we do of this the better. This doesn’t mean we need to go on a big replication rampage like the “Many Labs” projects. Not that I have anything against this sort of thing if people want to spend their time in this way. I think for a lot of results this is probably a waste of time and resources. Rather I believe we should encourage a natural climate of replication and I think it already exists although it can be enhanced. But as I said, I will specifically discuss replication in a future post so I will leave this here.

Instead let me focus on the other defense we have at our disposal. Skepticism is our best weapon against fluke results. You should never take anything you read in a scientific study at face value. If there is one thing every scientist should learn, it is this. In writing, scientific results look more convincing and “cleaner” than they are when you’re in the middle of experiments and data analysis. And even for those (rare?) studies with striking data, seemingly unassailable statistics, and the most compelling intellectual arguments you should always ask “Could there be any other explanation for this?” and “What hypothesis does this finding actually disprove?” The latter question underlines a crucial point. While I said that science never proves anything, it does disprove things all the time. This is what we should be doing more of, and we should probably start with our own work. Certainly, if a hypothesis isn’t falsifiable it is pretty meaningless to science. Perhaps a more realistic approach was advocated by Platt in his essay “Strong Inference”. Instead of testing whether one hypothesis is true we should pit two or more competing hypotheses against each other. In psychology and neuroscience research this is actually not always easy to do. Yet in my mind it is precisely the approach that some of the best studies in our field take. Doing this immunizes you against the infectiousness of dogmatic thinking because you no longer feel the need to prove your little pet theory and you don’t run control experiments simply to rule out trivial alternatives. But admittedly this is often very difficult because typically one of the hypotheses is probably more exciting…

The point is, we should foster a climate where replication and skepticism are commonplace. We need to teach self-critical thinking and reward it. We should encourage adversarial collaborative replication efforts and the use of multiple hypotheses wherever possible. Above all we need to make people understand that criticism in science is not a bad thing but essential. Perhaps part of this involves training some basic people skills. It should be possible to display healthy, constructive skepticism without being rude and aggressive. Most people have stories to tell of offensive and irritating colleagues and science feuds. However, at least in my alter ego’s experience, most scientific disagreements are actually polite and constructive. Of course there are always exceptions: reviewer 2 we should probably just shoot into outer space.

What we should not do is listen to some delusional proposals about how to evaluate individual researchers, or even larger communities, by the replicability and other assessments of the truthiness of their results. Scientists must accept that we are ourselves mostly wrong about everything. Sometimes the biggest impact, in as far as that can be quantified, is not made by the person who finds the “truest” finding but by whoever lays the groundwork for future researchers. Even a completely erroneous theory can give some bright mind the inspiration for a better one. And even the brightest minds go down the garden path sometimes. Johannes Kepler searched for a beautiful geometry of the motion of celestial bodies that simply doesn’t exist. That doesn’t make it worthless as his work was instrumental for future researchers. Isaac Newton wasted years of his life dabbling in alchemy. And even on the things he got “right”, describing the laws governing motion and gravity, he was also really kind of wrong because his laws only describe a special case. Does anyone truly believe that these guys didn’t make fundamental contributions to science regardless of what they may have erred on?

[Image: May all your pilot experiments soar over the clouds like this, not crash and burn in misery]

Improbability theory

Before I will leave you all in peace (until the next post anyway), I want to make some remarks about some of the more concrete warnings about the state of research in our field. A lot of words are oozing out of the orifices in certain corners about the epidemic of underpowered studies and the associated spread of false positives in the scientific literature. Some people put real effort into applying statistical procedures to whole hosts of published results to reveal the existence of publication bias or “questionable research practices”. The logic behind these tests is that the aggregated power over a series of experiments makes it very improbable that statistically significant effects could be found in all of them. Apparently, this test flags up an overwhelming proportion of studies in some journals as questionable.
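For what it’s worth, the arithmetic behind this logic is simple enough to sketch (a simplified illustration with invented numbers, not Francis’s actual implementation): multiply the estimated power of each experiment in a paper and flag the paper if the product falls below some criterion.

```python
# Simplified excess-significance logic: if each experiment has only moderate
# power, an unbroken string of significant results is jointly improbable.
powers = [0.6, 0.7, 0.55, 0.65, 0.6]   # hypothetical estimated power of five experiments
joint = 1.0
for p in powers:
    joint *= p                          # probability that every experiment is significant

print(f"probability of all {len(powers)} experiments being significant: {joint:.3f}")
print("flagged as 'too good to be true'" if joint < 0.1 else "not flagged")
```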

I fail to see the point of this. First of all, what good will come from naming and shaming studies/researchers who apparently engaged in some dubious data massaging, especially when, as we are often told, these problems are widespread? One major assertion that is then typically made is that the researchers ran more experiments than they reported in the publication but chose to withhold the non-significant results. While I have no doubt that this does in fact happen occasionally, I believe it is actually pretty rare. Perhaps it is because Sam, whose experiences I share, works in neuroimaging, where it would be pretty damn expensive (both in terms of money and time investment) to run lots of experiments and only publish the significant or interesting ones. Then again, he certainly has heard of published fMRI studies where a whopping number of subjects were excluded for no good reason. So some of that probably does exist. However, he was also trained by his mentors to believe that all properly executed science should be published, and this is the philosophy by which he is trying to conduct his own research. So unless he is somehow rare in this, or behavioral/social psychology research (about which claims of publication bias are made most often) is for some reason much worse than other fields, I don’t think unreported experiments are an enormous problem.

What instead might cause “publication bias” is the tinkering that people sometimes do in order to optimize their experiments and/or maximize the effects they want to measure. This process is typically referred to as “piloting” (not sure why really – what does this have to do with flying a plane?). It is again highly relevant to our previous discussion of preregistration. This is perhaps the point where preregistration of an experimental protocol might have its use: first do lots of tinker-explore-piloting to optimize the way an experimental question is addressed; then preregister this optimized protocol for a real study that answers the question while strictly following the protocol. Of course, as I argued last time, you could instead just publish the tinkered experiments, and then you or someone else can try to replicate them using the previously published protocol. If you want to preregister those efforts, be my guest. I am just not convinced it is necessary or even particularly helpful.

Thus part of the natural scientific process will inevitably lead to what appears like publication bias. I think this is still pretty rare, in neuroimaging studies at least. Another nugget of wisdom about imaging that Sam has learned from his teachers, and which he is trying to impart to his own students, is that in neuroimaging you can’t just constantly fiddle with your experimental paradigm. If you do, you will not only run out of money pretty quickly but also end up with lots of useless data that cannot be combined in any meaningful way. Again, I am sure some of these things happen (maybe some people are just really unscrupulous about combining data that don’t belong together) but I doubt that this is extremely common.

So perhaps the most likely source of inflated effect sizes in a lot of research is the set of questionable research practices often called “p-hacking”, for example trying different forms of outlier removal or different analysis pipelines and only reporting the one producing the most significant results. As I discussed previously, preregistration aims to control for this by forcing people to be upfront about which procedures they planned to use all along. However, a simpler alternative is to ask authors to demonstrate the robustness of their findings across a reasonable range of procedural options. This achieves the same thing without requiring the large structural change of implementing a preregistration system.
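As a sketch of what such a robustness report could look like (entirely my own, hypothetical example), one could simply rerun the same comparison under every defensible outlier criterion and report the full range of outcomes instead of a single, conveniently chosen one:

```python
# Rerun one comparison under several defensible outlier-removal thresholds
# and report the whole range of results rather than the "best" one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(0.4, 1.0, 40)   # hypothetical experimental group
b = rng.normal(0.0, 1.0, 40)   # hypothetical control group

for z_cut in (2.0, 2.5, 3.0, np.inf):            # np.inf = no outlier removal
    keep_a = a[np.abs(stats.zscore(a)) < z_cut]
    keep_b = b[np.abs(stats.zscore(b)) < z_cut]
    t, p = stats.ttest_ind(keep_a, keep_b)
    print(f"|z| < {z_cut}: n = {len(keep_a)} vs {len(keep_b)}, t = {t:.2f}, p = {p:.3f}")
```

If the conclusion only survives under one particular threshold, readers deserve to know that.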

However, while I believe some of the claims about inflated effect sizes in the literature are most likely true, I think there is a more nefarious problem with the statistical approach to inferring such biases. It lies in its very nature, namely that it is based on statistics. Statistical tests are about probabilities. They don’t constitute proof. Just like science at large, statistics never prove anything, except perhaps for the rare situations where something is either impossible or certain – which typically renders statistical tests redundant.

There are also some fundamental errors in the rationale behind some of these procedures. To make an inference about the power of an experiment based on the strength of the observed result is to incorrectly assign a probability to an event after it has occurred. The probability of an observed event having occurred is 1 – it is completely irrelevant how unlikely it was a priori. Proponents of this approach try to weasel out of this conundrum by assuming that the true effect size is of a similar magnitude to the one observed in the published experiment, and using that assumption to estimate the power of the experiment. This assumption is untenable because the true effect size is almost certainly not the one that was observed. There is a lot more to be said about this state of affairs, but I won’t go into it because others have already summarized many of the arguments much better than I could.
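A small simulation makes the problem tangible (again my own sketch, with numbers picked for illustration): even when the true power of a design is known exactly, the “observed power” computed from each experiment’s own effect size estimate is all over the place.

```python
# Compare the power implied by the true effect with the "observed power"
# computed from each simulated experiment's own effect size estimate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
true_d, n = 0.5, 30
df = 2 * n - 2
t_crit = stats.t.ppf(0.975, df)          # two-sided criterion at alpha = .05

def power_for(d):
    """Two-sample t-test power for per-group size n, assuming effect size d."""
    ncp = d * np.sqrt(n / 2)
    return 1 - stats.nct.cdf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

observed_power = []
for _ in range(2000):
    a = rng.normal(true_d, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    observed_power.append(power_for((a.mean() - b.mean()) / pooled_sd))

lo, hi = np.percentile(observed_power, [2.5, 97.5])
print(f"power at the true effect size:  {power_for(true_d):.2f}")
print(f"'observed power' (95% range):   {lo:.2f} to {hi:.2f}")
```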

In general I simply wonder how good statistical procedures actually are at estimating true underlying effects in practice. Simulations are no doubt necessary to evaluate a statistical method because we can work with known ground truths. However, they can only ever be approximations to real situations encountered in experimental research. While the statistical procedures for detecting publication bias probably seem to make sense in simulations, their experimental validity actually remains completely untested. In essence, they are just bad science because they aim to show an effect without a control condition, which is really quite ironic. The very least I would expect to see from these efforts is some proof that these methods actually work for real data. Say we set up a series of 10 experiments on an effect we can be fairly confident actually exists, for example the Stroop effect or the fact that visual search performance for a feature singleton is independent of set size while searching for a conjunction of features is not. Will all or most of these 10 experiments come out significant? And if so, will the “excess significance test” detect publication bias?
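Short of actually running those 10 experiments, a simulation gives a flavor of what to expect (a stand-in under assumptions of my own, using a simplified version of the excess-significance logic rather than Francis’s exact procedure):

```python
# Simulate ten honest, reasonably powered experiments on a genuinely large
# effect and apply a simplified excess-significance check to the series.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2015)
true_d, n = 0.8, 30                        # assumed large, real effect; per-group n
df = 2 * n - 2
t_crit = stats.t.ppf(0.975, df)

def power_for(d):
    ncp = d * np.sqrt(n / 2)
    return 1 - stats.nct.cdf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

p_values, d_hats = [], []
for _ in range(10):
    a = rng.normal(true_d, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    _, p = stats.ttest_ind(a, b)
    p_values.append(p)
    d_hats.append((a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2))

n_sig = sum(p < 0.05 for p in p_values)
joint = np.prod([power_for(d) for d in d_hats])   # chance of an all-significant series
print(f"significant experiments: {n_sig} out of 10")
print(f"estimated probability of all ten being significant: {joint:.3f}"
      f" -> {'flagged' if joint < 0.1 else 'not flagged'} at a 0.1 criterion")
```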

Whatever the outcome of such experiments on these tests, one thing I already know: any procedure that claims to find evidence that over four out of five published studies should not be believed is itself not to be believed. While we can’t really draw firm conclusions from this, the fact that this rate is the same in two different applications of the procedure certainly seems suspicious to me. Either it is not working as advertised or it is detecting something trivial we should already know. In any case, it is completely superfluous.

I also want to question a more fundamental problem with this line of thinking. Most of these procedures and demonstrations of how horribly underpowered scientific research is seem to make a very sweeping assumption: that all scientists are generally stupid. Researchers are not automatons that blindly stab in the dark in the hope that they will find a “significant” effect. Usually scientists conduct research to test some hypothesis that is more or less reasonable. Even the most exploratory wild goose chases (and I have certainly heard of some) will make sense at some level. Thus the carefully concocted arguments about the terrible false discovery rates in research probably vastly underestimate the probability that hypothesized effects actually exist, and there is after all “reason to think that half the tests we do in the long run will have genuine effects.”

Naturally, it is hard to put concrete numbers on this. For some avenues of research it will no doubt be lower. Perhaps for many hypotheses tested by high-impact studies the probability may be fairly low, reflecting the high risk and surprise factor of these results. For drug trials the 10% figure may be close to the truth. For certain effects, such as those of precognition or telepathy or homeopathy, I agree with Sam Schwarzkopf, Alex Holcombe, and David Colquhoun (to name but a few) that the probability that they exist is extremely low. But my guess is that in many fields the probability that hypothesized effects exist ought to be better than a coin toss.
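To see why that prior matters so much, here is the standard positive-predictive-value arithmetic with some numbers of my own choosing (they are illustrative, not estimates of any real field):

```python
# Proportion of significant findings that reflect real effects, as a function
# of the prior probability that a tested hypothesis is true.
alpha, power = 0.05, 0.8     # assumed false positive rate and average power
for prior in (0.1, 0.5):
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    ppv = true_positives / (true_positives + false_positives)
    print(f"prior = {prior:.1f}: {ppv:.0%} of significant findings reflect real effects")
```

With a coin-toss prior, most significant findings would reflect real effects even at modest power; the bleakest estimates only follow if one assumes that hypothesized effects almost never exist.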

[Image: To cure science the Devil’s Neuroscientist prescribes a generous dose of this potion (produced at farms like this one in New Zealand)]

Healthier science

I feel I have sufficiently argued that science isn’t actually sick, so I don’t think we need to rack our brains over possible means to cure it. However, this doesn’t imply we can’t do better. We can certainly aim to keep science healthy or make it even healthier.

So what is to be done? As I have already argued, I believe the most important step we should take is to encourage replication and a polite but critical scrutiny of scientific claims. I also believe that at the root of most of the purported problems with science these days is the way we evaluate impact and how grants are allocated. Few people would say that the number of high-impact publications on a resume tells us very much about how good a scientist a person is. Does anyone? I’m sure nobody truly believes that the number of downloads or views or media reports a study receives tells us anything about its contribution to science.

And yet I think we shouldn’t only value those scientists who conduct dry, incremental research. I don’t know what a good measure of a researcher’s contribution to their field would be. Citations are not perfect but they are probably a good place to start. There probably is no good way other than hearsay and personal experience to really know how careful and skilled a particular scientist is in their work.

What I do know is that the replicability of one’s research and the correctness of one’s hypotheses alone aren’t a good measure. The most influential scientists can also be the ones who make some fundamental errors. And there are some brilliant scientists, whose knowledge is far greater than mine (or Sam’s) will ever be and whose meticulousness and attention-to-detail would put most of us to shame – but they can and do still have theories that will turn out to be incorrect.

If we follow down that dead end the Crusaders for True Science have laid out for us, if we trust only preregistered studies and put on pedestals those who are fortunate (or risk-averse) enough to only do research that ends up being replicated, in short, if we only regard “truth” in science, we will emphasize the wrong thing. Then science will really be sick and frail and it will die a slow, agonizing death.

¹ Proponents of preregistration keep reminding us that “nobody” suggests that preregistration should be mandatory or that it should be for all studies. I want to ask these people: what do you think will happen if preregistration becomes commonplace? How would you regard non-registered studies? What kinds of studies should not be preregistered?

² The Devil’s Neuroscientist recently discovered she is a woman but unlike other extra-dimensional entities the Devil’s Neuroscientist is not “whatever it wants to be.”

³ Does anyone still know what a record is? Or perhaps in this day and age they know again?