Of self-correction and selfless errors

I had originally planned to discuss today’s topic at a later point, perhaps as part of my upcoming post about the myths of replication. However, discussions surrounding my previous posts, as well as the ongoing focus on the use of post hoc power analysis in the literature, led me to the decision to address this point now.

A central theme of my previous two posts was the notion that science is self-correcting and that replication and skepticism are the best tools at our disposal. Discussions my alter ego Sam has had with colleagues, as well as discussions in the comment section on this blog and elsewhere, reveal that many from the ranks of the Crusaders for True Science call that notion into question. In that context, I would like to thank a very “confused” commenter on my blog for referring me to an article I hadn’t read, which is literally entitled “Why science is not necessarily self-correcting”. I would also like to thank Greg Francis, who commented on his application of statistical tests to detect possible publication bias in the literature. Recently I also became aware of another statistical procedure based on assumptions about statistical power, called the Replication Index, that was proposed as an alternative to the Test of Excess Significance used by Francis. I think these people are genuinely motivated by a selfless desire to improve the current state of science. This is a noble goal but I think it is fraught with errors and some potentially quite dangerous misunderstandings.

The Errors of Meta-Science

I will start with the statistical procedures for detecting publication bias and the assertion that most scientific findings are false positives. I call this entire endeavor “meta-science” because this underlines the fundamental problem with this whole discussion. As I pointed out in my previous post, science is always wrong. It operates like a model-fitting procedure to gradually improve the explanatory and predictive value of our attempts to understand a complex universe. The point that people are missing in this entire debate about asserted false positive rates, non-reproducibility, and publication bias is that the methods used to make these assertions are themselves science. Thus these procedures suffer from the same problems as any scientific effort in that they seek to approximate the truth but can never actually hope to reach it. By using scientific methods to understand the workings of the scientific method, the entire logic of this approach is circular.

Circular inference has recently received a bit of attention within neuroscience. I don’t know if the authors of this paper actually coined the term “voodoo correlations”. Perhaps they merely popularized it. The same logical fallacy has also been called “double-dipping”. However, all this is really simply circular reasoning and somewhat related to “begging the question”. It is more a problem with flawed logic than with science. Essentially, it is what happens when you use the same measurements to test the validity of your predictions as you did for making the predictions in the first place.

This logical fallacy can result in serious errors. However, in the real world this isn’t actually entirely avoidable and it isn’t always problematic as long as we are aware of its presence. For instance, a point most people are missing is that whenever they report something like t(31)=5.2, p<0.001, or a goodness-of-fit statistic, etc., they are using circular inference. You are reporting an estimate of the effect size (be it a t-statistic, the goodness-of-fit, Cohen’s d or others) based on the observed data and then drawing some sort of general conclusion from it. The goodness of a curve fit is literally calculated using the accuracy with which the model predicts the observed data. Just observing an effect size, say, a difference in some cognitive measure between males and females, can only tell you that this difference exists in your sample. You can make some probabilistic inferences about how this observed effect may generalize to the larger population and this is what statistical procedures do – however, in truth you cannot know what an effect means for the general population until you have checked your predictions through empirical observations.

There are ways to get out of this dilemma, for example through cross-validation procedures. I believe this should be encouraged, especially whenever a claim about the predictive value of a hypothesis is made. More generally, replication attempts are of course a way to test predictions from previous results. Again, we should probably encourage more of that and cross-validation and replication can ideally be combined. Nevertheless, the somewhat circular nature of reporting observed effect sizes isn’t necessarily a major problem provided we keep in mind what an effect size estimate can tell us and what it can’t.
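To make the difference between circular and cross-validated estimates concrete, here is a minimal sketch using only Python’s standard library. Everything in it is an illustrative assumption rather than an analysis from any of the papers mentioned: the outcome and all 200 “features” are pure noise, so the true correlation is zero by construction. The function names (`pearson_r`, `one_experiment`, `average_over_experiments`) and the sample sizes are hypothetical.

```python
import random
import statistics


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def one_experiment(rng, n_subjects=40, n_features=200):
    """Return (circular_r, held_out_r) for one data set of pure noise."""
    outcome = [rng.gauss(0, 1) for _ in range(n_subjects)]
    features = [[rng.gauss(0, 1) for _ in range(n_subjects)]
                for _ in range(n_features)]

    # Circular: select the best feature and report its correlation
    # on the very same data that were used for the selection.
    circular_r = max(abs(pearson_r(f, outcome)) for f in features)

    # Cross-validated: select on the first half of the subjects,
    # then evaluate the selected feature on the untouched second half.
    half = n_subjects // 2
    best = max(features,
               key=lambda f: abs(pearson_r(f[:half], outcome[:half])))
    sign = 1.0 if pearson_r(best[:half], outcome[:half]) >= 0 else -1.0
    held_out_r = sign * pearson_r(best[half:], outcome[half:])
    return circular_r, held_out_r


def average_over_experiments(n_experiments=50, seed=1):
    """Average both estimates over many simulated noise data sets."""
    rng = random.Random(seed)
    results = [one_experiment(rng) for _ in range(n_experiments)]
    circular = statistics.fmean([r[0] for r in results])
    held_out = statistics.fmean([r[1] for r in results])
    return circular, held_out


if __name__ == "__main__":
    circular, held_out = average_over_experiments()
    print(f"circular estimate: {circular:.2f}")
    print(f"held-out estimate: {held_out:.2f}")
```

Under these assumptions the circular estimate should come out clearly above zero even though nothing is there, while the held-out estimate should hover around zero. The split-half scheme is only the simplest form of cross-validation; it is used here because it keeps the sketch short.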

The same applies to the tests employed by meta-science. These procedures take an effect size estimate from the scientific literature, calculate the probability that an effect of this size would have been detected under the conditions of the experiment (statistical power), and then make inferences based on this post hoc probability. The assumptions on which these procedures are based remain entirely untested. Insofar as they make predictions at all, such as whether the effect is likely to be replicated in future experiments, no effort is typically made to test them. Statistical probabilities are not a sufficient replacement for empirical tests. You can show me careful, mathematically coherent arguments as to why some probability should be such and such – if the equation is based on flawed assumptions and/or it doesn’t take into account some confounds, the resulting conclusions may be untenable. This doesn’t necessarily mean that the procedure is worthless. It is simply like all other science. It constructs an explanation for the chaotic world out there that may or may not be adequate. It can never be a perfect explanation and we should not treat it as if it were the unadulterated truth. This is really my main gripe with meta-science: Proponents of these meta-science procedures treat them as if they were unshakable fact and the conviction with which some promote these methods borders on religious zeal.
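One way to see why post hoc power inherits the circularity of the effect size estimate is to compute it directly. The sketch below is a deliberate simplification: it assumes a two-sided z-test and plugs the observed z-value in as if it were the true effect. It is not the Test of Excess Significance or the Replication Index, and the function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def two_sided_p(z_observed):
    """Two-sided p-value for an observed z statistic."""
    return 2 * (1 - _STD_NORMAL.cdf(abs(z_observed)))


def observed_power(z_observed, alpha=0.05):
    """Post hoc power of a two-sided z-test, treating the observed
    z statistic as if it were the true standardized effect."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z = abs(z_observed)
    upper_tail = 1 - _STD_NORMAL.cdf(z_crit - z)
    lower_tail = _STD_NORMAL.cdf(-z_crit - z)
    return upper_tail + lower_tail


if __name__ == "__main__":
    for z in (1.0, 1.96, 2.5, 3.0):
        print(f"z = {z:.2f}  p = {two_sided_p(z):.4f}  "
              f"observed power = {observed_power(z):.2f}")
```

In this simplified setting the observed power is a fixed function of the observed statistic, and therefore of the p-value: a result sitting right at the significance threshold has an observed power of about one half, whatever the design of the study was. The calculation therefore adds no information that was not already in the data it was computed from.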

One example is the assertion that a lot of scientific findings are false positives. This argument is based on the premise that many published experiments are underpowered and that publication bias (which we know exists because researchers are actively seeking positive results) means that mainly positive findings are reported. In turn this may explain what some have called the “Decline Effect”, that is, initial effect size estimates are inflated and they gradually decrease and approach the true effect size as more and more data are collected.
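The argument can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of any real literature: every study measures the same true effect (`true_d = 0.3`), uses 20 subjects per group, the variance is treated as known so that a z-test applies, and only significant results get “published”. All names and parameter values are hypothetical.

```python
import random
import statistics
from statistics import NormalDist


def simulate_literature(true_d=0.3, n_per_group=20, n_studies=5000,
                        alpha=0.05, seed=7):
    """Return (mean effect over all studies,
               mean effect over 'published' studies,
               fraction of studies that were significant)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of a difference of two means with unit variance.
    standard_error = (2 / n_per_group) ** 0.5

    all_effects, published_effects = [], []
    for _ in range(n_studies):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
        effect = statistics.fmean(treated) - statistics.fmean(control)
        all_effects.append(effect)
        # Publication filter: only significant results are reported.
        if abs(effect / standard_error) > z_crit:
            published_effects.append(effect)

    return (statistics.fmean(all_effects),
            statistics.fmean(published_effects),
            len(published_effects) / n_studies)


if __name__ == "__main__":
    all_mean, published_mean, rate = simulate_literature()
    print(f"mean effect, all studies:       {all_mean:.2f}")
    print(f"mean effect, published studies: {published_mean:.2f}")
    print(f"fraction significant:           {rate:.2f}")
```

With these settings the average over all studies should sit close to the true effect, while the average over the published subset should be considerably larger, because only studies that happened to overestimate the effect clear the significance threshold. Later, larger studies would then appear to show a “decline” even though the true effect never changed.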

I don’t deny that lack of power and publication bias probably exist. However, I also think that current explanations are insufficient for explaining all the data. There are other reasons that may cause a reduction of effect size estimates as time goes on and more and more attempts at replication are made. Few of them are ever formally taken into account by meta-science models, partly because they are notoriously difficult to quantify. For instance, there is the question of whether all experiments are comparable. Even with identical procedures carried out by meticulous researchers, the quality of reagents, of experimental subjects, or generally the data we measure can differ markedly. I think this may be a particularly bad problem in psychology and cognitive neuroscience research although it probably exists in many areas of science. I call this the Data Quality Decay Function:

[Figure: SubjectQuality]

Take, for example, the reliability and quality of data that can be expected from research subjects. In the early days after the experiment was conceived we test people drawn from subject pools of reliable research subjects. If it is a psychophysical study on visual perception, chances are that the subjects are authors on the paper or at least colleagues who have considerable experience with doing experiments. The data reaped from such subjects will be clean, low noise estimates of the true effect size. The cynical might call these kinds of subjects “representative” and possibly even “naive”, provided they didn’t co-author the paper at least.

As you add more and more subjects the recruitment pool will inevitably widen. At first there will be motivated individuals who don’t mind sitting in dark rooms staring madly at a tiny dot while images are flashed up in their peripheral vision. Even though the typical laboratory conditions of experiments are a long way from normal everyday behavior, people like this will be engaged and willing enough to perform reasonably on the experiment but they may fatigue more easily than trained lab members. Moreover, sooner or later your subject pool will encompass subjects with low motivation. There will be those who only take part in your study because of the money they get paid or (even worse) because they are coerced by course credit requirements. There may even be professional subjects who participate in several experiments within the same day. You can try to control this but it won’t be foolproof, because in the end you’ll have to take them at their word even if you ask them whether they have participated in other experiments. And to be honest, professional subjects may be more reliable than inexperienced ones so it can be worthwhile to test them. Also, frequently you just don’t have the luxury of turning away subjects. I don’t know about your department but people weren’t exactly kicking in the doors of any lab I have seen just to participate in experiments.

Eventually, you will end up testing random folk off the street. This is what you will want if you are actually interested in generalizing your findings to the human condition. Ideally, you will test the effect in a large, diverse, multicultural, multiethnic, multiracial sample that encompasses the full variance of our species (this very rarely happens). You may even try to relax the strict laboratory conditions of earlier studies. In fact you’ll probably be forced to because Mongolian nomads or Amazonian tribeswomen, or whoever else your subject population may be, just don’t tend to hang around psychology departments in Western cities. The effect size estimate under these conditions will almost inevitably be smaller than those in the original experiments because of the reduced signal-to-noise ratio. Even if the true biological effect is constant across humanity, the variance will be greater.
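The last point follows directly from how standardized effect sizes are defined: Cohen’s d is the raw difference divided by the standard deviation, so the same raw effect yields a smaller d when the measurements are noisier. Here is a minimal sketch, assuming Gaussian noise and equal variances in both groups; the noise levels chosen for the “lab” and “street” samples are made up for illustration.

```python
import math
import random
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b)
                          / (n_a + n_b - 2))
    return (statistics.fmean(group_b) - statistics.fmean(group_a)) / pooled_sd


def simulate_d(raw_effect, noise_sd, n_per_group=2000, seed=0):
    """Estimate d for a fixed raw effect measured with a given noise level."""
    rng = random.Random(seed)
    control = [rng.gauss(0.0, noise_sd) for _ in range(n_per_group)]
    treated = [rng.gauss(raw_effect, noise_sd) for _ in range(n_per_group)]
    return cohens_d(control, treated)


if __name__ == "__main__":
    # Same raw effect in both cases; only the noise level differs.
    d_lab = simulate_d(raw_effect=1.0, noise_sd=1.0)     # experienced observers
    d_street = simulate_d(raw_effect=1.0, noise_sd=3.0)  # broad, noisy sample
    print(f"d with low noise:  {d_lab:.2f}")
    print(f"d with high noise: {d_street:.2f}")
```

With a threefold increase in the noise standard deviation, the expected d drops to a third of its original value, although the underlying raw effect is identical in both simulated samples.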

This last point highlights why it isn’t so straightforward to say “But I want my findings to generalize so the later estimate reflects the truth more accurately”. It really depends on what your research question is. If you want to measure an effect and make general predictions as to what this means for other human beings, then yes, you should test as wide a sample as possible and understand why any meaningful effect size is likely to be small. Say, for instance, we are testing the efficacy of a new drug and characterizing its adverse effects. Such experiments should be carried out on a wide-ranging sample to understand how differences between populations or individual background can account for side effects and whether the drug is even effective. You shouldn’t test a drug only on White Finnish men only to find out later that it is wholly useless or even positively dangerous in Black Caribbean women. This is not just a silly example – this sort of sampling bias can be a serious concern.

On the other hand, when you are testing a basic function of the perceptual system in the human brain, testing a broad section of the human species is probably not the wisest course of action. My confidence in psychophysical results produced by experienced observers, even if they are certifiably non-naive and anything-but-blind to the purpose of the experiment (say, because they are the lead author of the paper and coded the experimental protocol), can still be far greater than it would be for the same measurements from individuals recruited from the general population. There are myriad factors influencing the latter that are much more tightly controlled in the former. Apart from issues with fatigue and lack of practice with the experimental setting, inexperienced subjects may simply not know what to look for. If you cannot produce an accurate report of your perceptual experience, you aren’t going to produce an accurate measurement of it.

Now this is one specific example and it obviously does not have to apply to all cases. I am pretty confident that the Data Quality Decay Function exists. It occurs for research subjects but it could also relate to reagents that are being reused, small errors in a protocol that accumulate over time, and many other factors. In many situations the slope of the curve may be very shallow so that the decay is really non-existent. There are also likely to be other factors that may counteract and, in some cases, invert the function. For instance, if the follow-up experiments actually improve the methodology of an experiment the data quality might even be enhanced. This is certainly the hope we have for science in general – but this development may take a very long time.

The point is, we don’t really know very much about anything. We don’t know how data quality, and thus effect size estimates, vary with time, across samples, between different experimenters, and so forth. What we do know is that under the most common assumptions (e.g. Gaussian errors of equal magnitude across groups) the sample sizes we can realistically use are insufficient for reliable effect size estimates. The main implication of the Data Quality Decay Function is that the effect size estimates under standard assumptions are probably smaller than the true effect.

While I am quite a stubborn lady, as I said earlier, I am not so stubborn as to think this is the sole explanation. We know publication bias exists and so it is almost inevitable that it affects effect sizes in the literature. I also think that even if some of the procedures used to infer false positive rates and publication bias are based on untested assumptions and on logically flawed post hoc probabilities, they reveal some truths. All meta-science is wrong – but that doesn’t make it wholly worthless. I just believe we should take it with a grain of salt and treat it like all other science. In the long run, meta-science will self-correct.


Sometimes when you’re in a local minimum the view is just better

Self-correction is a fact

This brings me to the other point of today’s post, the claim that self-correction in science is a myth. I argue that self-correction is inherent to the scientific process itself. All the arguments against self-correction I have heard are based on another logical fallacy. People may say that the damn long time it took the scientific community to move beyond errors like phrenology or racial theories demonstrates that science does not by itself correct its mistakes. They suggest that because particular tools, e.g. peer review or replication, failed to detect serious errors or even fraudulent results, science itself does not weed out such issues.

The logical flaw here is that all of these things are effectively point estimates from a slow time series. It is the same misconception that leads people to deny that global temperatures are rising because we have had some particularly cold winters in some specific years in fairly specific corners of the Earth, or that leads creationists to claim that evolution has never been observed directly. It is why previous generations of scientists found it so hard to accept the thought that the surface of the Earth comprises tectonic plates drifting slowly over the mantle beneath them. Fortunately, science has already self-corrected that latter misconception, and seismology and plate tectonics are widely accepted theories well beyond the scientific community. Sadly, evolution and climate change have not arrived at the same level of mainstream acceptance.

It seems somewhat ironic that we as scientists should find it so difficult to understand that science is a gradual, slow process. After all we are all aware of evolutionary, geological, and astronomical time scales. However, in the end scientists are human and thus subject to the same perceptual limits and cognitive illusions as the rest of our species. We may get a bit of an advantage compared to other people who simply never need to think about similar spatial and temporal dimensions. But in the end, our minds aren’t any better equipped to fathom the enormity and age of the cosmos than anybody else’s.

Science is self-correcting because that is what science does. It is the constant drive to seek better answers to the same questions and to derive new questions that can provide even better answers. If the old paradigms are no longer satisfactory, they are abandoned. It can and it does happen all the time. Massive paradigm shifts may not be very frequent but that doesn’t mean they don’t happen. As I said last time, science does of course make mistakes and these mistakes can prevail for centuries. Using again my model-fitting analogy one would say that the algorithm gets stuck in a “local minimum”. It can take a lot of energy to get out of that but given enough time and resources it will happen. It could be a bright spark of genius that overthrows accepted theories. It could be that the explanatory construct of the status quo becomes so overloaded that it collapses like a house of cards. Or sometimes it may simply be a new technology or method that allows us to see things more clearly than before. Sometimes dogmatic, political, religious, or other social pressure can delay progress. For example, for a long time your hope of being taken seriously as a woman scientist was practically nil. In that case, what it takes to move science forward may be some fundamental change to our whole society.
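For readers who have not met the term, the local-minimum analogy can be shown with a toy fitting problem. This is purely an illustration of the metaphor: the error function `x**4 - 3*x**2 + x`, the learning rate, and the starting points are all arbitrary choices made for this sketch.

```python
def error(x):
    """Toy error landscape with one local and one global minimum."""
    return x ** 4 - 3 * x ** 2 + x


def gradient(x):
    """Derivative of the error function."""
    return 4 * x ** 3 - 6 * x + 1


def gradient_descent(start, learning_rate=0.01, n_steps=2000):
    """Follow the slope downhill from a single starting point."""
    x = start
    for _ in range(n_steps):
        x -= learning_rate * gradient(x)
    return x


def descent_with_restarts(starts):
    """Run gradient descent from several starting points, keep the best."""
    candidates = [gradient_descent(s) for s in starts]
    return min(candidates, key=error)


if __name__ == "__main__":
    stuck = gradient_descent(start=2.0)
    best = descent_with_restarts([-2.0, -1.0, 0.0, 1.0, 2.0])
    print(f"single run from x=2 ends at x={stuck:.2f}, error={error(stuck):.2f}")
    print(f"best of several runs is x={best:.2f}, error={error(best):.2f}")
```

A single run that starts on the right-hand side settles in the shallower valley and stays there, because every small step away from it makes the fit worse. Only a fresh starting point, the analogue of a new idea or a new method, lands in the deeper valley on the left.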

Either way, bemoaning the fact that replication and skeptical scrutiny haven’t solved all problems and managed to rectify every erroneous assumption and refute every false result is utterly pointless. Sure, we can take steps to ensure that the number of false positives is reduced but don’t go so far as to make it close to impossible to detect new important results. Don’t make the importance of a finding dependent on it being replicated hundreds of times first. We need replication for results to stand the test of time but scientists will always try to replicate potentially important findings. If nobody can be bothered to replicate something, it may not be all that useful – at least at the time. Chances are that in 50 or 100 or 1000 years the result will be rediscovered and prove to be critical and then our descendants will be glad that we published it.

By all means, change the way scientists are evaluated and how grants are awarded. I’ve said it before but I’ll happily repeat it. Immediate impact should not be the only yardstick by which to measure science. Writing grant proposals as catalogs of hypotheses when some of the work is inevitably exploratory in nature seems misguided to me. And I am certainly not opposed to improving our statistical practice, ensuring higher powered experiments, and encouraging strategies for more replication and cross-validation approaches.

However, the best and most important thing we can do to strengthen the self-correcting forces of science is to increase funding for research, to fight dogma wherever it may fester, and to train more critical and creative thinkers.

“In science it often happens that scientists say, ‘You know that’s a really good argument; my position is mistaken,’ and then they would actually change their minds and you never hear that old view from them again. They really do it. It doesn’t happen as often as it should, because scientists are human and change is sometimes painful. But it happens every day. I cannot recall the last time something like that happened in politics or religion.”

Carl Sagan