Most experimental work in behavioral and biomedical science involves group comparison – in the simplest case, an experimental and a control group. The group means are compared, and the variability within each group is used to estimate the probability that the difference between means occurred by chance. The most commonly used method, the null-hypothesis significance test (NHST), was developed by Ronald Fisher in a context that is worth discussing in a moment.
There are serious problems in moving from group data to the properties of individuals (a problem for the object of study: Staddon, 2019), but the first major problem encountered with the NHST method was replication, outlined in a remarkable article by John Ioannidis (2005). The replication problem is fixable, and various solutions have been proposed. In July 2017, a letter to Science (Benjamin et al.), signed by more than 70 statisticians, suggested that the solution to the NHST reproducibility problem is to set a criterion (alpha level) for rejecting the null (no-effect) hypothesis of p = .005, rather than Ronald Fisher's p = .05, the then standard. The authors argued that instead of accepting a one-in-twenty (or less) chance of error as good enough to conclude that your treatment has an effect, the standard should be raised to one in two hundred.
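The arithmetic behind the two criteria can be made concrete. When the null hypothesis is in fact true, p-values are uniformly distributed between 0 and 1, so the long-run false-positive rate of an experiment is simply the alpha level itself. The following short Python sketch (illustrative, not from the article; the function name and sample sizes are my own) simulates many null experiments and counts how often each criterion would wrongly declare "significance":

```python
import random

def false_positive_rates(alpha_levels, trials=100_000, seed=42):
    """Simulate many experiments in which the null hypothesis is TRUE.
    Under a true null, the p-value is uniform on [0, 1], so we can draw
    p-values directly and count how often each alpha criterion is met."""
    rng = random.Random(seed)
    pvals = [rng.random() for _ in range(trials)]
    return {a: sum(p < a for p in pvals) / trials for a in alpha_levels}

rates = false_positive_rates([0.05, 0.005])
# rates[0.05] is close to 0.05 (about 1 in 20 null results "significant"),
# rates[0.005] is close to 0.005 (about 1 in 200).
```

In other words, the proposed stricter alpha cuts the rate of chance "discoveries" by a factor of ten – which is the whole content of the Benjamin et al. proposal.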
Reproducibility would sometimes be improved by a stricter criterion; but such a small p-value would also invalidate many social-science findings that rest on NHST; the number of publishable results in the social and biomedical sciences would fall sharply. Partly for this reason, more than 80 scientists signed a November 2017 letter to Nature (Lakens et al., 2017) rejecting the Benjamin et al. proposal, recommending instead that the label "statistically significant" no longer be used, and concluding that "researchers should transparently report and justify all choices they make when designing a study, including the alpha [critical p-value] level."
Since then, the debate seems to have quieted down (for summaries, see Baker, 2016, and Staddon, 2017). There have been some mitigation efforts: pre-registration of hypotheses, use of larger groups, and so on. But the NHST method continues to be widely used.
The emphasis in both of these long letters is on the issue of replication. But a little thought shows that the Fisher method is, in fact, quite unsuitable for basic science. The reason is embarrassingly simple.
The NHST method was invented by Fisher when he was working in an applied setting – a setting in which a decision must be made, between two fertilizers or other treatments on an agricultural plot, for example. Each fertilizer had a certain estimated net benefit; the one with the significantly higher benefit was chosen. The cost of error – choosing the worse fertilizer, a false positive – is small and potentially measurable. In this situation, cost is not a problem. The only question to answer is: which fertilizer is probably better? For that choice, the 5 percent criterion is perfectly appropriate.
In basic science, the situation is very different: the choice is between "confirm" and "don't know," and the cost of error is much higher. The benefit of correctly confirming a true experimental hypothesis (i.e., rejecting the null hypothesis) is a modest contribution to knowledge. But the cost of the error of apparently confirming a hypothesis that is actually false (a false positive, a Type I error) can be very high, for both science and society. False positive results, just like scientific fraud, can have many detrimental effects (see, for example, Ritchie, 2020, and Andrew, 2021). Subsequent research, in some cases a great deal of research, will go down the rabbit hole, wasting time and probably generating further errors. And as recent reports (e.g., Randall et al., 2021) point out, the cost, both human and financial, of public policy based on scientific error can be enormous.
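The scale of the false-positive problem can be sketched with the kind of arithmetic Ioannidis made famous: among results declared "significant," the fraction that are actually false depends not only on alpha but also on statistical power and on how many of the hypotheses being tested are true in the first place. The numbers below (10% of hypotheses true, 50% power) are illustrative assumptions of mine, not figures from the article:

```python
def false_positive_share(prior_true, power, alpha):
    """Among results declared 'significant', what fraction are false
    positives?  Standard positive-predictive-value arithmetic:
      prior_true - fraction of tested hypotheses that are actually true
      power      - probability of detecting a true effect (1 - beta)
      alpha      - criterion for rejecting the null"""
    true_positives = prior_true * power
    false_positives = (1 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)

# If only 10% of tested hypotheses are true and power is 50%:
false_positive_share(0.10, 0.5, 0.05)   # ~0.47: nearly half of "findings" false
false_positive_share(0.10, 0.5, 0.005)  # ~0.08: far fewer, at the stricter alpha
```

On these (hypothetical) assumptions, nearly half of the "discoveries" published at p < .05 would be wrong – which is why the cost-of-error argument matters so much more in basic science than in Fisher's fertilizer trials.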
So the conclusion is simple: the Fisher method is suitable for choosing between two kinds of fertilizer; it is inadequate for deciding between truth and falsehood. It is simply wrong to rely on the NHST method in basic social or biomedical science.
What is the alternative? There is no obvious answer other than human ingenuity. Perhaps it is worth remembering that Hermann Ebbinghaus discovered the basic laws of memory using only one subject, himself, in studies that would not now meet the algorithmic standards of the NHST community. So perhaps: don't "Bend it like Beckham," but "Think like Ebbinghaus"?