How scientists massage results with ‘P-Hacking’


The pursuit of science is designed to search for meaning in a maze of data. At least that’s how it should work.

By some accounts, that facade began to crumble in 2010, when Cornell University social psychologist Daryl Bem published a 10-year analysis in the prestigious Journal of Personality and Social Psychology, apparently demonstrating with widely accepted statistical methods that extrasensory perception (ESP), essentially a "sixth sense," is an observable phenomenon. Bem's colleagues were unable to reproduce the paper's results and quickly blamed what we now call "p-hacking": the practice of massaging and overanalyzing your data in search of statistically significant, publishable results.


To support or refute a hypothesis, researchers aim to establish statistical significance by recording a "p-value" of less than 0.05, explains Benjamin Baer, a postdoctoral researcher and statistician at the University of Rochester whose recent work addresses this problem. The "p" in p-value stands for probability: it measures how likely a result at least as extreme as the one observed would be if chance alone (the null hypothesis) were at work.

For example, if you want to test whether all roses are red, you could count the red roses and the roses of other colors in a sample and run a hypothesis test comparing the counts. If that test produces a p-value of less than 0.05, you have statistically significant grounds for claiming that only red roses exist, even though evidence beyond your sample of flowers suggests otherwise.
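The rose test above can be sketched as a simple binomial hypothesis test. The sample counts and the 50/50 null rate below are hypothetical, chosen only to illustrate how a p-value falls out of the counting:

```python
from math import comb

def binom_p_value(k, n, p0=0.5):
    """One-sided binomial test: the probability of seeing k or more
    "red" flowers out of n if the true red rate were only p0 (the null)."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

# Hypothetical sample: 18 of 20 roses are red.
p = binom_p_value(18, 20)
print(p < 0.05)  # statistically significant at the usual threshold
```

A p-value this small only says the 50/50 null is a poor fit for this sample; it says nothing about roses outside the sample, which is exactly the gap p-hacking exploits.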

Abusing p-values to support the idea that ESP exists may be relatively harmless, but when the practice creeps into medical trials, it can have far deadlier results, Baer says. "I think the big risk is that the wrong decision can be made," he explains. "There's a big debate going on in science and statistics trying to figure out how to make sure that this process can run more smoothly and that the decisions are actually based on what they should be."

Baer was the first author of a paper published in late 2021 in the journal PNAS, together with his former Cornell mentor, statistics professor Martin Wells, exploring how new statistics can improve the use of p-values. The metric they examined, called the fragility index, is designed to complement and improve upon p-values.

This measure describes how vulnerable a result is to a few of its data points flipping from a positive to a negative outcome, for example, if a patient recorded as helped by a drug had actually felt no effect. If changing just a few such data points is enough to drop a result from statistically significant to not, the result is considered fragile.
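That flipping procedure can be sketched directly. This is a minimal illustration, not the authors' implementation: it assumes Fisher's exact test as the significance test, uses hypothetical trial counts, and flips outcomes only in the treatment arm (published fragility-index procedures choose the arm to modify more carefully):

```python
from math import comb

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table
    [[a, b], [c, d]] (events / non-events in two trial arms)."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    def prob(x):  # hypergeometric probability of x events in the first arm
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Two-sided p: sum over all tables no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs + 1e-12)

def fragility_index(events_t, n_t, events_c, n_c, alpha=0.05):
    """Smallest number of outcome flips in the treatment arm that pushes
    a significant result (p < alpha) to non-significance."""
    for flips in range(n_t - events_t + 1):
        p = fisher_exact_p(events_t + flips, n_t - events_t - flips,
                           events_c, n_c - events_c)
        if p >= alpha:
            return flips
    return None

# Hypothetical trial: 1/10 events under treatment vs. 9/10 under control.
print(fragility_index(1, 10, 9, 10))
```

A small fragility index means only a handful of patients' outcomes stand between "significant" and "not significant," which is the instability the index is meant to expose.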

[Figure: p-value curve]
Physician Michael Walsh originally proposed the fragility index in a 2014 paper in the Journal of Clinical Epidemiology. In it, he and his colleagues applied the fragility index to just under 400 randomized controlled trials with statistically significant results and found that one in four had low fragility scores, meaning their findings may not actually be very reliable or robust.

However, the fragility index has not yet gained much traction in medical trials. Critics of the approach have emerged, such as Rickey Carter of the Mayo Clinic, who says it is too similar to p-values without offering enough improvement. "The irony is that the fragility index is a p-hacking approach," says Carter.


To address these criticisms, Baer, Wells, and colleagues focused on improving two main elements of the fragility index: considering only sufficiently likely modifications of the data, and generalizing the approach to work beyond 2×2 binary tables (which tabulate positive and negative outcomes for a control group and an experimental group).

Despite the uphill battle the fragility index has faced so far, Baer says he still believes it is a useful metric for medical statisticians, and he hopes the improvements in their recent work will help convince others of the same.

"Talking to the victim's family after a botched operation is a very different [experience] than statisticians sitting at their desks doing math," Baer says.
