P-hacking (also called data dredging, data fishing, or data snooping) is the misuse of data analysis to find patterns that can be presented as statistically significant when in fact there is no real underlying effect. It typically works by exhaustively searching many combinations of variables for correlations and reporting only the comparisons that come out significant.
The problem occurs when someone unduly influences the data collection process or the statistical analysis in order to produce a statistically significant result. One way to influence the analysis is to choose which factors to control for and which observations to record and compare. Because there is no single right way to proceed, researchers often make these choices on the fly until they obtain the result they were looking for.
P-hacking is a major issue in published research. There is growing concern that many published findings are false positives. The incentives to publish statistically significant results are strong: many employers and funders assess a researcher's performance by counting papers and weighting them by the journal's prestige. Because there is very little incentive to replicate research, false positive results that enter the literature can be very persistent.
Because researchers rely heavily on the p-value to establish the significance of their results, let's first understand how to interpret one.
What is a p-value?
In statistics, when you perform a hypothesis test, the calculated p-value helps you determine the significance of your results. Let's look at a quick overview of hypothesis testing to understand what a p-value signifies.
Specify the null and alternative hypotheses
Let’s say the claim is that the average time to cook instant noodles is 2 minutes. Then the null hypothesis is “It takes two minutes to cook instant noodles”. If we want to prove/disprove the claim, we would conduct a hypothesis test to challenge the null hypothesis. Our alternative hypothesis may be “It takes more than two minutes to cook instant noodles” (It may also be “less than two minutes” depending on what we want to prove).
Calculate the test statistic and corresponding p-value
Formally, the p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. It is a number between 0 and 1 and is conventionally interpreted the following way:
- A small p-value (p < 0.05) indicates strong evidence against the null hypothesis, so we reject the null hypothesis
- A large p-value (p > 0.05) indicates weak evidence against the null hypothesis, so we fail to reject the null hypothesis
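To make this concrete, here is a minimal sketch of the noodle example as a one-sample t-test in SciPy (the `alternative` keyword needs SciPy 1.6 or later). The cook-time measurements are invented purely for illustration:

```python
from scipy import stats

# Hypothetical cook-time measurements (minutes) for 10 packets of noodles.
# These numbers are made up for illustration.
cook_times = [2.1, 2.4, 1.9, 2.3, 2.6, 2.2, 2.0, 2.5, 2.3, 2.4]

# H0: mean cook time is 2 minutes; H1: mean cook time is more than 2 minutes
t_stat, p_value = stats.ttest_1samp(cook_times, popmean=2.0, alternative="greater")

if p_value < 0.05:
    print(f"p = {p_value:.4f}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f}: fail to reject the null hypothesis")
```

With these (fabricated) measurements the sample mean is well above 2 minutes, so the test would reject the null; with data scattered evenly around 2 minutes, it would not.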
Now that we have a general understanding of a p-value, let’s look at an example of p-hacking.
Example of p-hacking
The comic strip above illustrates a case of p-hacking. Here is a summary of the example for a clearer understanding:
Imagine a news reporter trying to find out whether eating jelly beans causes acne. To test the claim, scientists conduct an experiment and find no significant relationship (p > 0.05) between jelly bean consumption and acne.
Then the reporter revises the claim: acne must depend on the flavor of jelly bean consumed. So the scientists run the experiment separately for 20 flavors of jelly beans. Nineteen of the flavors show no significant relationship, but purely by chance, the test for green jelly beans shows a significant correlation between consumption and acne breakouts.
Using this single result, the newspaper runs the headline "Green Jelly Beans Linked to Acne! 95% Confidence. Only 5% Chance of Coincidence!". The comic makes a critical point: run enough comparisons and statistics can be made to show absurd correlations. At a 5% significance level, each individual test has a 5% chance of a false positive, so across 20 independent tests the chance of at least one spurious "discovery" is 1 - 0.95^20, roughly 64%.
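The multiple-comparisons trap is easy to verify by simulation. Under the null hypothesis, a valid test's p-value is uniformly distributed on [0, 1], so we can model each flavor's test as a uniform random draw. This is a simplified sketch, not a model of any real experiment:

```python
import random

random.seed(0)

def run_study(n_flavors=20, alpha=0.05):
    """Simulate one jelly-bean study where the null hypothesis is true for
    every flavor (each p-value is uniform on [0, 1]). Return True if at
    least one flavor comes out 'significant' purely by chance."""
    return any(random.random() < alpha for _ in range(n_flavors))

n_studies = 10_000
false_alarms = sum(run_study() for _ in range(n_studies))
rate = false_alarms / n_studies
print(f"Fraction of studies with at least one spurious 'discovery': {rate:.2f}")
```

The simulated fraction lands near the analytic value 1 - 0.95**20 ≈ 0.64, which is why testing 20 flavors and reporting only the green one is so misleading.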
An effective way to avoid p-hacking is to refrain from making any selections or tweaks after seeing the data. In the jelly bean example, the initial hypothesis was "Consumption of jelly beans causes acne"; this hypothesis should not have been tweaked after it was tested. A detailed research plan that lists the hypotheses to be tested and the statistical analyses to be performed should be a prerequisite for conducting an experiment.
Another remedy for p-hacking is cross-validation. In this method, the researcher collects the data and partitions it into two subsets, A and B. Only subset A is examined to generate hypotheses, which are then tested on subset B, which played no part in forming them. Only when B also supports a hypothesis generated from A is it reasonable to believe that the hypothesis is valid.
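The split-sample idea can be sketched in a few lines of Python. The dataset, flavors, and acne rates below are all fabricated for illustration, and the "hypothesis" is simply picking the flavor with the highest apparent acne rate in subset A, then checking whether it still stands out in subset B:

```python
import random

random.seed(42)

# Hypothetical dataset of (flavor, got_acne) records; every flavor actually
# has the same 30% acne rate, so any "winner" in subset A is noise.
flavors = ["green", "red", "blue", "purple"]
data = [(random.choice(flavors), random.random() < 0.3) for _ in range(400)]

# Partition into an exploration set A and a held-out confirmation set B.
random.shuffle(data)
half = len(data) // 2
set_a, set_b = data[:half], data[half:]

def acne_rate(records, flavor):
    """Fraction of records for this flavor where acne occurred."""
    hits = [got_acne for f, got_acne in records if f == flavor]
    return sum(hits) / len(hits) if hits else 0.0

# Explore on A only: pick the flavor with the highest apparent acne rate.
candidate = max(flavors, key=lambda f: acne_rate(set_a, f))

# Confirm on B, which was never used to choose the candidate.
print(f"Candidate flavor from A: {candidate}")
print(f"Acne rate in A: {acne_rate(set_a, candidate):.2f}")
print(f"Acne rate in B: {acne_rate(set_b, candidate):.2f}")
```

Because the effect found in A is pure noise, its acne rate in the untouched subset B typically falls back toward the true 30% baseline, and the "finding" fails to replicate.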