# Poopooing the P-Value

Suppose you are comparing a new medication against an old medication, both designed to lower cholesterol. The old medication lowers cholesterol by an average of 50 points, and the new medication lowers it by an average of 75 points. You think the new medication is better until you see that the p-value for the comparison of the two groups is 0.051, which is larger than the “statistically significant” cutoff of 0.05.

You throw away your conclusion and stay with the old medication, right?

### Yeah, Maybe

When I was working on my dissertation, I was trying to build a statistical model where the expected number of homicides per resident in any given neighborhood in Baltimore was predicted by one or more variables. I included variables one at a time and checked whether each was statistically significant at a p-value cutoff of 0.05. Those that were significant would move on to form the bigger model, while those that were not would be set aside.

If I had followed that procedure to the letter, I would have been left with a model with only two variables: Poverty and Disorder. However, I decided to leave in one variable that was not statistically significant by itself or in combination with other variables: the average number of homicides in neighborhoods bordering the neighborhood in question.

### The Black Box of Biostatistics

As I told you before, biostatistics can be somewhat of a black box to those who don’t understand the underlying theory of why numbers come out the way they do. You put something in and you get something out, but you don’t really understand what happened in the process. That’s the definition of a black box.

Not to be confused with the ’90s dance group “Black Box,” by the way.

Anyway, not a lot of healthcare providers are trained fully in the theory and practice of biostatistics. They go to one conference after another where papers and studies are presented but only if the studies found some difference between their study groups that was statistically significant. Anything less than that and the study probably doesn’t even see the light of day.

If the study findings do see the light of day, they’re presented without much of the underlying data. (Good luck trying to get any researcher to share “their” data.) So all you’re left with to make a clinical (or public health) decision is what the study says. If it kind of jibes with what you know, or believe, to be true, then you go with it. If the findings are revolutionary, you kind of want to see more evidence. All the while, you have no real clue as to how or why or if the data said what the researchers said it did.

You kind of go for it on faith, right?

### You’re in Luck!

Lucky for you, biostatisticians alone don’t do all the research, analysis, interpretation and policy decisions. A good research project has a multidisciplinary team. A good paper writes out the origins of the idea for the study, the background information you need, what others have done and the source of the data. And a good clinician/practitioner uses their scientific knowledge, the entirety of the evidence and their own best judgment to decide on whether or not to go with a study’s findings.

### What About P?

There’s a nascent movement to change our approach to p-values and similar statistical measurements because they’ve become too difficult to deal with when deciding on a study’s findings. From an article in *Nature*:

> Again, we are not advocating a ban on P values, confidence intervals or other statistical measures — only that we should not treat them categorically. This includes dichotomization as statistically significant or not, as well as categorization based on other statistical measures such as Bayes factors.
>
> One reason to avoid such ‘dichotomania’ is that all statistics, including P values and confidence intervals, naturally vary from study to study, and often do so to a surprising degree. In fact, random variation alone can easily lead to large disparities in P values, far beyond falling just to either side of the 0.05 threshold. For example, even if researchers could conduct two perfect replication studies of some genuine effect, each with 80% power (chance) of achieving P < 0.05, it would not be very surprising for one to obtain P < 0.01 and the other P > 0.30. Whether a P value is small or large, caution is warranted.
>
> We must learn to embrace uncertainty. One practical way to do so is to rename confidence intervals as ‘compatibility intervals’ and interpret them in a way that avoids overconfidence. Specifically, we recommend that authors describe the practical implications of all values inside the interval, especially the observed effect (or point estimate) and the limits. In doing so, they should remember that all the values between the interval’s limits are reasonably compatible with the data, given the statistical assumptions used to compute the interval. Therefore, singling out one particular value (such as the null value) in the interval as ‘shown’ makes no sense.
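
The replication example in that passage can be checked with a little arithmetic. What follows is only a sketch under a simplifying assumption that is mine, not the article’s: each study’s test statistic is treated as a normal z-score with standard deviation 1, shifted just far enough from zero to give 80% power at a two-sided 0.05 cutoff. The names `shift` and `prob_p_below` are made up for the illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, sd 1

alpha = 0.05   # two-sided significance cutoff, from the quoted passage
power = 0.80   # power of each replication study, from the quoted passage

# Assumption: each study's test statistic is z ~ Normal(shift, 1).
# 80% power at two-sided alpha = 0.05 pins down the shift, ignoring the
# negligible chance of a "significant" result in the wrong direction.
shift = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)
z_dist = NormalDist(mu=shift, sigma=1)


def prob_p_below(p_cut):
    """Chance that one study's two-sided p-value falls below p_cut."""
    z_cut = std_normal.inv_cdf(1 - p_cut / 2)
    return (1 - z_dist.cdf(z_cut)) + z_dist.cdf(-z_cut)


p_small = prob_p_below(0.01)       # a study lands at P < 0.01
p_large = 1 - prob_p_below(0.30)   # a study lands at P > 0.30
# Either of the two studies can be the "small" one, hence the factor of 2.
p_split = 2 * p_small * p_large

print(f"Sanity check, P(p < 0.05):       {prob_p_below(0.05):.3f}")
print(f"P(p < 0.01) for one study:       {p_small:.3f}")
print(f"P(p > 0.30) for one study:       {p_large:.3f}")
print(f"One study < 0.01, other > 0.30:  {p_split:.3f}")
```

Under these assumptions, a single study of a genuine effect lands below 0.01 roughly 59% of the time and above 0.30 roughly 4% of the time, with nothing changing except random variation.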

It’s true. People see a p-value of 0.051 and discard the entirety of the study’s findings, as I’ve written above. But I particularly like the part about embracing uncertainty, although I don’t think they mean uncertainty as in not knowing something. I think they mean uncertainty as in “change one parameter and everything changes… We can’t measure everything, or everyone, after all.”
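
To make the “compatibility interval” idea concrete, here is a rough sketch using the cholesterol example from the top of this post. The only inputs are the 25-point difference and the p-value of 0.051. The big assumption, which is mine, is that the p-value came from a two-sided z-test (a normal approximation); with that, the standard error and a 95% interval can be backed out.

```python
from statistics import NormalDist

std_normal = NormalDist()

difference = 75 - 50   # extra points of cholesterol lowering, from the example
p_value = 0.051        # two-sided p-value, from the example

# Assumption: the p-value came from a two-sided z-test (normal approximation),
# so the observed z-score and the standard error can be recovered from it.
z_observed = std_normal.inv_cdf(1 - p_value / 2)
standard_error = difference / z_observed

# 95% interval around the observed 25-point difference.
z_95 = std_normal.inv_cdf(0.975)
lower = difference - z_95 * standard_error
upper = difference + z_95 * standard_error

print(f"Implied standard error: {standard_error:.2f} points")
print(f"95% interval: {lower:.2f} to {upper:.2f} points")
```

Under that assumption the interval runs from just below zero to about 50 points. Zero is inside it, but only barely, and so is a benefit twice as large as the one observed. Reading the whole interval tells you a lot more than “not significant” does.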

Take, for example, what happened with a rotavirus vaccine back in the late 1990s. After the vaccine was licensed, physicians began reporting cases of intussusception, a condition in which the intestine folds in on itself. They reported enough cases to warrant further investigation. (Intussusception is life-threatening if left untreated.) Post-licensure surveillance and case-control studies found that an increased risk of intussusception was associated with getting the vaccine, so the vaccine was withdrawn from the market.

A different vaccine manufacturer was ready to pull the plug on its rotavirus vaccine, but a scientist within the company convinced the bosses to go ahead. He was that confident in his vaccine. However, to avoid missing an adverse event if there was one, they were going to need thousands more children in the clinical trial… Tens of thousands, actually.

This was because the first clinical trial saw no difference in the two groups: vaccinated and unvaccinated. It saw no difference partly because the size of the groups was not big enough to see cases of intussusception that occur so rarely to begin with. To see just one case of **naturally** occurring intussusception, you’d have to observe between 17,000 and 100,000 children. So the next trial had to have a lot of children in the control and intervention groups in order to see if there was a difference.
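
The arithmetic behind that is simple enough to sketch. The background rates below come straight from the range above (one natural case per 17,000 to 100,000 children). The group sizes are hypothetical, picked only for illustration, and are not the actual trial sizes. The sketch also assumes each child’s risk is independent of the others and constant.

```python
# Background rates implied by the text: one naturally occurring case
# per 17,000 children (high end) to one per 100,000 children (low end).
rate_high = 1 / 17_000
rate_low = 1 / 100_000


def prob_at_least_one(n_children, rate):
    """Chance of seeing one or more cases among n_children.

    Assumes every child has the same, independent risk.
    """
    return 1 - (1 - rate) ** n_children


# Hypothetical group sizes for illustration only (not the real trials).
group_sizes = [1_000, 10_000, 35_000]

results = {}
for n in group_sizes:
    results[n] = (prob_at_least_one(n, rate_low), prob_at_least_one(n, rate_high))
    low, high = results[n]
    print(f"{n:>6,} children: chance of at least one natural case "
          f"between {low:.1%} and {high:.1%}")
```

With a thousand children in a group, you would most likely see no cases at all, vaccine or no vaccine, so there is nothing to compare. Only with tens of thousands per group does seeing a case become likely, and detecting a difference between groups takes more children still.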

In that second vaccine trial, there was no detected increase in the risk of intussusception between the groups, and the intervention group had a lower incidence of rotavirus. Still, post-marketing surveillance was very active in trying to detect any issues similar to those seen with the previous vaccine. So far it hasn’t. But there is still an increased risk of intussusception if you get rotavirus **or** the vaccine, compared to not getting either. (But that’s a discussion for a later time.)

### See What I Mean?

On the one hand, giving a dichotomous (yes/no) value to a p-value or a confidence interval makes it easy to make a decision on whether or not the numbers you just crunched make sense. On the other hand, medical and public health decisions are much more complex than just a dichotomous decision. We’re all similar, but we’re not. We all react the same way to medications, for the most part. And it’s in dissecting those dissimilarities that we get in trouble over the dichotomous nature of how we interpret p-values.

Maybe in the future there will be a better way to make decisions on statistical analyses than to just reject or fail to reject the null hypothesis… Until then, decision makers need to be better equipped to understand how the data on which they’re basing their decisions were collected and analyzed, and why and how the statistics that were computed came out the way they did.