Written by Clay Smith

**Spoon Feed**

A p-value is the probability that an observed outcome (or more extreme outcome) would occur if, in fact, the null hypothesis were true. Say what?? It’s tricky to explain! This post explains what a p-value is in simple language.

One of the most popular posts on JournalFeed, by far, is Idiot’s Guide to Odds Ratios. I think that’s because I wrote it with a view to understand the subject myself. I am – what you might call – computationally challenged. In other words, I’m not great at math. And I think that’s a blessing in disguise, because for me to understand a mathematical concept, I have to distill it to the “explain-it-like-I’m-five” level. So, I am hoping to regularly bring you *Stats for Average People*. For those of you who want a refresher or basic primer on some core stats concepts, explained at a level we can all understand, welcome! This is a feature length post and not the usual type of Spoon Feed.

When we read research, we need to know if the results are statistically significant. But what does it mean to be statistically significant? Who determines what is significant? How is this calculated? Today, we are going to focus on one form of determining statistical significance, the p-value.

## The Truth Is Out There

First of all, let’s talk about why statistical significance matters. Most of us believe there is objective truth in the natural world. Drug X really is better than placebo. Technique Y really is a more effective way to place an endotracheal tube. The problem is, we have a very hard time determining the truth, because our ability to measure things is subject to random – or not so random – error. The field of statistics exists, in part, to help us to evaluate truth claims about the natural world using numerical evidence.

Chance happens, but in the field of medicine, we don’t want to implement something in practice only to find out it was a fluke, an accident. In fact, we want to make sure there is a low probability that what happened in an experiment was due to chance before we use a test, give a drug, or perform a procedure on our patients.

There are different ways to quantify the evidence for or against a hypothesis. As mentioned, we are focusing on one way in this post: the p-value.

Let’s think about what might happen in research. Pretend we have a new anticoagulant for acute pulmonary embolism, and we are measuring an important, patient-centered outcome – mortality. When you study this drug, you might find that compared to placebo, there is no difference in mortality. Or you might find that there is a difference in mortality. It seems there are only 2 possible outcomes. But, in reality, you might *truly* conclude there was no difference, or you might *falsely* conclude there was no difference vs placebo. In addition, you might *truly* conclude there was a difference or *falsely* conclude there was a difference. In other words, because chance happens and errors in measurement occur, there are 4 potential outcomes.

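We could show this in a 2×2 table:

| | Truth: there is a difference | Truth: there is no difference |
|---|---|---|
| **Study concludes there is a difference** | True conclusion (true positive) | False conclusion (false positive, a type I error) |
| **Study concludes there is no difference** | False conclusion (false negative, a type II error) | True conclusion (true negative) |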

## The Language of Hypothesis Testing

In my mind, p-values feel confusing because they seem to go about things backwards. When I make a hypothesis, I usually say *there is a difference* between groups with an exposure (or predictor) and an outcome (or effect). But when we think about what p-value means, we must first assume *there is no difference* between exposure and an outcome. This is called the null hypothesis, often abbreviated H_{0}. The null hypothesis says there is no difference between the exposed group and control group in the outcome or there is no association between the exposure and outcome. Although this seems backwards at first, the best parallel is the U.S. criminal justice system. A jury assumes a defendant is innocent until there is overwhelming evidence (beyond a reasonable doubt) that this is not true. Then the jury must reject the assumption of innocence and accept the alternative – the defendant is guilty. In the same way, we speak of “rejecting the null hypothesis,” a double negative. In other words, to reject the null hypothesis is to say, “I assumed there would be no difference. But I was wrong, and there was enough evidence to conclude that there *was* a difference.” To reject the null hypothesis is also to accept the alternative hypothesis (often abbreviated H_{a} or H_{1}). Whew! In contrast, to accept the null hypothesis (strictly speaking, to fail to reject it) is to say that we assumed there would be no difference, and our experiment did not find enough evidence to conclude otherwise. Just as a “not guilty” verdict does not prove innocence, this does not prove there is no difference. Wouldn’t it be nice to know how strong the evidence is against the null hypothesis? This is where the p-value comes into play.

## What is a p-value?

A p-value helps us determine if the null hypothesis is valid or not. In other words, a p-value gives us a way to assess the strength of evidence against the null hypothesis.

A detailed definition would be this: A p-value is the probability of obtaining an observed outcome, or a more extreme outcome, assuming the null hypothesis is true. You see, this is why teaching about p-values is hard. It’s hard to explain and hard to understand. The p-value includes the observed effect and the entire “tail” of more extreme effects. So, it is not a point estimate. It is an area under the curve. See figure below. Again, consider the analogy of criminal justice – innocent until ‘proven’ guilty. In a criminal trial, we are trying to answer the question – assuming this person is innocent, what is the probability we would see such extreme criminal evidence (or more extreme) if that were true? A p-value starts from an assumption of no effect and quantifies how probable it would be to observe such an extreme effect in our research if that assumption were true.

Let’s put this in terms of an example. If we had two cohorts of PE patients, and the anticoagulant group had 2% mortality and placebo group had 9% mortality (difference 7%; p=0.04), then we would say it like this. The p-value in this scenario means: There is a 4% probability that we would see this observed mortality difference (namely a 7% or more extreme difference) if the truth of the universe is that there is no difference in drug vs placebo. This means it is improbable we would see such a result in a world in which drug = placebo. It’s improbable enough, and there is enough evidence against drug = placebo that we would likely conclude that drug ≠ placebo.

The p-value can range from 0 to 1. The closer it is to zero, the more improbable the observed result would be if the null hypothesis were true. The smaller the p-value, the stronger the evidence against the null hypothesis. The larger the p-value, the weaker the evidence against the null hypothesis. By convention, we usually arbitrarily set a p-value of 0.05 as the threshold of statistical significance. Thus, a p-value ≥0.05 is not considered statistically significant. A p-value <0.05 is considered statistically significant. To use the criminal justice analogy, 0.05 is the agreed upon cut point below which we consider the result “beyond a reasonable doubt.” Why did we all arbitrarily decide 0.05 was the right number? Well, Ronald Fisher (of Fisher’s exact test fame), a British statistician, just stated it as his preference, and it became widely accepted. He wrote in 1926, “If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty or one in a hundred. Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance…”. His paper dealt with farming and a hypothetical scenario of whether or not putting manure on a field would increase crop yields. That’s right, a p-value of 0.05 came from a discussion of poo.

Graphically, a p-value looks like this.

The graphic above shows a two-tailed p-value. Most of the time in medicine an experimental result could have a positive or negative effect; it could help or harm. A coin could be heads or tails. So, most of the time, a p-value reflects a two-tailed calculation. With a significance threshold of 0.05, 0.025 of that is on the negative tail and 0.025 is on the positive tail. As an exercise, imagine a coin toss. If the coin is fair, we expect 50% heads and 50% tails. If we flip the coin and get heads 5 times in a row, the probability of that outcome is 0.5^5 = 0.03125, roughly 3%. So, a one-tailed p = 0.03125. However, think of what is happening with the coin. Our null hypothesis is that it is a fair coin and could equally land on heads or tails, and a run of tails would be just as surprising as a run of heads. So, a two-tailed p-value is in order here. The probability of getting heads 5 times in a row is 0.03125, but the probability of getting tails 5 times in a row is also 0.03125. So, the two-tailed p = 0.0625. So, if we got heads 5 times in a row, and we determined in advance that 0.05 was our level of statistical significance, we would not reject the null hypothesis in this case.
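The coin example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function names are made up for illustration, and the two-tailed shortcut (doubling the smaller tail) assumes a fair coin, where the distribution is symmetric.

```python
from math import comb

def binomial_tail_p(heads: int, flips: int, p_heads: float = 0.5) -> float:
    """One-tailed p-value: probability of `heads` or more heads in `flips`
    tosses, assuming the null hypothesis P(heads) = p_heads."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )

def two_tailed_p_fair_coin(heads: int, flips: int) -> float:
    """Two-tailed p-value for a fair coin: double the smaller tail, capped at 1."""
    upper = binomial_tail_p(heads, flips)          # P(X >= heads)
    lower = binomial_tail_p(flips - heads, flips)  # P(X <= heads), by symmetry
    return min(1.0, 2 * min(upper, lower))

one_tailed = binomial_tail_p(5, 5)
two_tailed = two_tailed_p_fair_coin(5, 5)
print(one_tailed)  # 0.03125
print(two_tailed)  # 0.0625
```

With a pre-set threshold of 0.05, the one-tailed value would be called significant and the two-tailed value would not, which is why the choice of tails has to be made before looking at the data.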

## Why does a low p-value matter?

So, why does a small p-value, say <0.05, matter? It matters because we want to know the truth. But we can’t know the truth in the natural world with 100% certainty; we can’t *prove* the alternative hypothesis. But we can quantify the level of uncertainty. If a measured outcome is highly improbable under the assumption the null hypothesis is true, we reject that notion and act accordingly. If a new anticoagulant has a real effect – a true effect – on mortality, then we want to give this drug to our patients with PE. If we designed our experiment well, we should find that there is strong evidence against the null hypothesis (drug = placebo is incorrect), and we should start using the drug on our patients. We need to be reasonably confident that an effect is true before treating a patient, because in the real world, drugs sometimes have side effects and always cost money. We don’t want to give a drug that doesn’t work, may cause harm, and will cost a patient money if, in fact, there is even a moderate likelihood that an observed drug effect is due to chance alone. Again, and this is extremely important to note: due to random errors in measurement and biological variability, we can’t prove a treatment works. We could always be wrong. This is why I’m cautious when prescribing a brand new drug or using a new test. The hallmark of good science is the ability to repeat a study under similar conditions and get similar results. It is very reassuring when multiple studies in diverse patient populations, with clinically meaningful, patient-centered outcomes all show a statistically significant effect. We still don’t have proof, but it is very strong evidence against the null hypothesis and affirms we are probably doing the right thing for our patients.

## How to calculate a p-value

I’m going to give the basic steps, because the specific calculation depends on the type of hypothesis test used.

- Determine the null hypothesis and alternative hypothesis.
- Take the data from your research and calculate a test statistic (e.g. a t-statistic or z score for a mean). The test statistic locates the observed effect on the bell curve; the more extreme effects lie further out in the tail.
- Use tables or software to calculate the p-value which corresponds to the test statistic used.
- If the calculated p-value is less than alpha, which was determined in advance, then you reject the null hypothesis. If p ≥ alpha, you do not reject (informally, you “accept”) the null hypothesis.
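The steps above can be sketched in Python for the made-up PE trial. The sample sizes here (100 patients per group) are an assumption for illustration only, since the example gives percentages but no counts, so the result is not expected to reproduce p = 0.04 exactly. The sketch uses a two-proportion z-test with a normal approximation; with event counts this small, an exact test would usually be preferred in real life.

```python
from math import erfc, sqrt

def two_proportion_z_test(deaths_a: int, n_a: int, deaths_b: int, n_b: int):
    """Two-sided z-test for a difference in proportions.
    Null hypothesis: both groups share the same true mortality."""
    p_a, p_b = deaths_a / n_a, deaths_b / n_b
    pooled = (deaths_a + deaths_b) / (n_a + n_b)   # mortality if H0 is true
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error               # the test statistic
    p_value = erfc(abs(z) / sqrt(2))               # area in both tails
    return z, p_value

ALPHA = 0.05  # chosen before looking at the data

# Hypothetical counts: 2/100 deaths on drug, 9/100 deaths on placebo.
z, p_value = two_proportion_z_test(2, 100, 9, 100)
reject_null = p_value < ALPHA
print(f"z = {z:.2f}, p = {p_value:.3f}, reject H0: {reject_null}")
```

The `erfc` call converts the z score into the two-tailed area under the standard normal curve, which is the job that tables or statistical software do in the third step.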

## Pfallacies – What a p-value is *not*

Here are some common misunderstandings about the p-value. Since a p-value is not that intuitive, it’s important to not only define what a p-value is but what it’s not.

**Statistical ≠ clinical significance**

Just because a result is statistically significant (p < 0.05) does not mean it is clinically significant. If a new pain reliever is studied, and the research finds that there is a statistically significant 2 mm reduction in a 0-100 mm visual analog pain scale, we might be tempted to use the new drug, assuming it works. But what if I told you that previous studies have shown that there is no clinically significant difference in patient functional status unless there is a 10 mm reduction in pain score? The study may have found a statistical difference, but a 2 mm pain score reduction is not clinically meaningful. That is, it doesn’t help the patient.
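A small sketch shows how a clinically trivial difference can become statistically significant just by enrolling more patients. The standard deviation (25 mm) and the group sizes are hypothetical numbers picked for illustration, and the calculation assumes a simple z-test with a known, shared standard deviation.

```python
from math import erfc, sqrt

def p_for_mean_difference(difference: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value for an observed difference in means between two
    equal-sized groups, assuming a known common standard deviation."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = difference / standard_error
    return erfc(abs(z) / sqrt(2))

OBSERVED_DIFFERENCE_MM = 2.0       # same 2 mm reduction in both scenarios
CLINICALLY_MEANINGFUL_MM = 10.0    # threshold quoted in the text
ASSUMED_SD_MM = 25.0               # hypothetical spread of pain scores

p_small_study = p_for_mean_difference(OBSERVED_DIFFERENCE_MM, ASSUMED_SD_MM, 50)
p_large_study = p_for_mean_difference(OBSERVED_DIFFERENCE_MM, ASSUMED_SD_MM, 2000)

print(f"n = 50 per group:   p = {p_small_study:.3f}")
print(f"n = 2000 per group: p = {p_large_study:.3f}")
print("clinically meaningful:", OBSERVED_DIFFERENCE_MM >= CLINICALLY_MEANINGFUL_MM)
```

The effect is identical in both runs; only the sample size changes. The p-value reflects how much data there is as well as how big the effect is, so it cannot tell you on its own whether the effect matters to the patient.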

**Highly significant?**

Next, a very small p-value is often referred to as highly significant. Actually, a very small p-value means that, given the outcome, there is strong evidence against the null hypothesis. But it says nothing about the significance of the finding. Significance has to do with what we determine in advance is our alpha threshold (0.05, 0.01, etc) and whether or not the results are clinically meaningful to patient-centered outcomes (see above paragraph).

**P = 0.05 is arbitrary, not absolute**

It’s important to acknowledge that p = 0.05 is arbitrarily set as the threshold of statistical significance. A study that has a result of p = 0.06 may still be worth implementing into practice, depending on the level of risk to the patient of the intervention in question and the level of benefit they might gain. On the other hand, a highly risky, less beneficial intervention may have a p < 0.05, but you may want an even more stringent threshold of significance before subjecting a patient to that intervention. Just because a p-value exceeds 0.05 doesn’t mean there is no difference. It just means the evidence against the null hypothesis is not as strong.

**There was a “trend” toward statistical significance**

I see this one all the time – a p-value is 0.06, and the author will conclude there was a “trend” toward significance. That is a meaningless statement. A trend implies directionality upon repeated observation, that something is generally changing in a direction over time. A p-value of 0.06 is a single calculation based on observed data. It’s not a trend. The author could truthfully say the results nearly met the arbitrary, *a priori* alpha threshold for statistical significance, but they can’t say it’s a trend.

**A p-value is NOT the probability the null hypothesis is true.**

So, if one found p = 0.04, they might be tempted to think that this means there is only a 4% chance the null hypothesis is true. By definition, a p-value assumes a 100% probability the null hypothesis is true. Rather, a p-value is the probability of getting the observed outcome or more extreme outcome if the null hypothesis is true. A p-value assumes a universe in which the null hypothesis is true and then gives the strength of evidence against it, given the actual observed outcome after the research has been conducted.

**A p-value is NOT the likelihood of the observed result being due to chance.**

This is a common misconception. To say, for example, that a p-value of 0.04 means there is a 4% probability the observed outcome was due to chance is not what a p-value measures. That statement implies the p-value is somehow a measure of randomness, which it’s not. Rather, a p-value is the probability of getting the observed (or more extreme) outcome if the truth of the universe is that the null hypothesis is true. To be sure, the p-value is impacted by randomness. It’s just not a measure of randomness.

Concretely, using the prior example study, we would say the probability we would see a 7% mortality difference (or one more extreme), if the truth is that there was no difference in anticoagulant and placebo, is 4%. But it’s not the same to say that 7%, p = 0.04 means there is a 4% probability that the difference of 7% was due to chance or randomness. The p-value is always in reference to the null hypothesis.

**A p-value is NOT the probability that you rejected the null hypothesis when, in fact, it was true.**

Again, this is not quite right. It’s the likelihood of getting the observed (or more extreme) result if, in fact, the null hypothesis is true. You see the difference? We usually reject the null hypothesis when p <0.05 by convention. A p-value is not the probability that we reject the null hypothesis incorrectly. Rejecting the null hypothesis is a decision based on an arbitrary cut point, usually 0.05. A p-value gives the level of evidence against the null hypothesis.

## What is the difference: p-value and alpha?

You might have noticed that the usual cut point for statistical significance is a p-value of 0.05, and the typical alpha when calculating a sample size is also 0.05. Sometimes p-value and alpha are confused or conflated. Alpha is the value determined before a study is conducted and is key in determining the sample size of the study. Alpha is the researcher saying, “I am willing to accept that there is a 5% chance of concluding that there is a true observed effect when, in fact, there is not (a type I error).” Alpha is a value that is set and chosen by the researcher; it is a known number. However, a p-value is calculated after the research has been conducted and is not known until the research has been completed and the calculations are done. The researcher hopes the p-value is less than alpha, because then they would conclude the findings are statistically significant and would gleefully reject the null hypothesis.
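To see what alpha promises, here is a minimal simulation sketch. It runs many pretend studies in which the null hypothesis really is true (both groups drawn from the same distribution) and counts how often p falls below alpha. Everything about the setup is an assumption for illustration: normally distributed outcomes, a known standard deviation of 1, 30 patients per group, and a fixed random seed.

```python
import random
from math import erfc, sqrt

def z_test_p(sample_a, sample_b, sigma=1.0):
    """Two-sided p-value for a difference in means, known common sigma."""
    mean_a = sum(sample_a) / len(sample_a)
    mean_b = sum(sample_b) / len(sample_b)
    standard_error = sigma * sqrt(1 / len(sample_a) + 1 / len(sample_b))
    z = (mean_a - mean_b) / standard_error
    return erfc(abs(z) / sqrt(2))

def false_positive_rate(alpha=0.05, n_studies=4000, n_per_group=30, seed=1):
    """Fraction of null-is-true studies that still reach p < alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        # Both groups come from the same distribution: no real effect exists.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if z_test_p(group_a, group_b) < alpha:
            rejections += 1
    return rejections / n_studies

rate = false_positive_rate()
print(f"Share of 'significant' results with no true effect: {rate:.3f}")
```

The share should land close to alpha. That long-run error rate is fixed in advance by the researcher, while each individual study produces its own p-value only after the data are in.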

## What is the difference: p-value and a 95% confidence interval?

We often read a paper, and it will list results like this: difference 7% (95% confidence interval [CI] 6-8%, p=0.04). When the 95%CI and p-value are listed together, it might suggest they are measuring the same thing, just expressed in a different way. But that is not the case. Now you know what the p-value means, but what does the 95%CI mean? Confidence intervals are a way to quantify the uncertainty of a result, and they can also be used to judge statistical significance. A 95%CI is a range of values that we can be 95% confident contains the true value. In other words, if an experiment in the same population was repeated 20 times and a 95%CI was calculated each time, we would expect about 19 of those 20 intervals to contain the true value. Use of 95%CI is preferred over the p-value by many journals, as it conveys more information to the reader: the observed value (the experimental result, such as a mean or proportion) and the range of values within which the true value plausibly falls. If we are considering a difference in means or proportions and the 95%CI crosses zero, the result is not statistically significant. If we are considering a ratio and the 95%CI crosses 1, the result is not statistically significant. For example, consider our made up PE experiment. If our data showed a 7% difference in mortality for the anticoagulant group vs placebo as such: difference 7% (95%CI -1% to 10%), then this would not be statistically significant. I don’t want to give that drug to my patient based on these results. The 95%CI in this hypothetical is quite broad and crosses zero. Maybe the sample size of the study was very small, allowing for greater imprecision, resulting in a wide 95%CI. The way to “tighten up” the 95%CI in this case would be to increase the sample size. The narrower the range around the observed value, the more precision the study has and the more confident we can be that the true value is close to our experimental result.
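One way to get a feel for what “95% confident” means is to simulate it. The sketch below repeats a pretend study many times, builds a 95%CI around each observed mean, and counts how often the interval contains the true value. The true mean, standard deviation, sample size, and seed are all hypothetical, and the interval uses the simple known-sigma formula (observed mean ± 1.96 standard errors).

```python
import random
from math import sqrt

def ci_coverage(true_mean=10.0, sigma=2.0, n=25, n_studies=4000, seed=7):
    """Fraction of simulated studies whose 95%CI contains the true mean."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / sqrt(n)   # known-sigma 95% interval
    hits = 0
    for _ in range(n_studies):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        observed_mean = sum(sample) / n
        if observed_mean - half_width <= true_mean <= observed_mean + half_width:
            hits += 1
    return hits / n_studies

coverage = ci_coverage()
print(f"Intervals containing the true mean: {coverage:.3f}")
```

Each study produces a different interval, because each observed mean is different. What stays roughly constant is the share of intervals that capture the truth, which should come out near 0.95.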

So, p-value and 95%CI are not the same. They are both measures of statistical significance and can help us interpret results of research, but they communicate two different ideas. The 95%CI is a range of values within which we are 95% confident the true value falls, which is pretty easy to explain and comprehend. It’s also why readers and many academic journal editors often prefer it. The p-value is the probability of an observed (or more extreme) outcome if the null hypothesis is true. In contrast to 95%CI, P is really hard to explain and to comprehend. But both are often reported together and convey important information about the statistical significance of a research result. In fact, it takes a full paragraph to convey all the information when a 95%CI and p-value are listed next to a research outcome.

To put this together, let’s go back to our original hypothetical PE study result, reported as: difference 7% (95%CI 6-8%, p=0.04).

Here is what this brief numerical statement means in words. When we studied a group of patients with PE who received a new anticoagulant, the mortality rate was 2%. In a group of otherwise similar patients with PE who received placebo, the mortality rate was 9%, for a mortality difference of 7%. We are 95% confident that the true mortality difference falls within the range of 6 to 8%. Assuming there is no difference in drug and placebo, there is only a 4% probability we would have seen a 7% (or greater) mortality difference by chance alone.

## Is there a better way?

In fact, there is. A Bayesian approach incorporates what we already know to determine a prior probability, applies the likelihood ratio, and allows a calculation of the posterior probability. Unlike a p-value, which is the probability of the observed result and all more extreme results under the null hypothesis, a Bayesian approach gives a direct estimate of the probability that a hypothesis is true, given the data.
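The prior-to-posterior step can be sketched with the odds form of Bayes’ rule: posterior odds = prior odds × likelihood ratio. The numbers below (a 25% prior probability that the drug works and a likelihood ratio of 3) are invented purely to show the arithmetic, and the function name is hypothetical.

```python
def posterior_probability(prior_probability: float, likelihood_ratio: float) -> float:
    """Update a prior probability with a likelihood ratio (odds form of Bayes' rule)."""
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Hypothetical inputs: prior belief of 25%, evidence 3 times more likely
# if the drug works than if it does not.
posterior = posterior_probability(prior_probability=0.25, likelihood_ratio=3.0)
print(f"Posterior probability: {posterior:.2f}")
```

Notice that the same evidence moves a skeptical prior and an optimistic prior to different posteriors. The dependence on what we already know is the key difference from a p-value.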

The authors of the Dirty Dozen article explain the issue of significance well, stating, “The most important foundational issue to appreciate is that there is no number generated by standard methods that tells us the probability that a given conclusion is right or wrong. The determinants of the truth of a knowledge claim lie in combination of evidence both within and outside a given experiment, including the plausibility and evidential support of the proposed underlying mechanism.” We will cover a Bayesian approach in another post.

## For further reading

- Goodman S. A dirty dozen: twelve p-value misconceptions. Semin Hematol. 2008 Jul;45(3):135-40. doi: 10.1053/j.seminhematol.2008.04.003. Erratum in: Semin Hematol. 2011 Oct;48(4):302. PMID: 18582619.
- Morgenstern J. EBM masterclass: What exactly is a P value? First10EM. October 11, 2021. Available at: https://doi.org/10.51684/FIRS.83454
- Krzywinski M, Altman N. Significance, P values and t-tests. Nat Methods. 2013 Nov;10(11):1041-2. doi: 10.1038/nmeth.2698. PMID: 24344377.
- Penn State Math and Stats Reviews, https://online.stat.psu.edu/statprogram/reviews
- Here is a nice video by Top Tip Bio: https://youtu.be/ukcFrzt6cHk