Written by Clay Smith
Spoon Feed
This article helps us be clearer with our language as we write about and explain statistics.
Why does this matter?
This article is meant for potential authors who want to publish in Chest, which is not something we would usually cover. But it was so helpful, I just couldn’t help myself.
Statistical pearls for mere mortals
Here are the stats pearls I take away.

There is a right and wrong way to do a study. An RCT should follow CONSORT guidelines. Observational studies should follow STROBE guidelines. See the EQUATOR Network for more. If the study you’re reading doesn’t mention or seem to follow these, beware, you are probably reading a poorly done paper.

There should be a clear study question and clear explanation of the analysis used to answer the primary question. If you can’t find this, throw the paper in the trash.

P values – We don’t accept the null hypothesis; we either reject or do not reject it. P values just above 0.05 should not be described as “a trend.” A trend indicates something is moving. A p value over 0.05 is not a trend; it just means an endpoint did not meet that measure of statistical significance.

A p value of 0.03 does not mean there is a 3% probability that results are due to chance. It doesn’t quantify the probability of a hypothesis. Rather, it is the probability of seeing results at least as extreme as those observed if the null hypothesis is really true. (The probability of rejecting the null hypothesis when it is really true is the significance level, alpha, usually set at 0.05.) Similarly, a 95% CI does not mean there is a 95% chance the true value falls in that range of numbers; it means that if the experiment were repeated in many different samples, about 95% of the intervals calculated would contain the true parameter value. At first glance, this seems like splitting hairs, but smart people swear it’s not. You stats wonks put something in the comments for the rest of us.
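
For anyone who finds the repeated-sampling idea easier to see than to read, here is a minimal simulation sketch. It is not from the Chest article. It assumes an idealized world (normally distributed data, a known standard deviation, and the exact same sampling every time), and the function names and settings are made up for illustration. It repeats one "experiment" many times when the null hypothesis is true, then counts how often p falls below 0.05 and how often the 95% CI contains the true mean.

```python
import math
import random
import statistics


def z_test_and_ci(sample, null_mean, sigma):
    """Two-sided z-test p-value and 95% CI for a mean, assuming sigma is known."""
    n = len(sample)
    mean = statistics.fmean(sample)
    se = sigma / math.sqrt(n)
    z = (mean - null_mean) / se
    # Probability of a |z| at least this large if the null hypothesis is true.
    p = math.erfc(abs(z) / math.sqrt(2))
    half_width = 1.96 * se
    return p, (mean - half_width, mean + half_width)


def simulate(n_experiments=10_000, n=30, true_mean=0.0, sigma=1.0, seed=1):
    """Repeat the same experiment many times with the null hypothesis true."""
    rng = random.Random(seed)
    rejections = 0
    covered = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        p, (lo, hi) = z_test_and_ci(sample, null_mean=true_mean, sigma=sigma)
        rejections += p < 0.05            # a false positive, since the null is true
        covered += lo <= true_mean <= hi  # this interval caught the true mean
    return rejections / n_experiments, covered / n_experiments


false_positive_rate, coverage = simulate()
print(f"Share of experiments with p < 0.05 (null true): {false_positive_rate:.3f}")
print(f"Share of 95% CIs containing the true mean:      {coverage:.3f}")
```

The two shares should land near 5% and 95%. Note what the 95% describes: the long-run behavior of the procedure across many repeats, not the odds for any single interval you happen to be looking at.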

Statistical significance does not equal clinical significance. For example, you may find a statistically significant ½ point difference on a pain scale, but is decreasing pain from a 10 to a 9.5 really making your patient feel better?
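
A quick sketch of how this happens, using simulated (not real) pain scores. The group means, standard deviation, sample size, and the "minimal clinically important difference" threshold below are all hypothetical values chosen for illustration, and the scores are not clipped to a 0 to 10 scale. With enough patients, a half-point difference produces a tiny p value while still falling short of the threshold.

```python
import math
import random
import statistics


def compare_groups(a, b):
    """Difference in means (a - b) and a two-sided p-value (normal approximation)."""
    diff = statistics.fmean(a) - statistics.fmean(b)
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    p = math.erfc(abs(diff / se) / math.sqrt(2))
    return diff, p


rng = random.Random(7)
n = 2000  # patients per group (hypothetical)
control = [rng.gauss(7.0, 2.0) for _ in range(n)]  # simulated pain scores
treated = [rng.gauss(6.5, 2.0) for _ in range(n)]  # true benefit: half a point

MCID = 1.0  # hypothetical minimal clinically important difference, in points

diff, p = compare_groups(control, treated)
statistically_significant = p < 0.05
clinically_meaningful = abs(diff) >= MCID

print(f"Mean difference: {diff:.2f} points, p = {p:.2g}")
print(f"Statistically significant: {statistically_significant}")
print(f"Meets the assumed clinical threshold: {clinically_meaningful}")
```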

Multivariate and propensity analyses help mitigate but do not remove the impact of confounders and cannot act as a substitute for a randomized controlled trial when it comes to determining causality. For instance, we shouldn’t say, “multivariate analyses removed confounding.”
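
Here is a toy simulation of why adjustment mitigates but does not remove confounding. Everything in it is invented for illustration: a treatment with no true effect, one confounder we measured (z), and one we did not (u). Simple stratification stands in for a multivariable model here; it is a stand-in, not the method any particular paper uses.

```python
import random
import statistics

rng = random.Random(3)
rows = []
for _ in range(40_000):
    z = rng.random() < 0.5  # measured confounder
    u = rng.random() < 0.5  # unmeasured confounder
    # Both confounders make treatment more likely...
    treated = rng.random() < 0.2 + 0.3 * z + 0.3 * u
    # ...and both raise the outcome. The treatment itself has NO effect.
    outcome = 1.0 * z + 1.0 * u + rng.gauss(0, 1)
    rows.append((z, u, treated, outcome))


def mean_diff(subset):
    """Mean outcome in treated minus mean outcome in untreated."""
    t = [y for _, _, is_treated, y in subset if is_treated]
    c = [y for _, _, is_treated, y in subset if not is_treated]
    return statistics.fmean(t) - statistics.fmean(c)


crude = mean_diff(rows)
# "Adjust" for the measured confounder: compare within each level of z, then average.
strata = [mean_diff([r for r in rows if r[0] == level]) for level in (False, True)]
adjusted = statistics.fmean(strata)

print("True treatment effect: 0.00")
print(f"Crude estimate:        {crude:.2f}")
print(f"Adjusted for z only:   {adjusted:.2f}")
```

The adjusted estimate moves toward zero but does not get there, because nothing in the analysis can account for u. Randomization would balance both z and u across groups, which is why adjustment is not a substitute for it.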

It is more helpful to discuss the clinical impact of a result, rather than just the statistical facts. For example, one could report the sensitivity, specificity, and AUC for a test. But more relevant would be to report the stats facts and clinical import. For example, if the sensitivity for appendix ultrasound increased, X% of CT scans could be avoided.

The authors say we should avoid saying “may” or “might.” Oh boy…I do that all the time when I don’t want to convey causality. The authors note that saying a hypothesis “may” be true is the reason we do a study. It is also always a true statement unless a hypothesis is proven false, which the authors point out is very difficult in science. Instead, we should say, “There is evidence that X was associated with Y, and an RCT is needed.” I may change the way I write in the future 🙂 .
Source
Statistical Analysis and Reporting Guidelines for CHEST. Chest. 2020 Jul;158(1S):S3-S11. doi: 10.1016/j.chest.2019.10.064.
Regarding 95%CI –
A 95% CI means that if an experiment were repeated 100 times, and each of those times you were sampling the same population in the exact same way, about 95 of the confidence intervals produced by those experiments would contain the true (population) value. Admittedly, it can be pretty hard to wrap your head around this definition, since it is effectively defining something by itself. There are a couple of reasons, though, why understanding the real definition is actually important compared to the common interpretation of “There’s a 95% chance the true value falls in this range.” First, the actual interpretation doesn’t assume as much as people tend to think it does about the accuracy of the point estimate (sample value) compared to the population value (true value). We can never know the true value unless we performed an experiment on 100% of people in a population, which never happens in real life. Practically, this means we use sampling to estimate means/ORs/etc. and their confidence intervals from the population. Unfortunately, sampling is inexact and always somewhat biased – it’s virtually impossible in biomedical science to sample the same population in exactly the same way multiple times with consistency (the reasons for this are myriad, just take my word for it). Because of this, the scenario described in the true definition of a CI (where the experiment is repeated 100 times and 95% of the confidence intervals contain the true value) is wholly hypothetical – no biomedical study could ever actually be repeated 100 times with the exact same sampling of the exact same population. A good example of this is to look at the forest plot in a meta-analysis. Say there are 20 studies purporting to test a single hypothesis: if it were true that they were sampling the same population in the same way, then by the definition of confidence intervals there should be at least 1 point on average contained by the confidence intervals of 19 of them.
In many meta-analyses the forest plots don’t look like this, though: studies are all over the place with non-overlapping CIs. The reason is heterogeneity, arising both from differences in the way sampling is conducted in each study and from the fact that many studies purporting to ask the same question about the same population actually enroll slightly different populations in a biased way. This is true at every level of evidence – even RCTs. The upshot of all this is that not only is it incorrect to say the true value will be in the CI 95% of the time, it is almost invariably true that the actual chance of the true value being in your 95% CI is something much less than 95%. For that to occur, all of the assumptions about repeated sampling of the same population done in the exact same way multiple times would have to hold true, which, as discussed, never actually happens in real life. Besides, you DID NOT resample the population a bunch of times. You are reading about one experiment done once, and since the actual definition of CI revolves around repeated sampling, it actually says much less about the validity of your point estimate in this single experiment than most people believe. As a takeaway, just look at the CI as a general and imperfect measure of uncertainty – wider is more uncertain, narrow less so – but no matter what you cannot make inferences about the true value’s chance of being within certain bounds without repeating the experiment… again and again and again.
Second (my own biggest way-too-nitpicky-stats-nerd pet peeve), people tend to assume that values towards the center of a confidence interval are more likely to be the true value than those at the borders, and this isn’t true. If a study shows an OR of 4.0 with 95% CI 1.1 – 6.0, the value of the true OR is no more or less likely to be 4.0 than it is 1.1, or 6.0 for that matter. This is for reasons similar to those described above regarding repeating the experiment 100 times – in the true definition of a CI, it’s wholly plausible that 1.1 is the true value, and it just happens that in this experiment it was found on the tail of the CI drawn. There’s nothing in the definition to say that in 94 of the next 99 replications the number all of those CIs contain is 1.1 any less so than 4.0, since you have not actually done those experiments yet. There IS a distant cousin of the confidence interval, called a highest-density interval (HDI), where central values are more likely than those on the tails (and which is therefore much more helpful in interpreting uncertainty). Unfortunately, you won’t see this often; it’s an output of Bayesian statistics and not frequentist statistics (aka what most think of as just “statistics”, since the Bayesian family of stats methods is rarely used in biomedical science).
Hope that long-winded answer is slightly less clear than mud.
nick harrison, MD MSc, indiana university EM
Finally, someone pointed out a serious problem throughout the literature!
"P values just above 0.05 should not be described as “a trend.” A trend indicates something is moving. A p value over 0.05 is not a trend; it just means an endpoint did not meet that measure of statistical significance."
This is pervasive in medical articles and throughout the blogs. Even my idols in the EMA/EMRap world do this every single month. "Trend" gets used simply when the number you want is not significantly larger than another. This is NOT correct. A trend only indicates a movement in value over time. In a single article multiple numbers will be not significantly different, but an author will pick the ones that they want & call it a trend – and the ones they don’t like get ignored.
I would suggest article reviewers do a simple word search for "trend", "tends to" … If this is not a proper use of the word I would reject the paper. A tip to an aspiring researcher out there – Take a year of articles from a journal (maybe the Annals of EM), text search the entire contents for "trend" and similar terms. Review any that pop up. I hypothesize you will find it is commonly misused. I also think you might find that in the same paper, results unfavorable to the author’s hypothesis will not be listed as trends.
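
For anyone who wants to try that word search, a rough sketch might look like the following. The pattern, the function name, and the sample abstract are all made up for illustration; a real project would need the article text in hand and a human reading every hit, since "trend" is perfectly legitimate when it describes change over time.

```python
import re

# Matches "trend", "trends", "trended", "trending", "tend(s/ed) to(ward/s)", "tendency".
TREND_PATTERN = re.compile(
    r"\b(?:trend(?:s|ed|ing)?|tend(?:s|ed)?\s+to(?:ward|wards)?|tendency)\b",
    re.IGNORECASE,
)


def find_trend_language(text):
    """Return the sentences in `text` that contain trend-like wording."""
    hits = []
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if TREND_PATTERN.search(sentence):
            hits.append(sentence.strip())
    return hits


# Hypothetical abstract text, invented for this example.
sample = (
    "Mortality was lower in the treatment group (p = 0.03). "
    "There was a trend toward shorter length of stay (p = 0.07). "
    "Readmission rates tended to be lower (p = 0.09). "
    "Adverse events did not differ between groups (p = 0.08)."
)

hits = find_trend_language(sample)
for sentence in hits:
    print("FLAG:", sentence)
```

In the invented sample, the two favorable non-significant results get flagged, while the unfavorable one with a similar p value is reported without any "trend" language, which is exactly the pattern worth checking for.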
Thanks Clay – first time I have seen this brought up.
Geoffrey Geer, MD Colorado