I may be a pee scientist, but today is brought to you by the letter “P” not the product. “P” is something all journalists, all lay readers of science articles, teachers, medical practitioners, and all scientists should know about. Alas, in my experience many don’t and as a consequence “P” is abused. Hence this post. Even more abused is the word “significant” often associated with P; more about that later.
P is short for probability. Stop! Don't stop reading just because statistics was a bit boring at school; understanding P may be the difference between saving lives and losing them. If nothing so dramatic, it may save you from making a fool of yourself.
P is a probability. It is normally reported as a fraction (eg 0.03) rather than a percentage (3%). You will be familiar with it from tossing a coin. You know there is a 50% or one half or 0.5 chance of obtaining a head with any one toss. If you work out all the possible combinations of two tosses then you will see that there are four possibilities, one of which is two heads in a row. So the prior (to tossing) probability of two heads in a row is 1 out of 4 or P=0.25. You will see P in press releases from research institutes, blog posts, abstracts, and research articles. This one is from today:
“..there was significant improvement in sexual desire among those on testosterone (P=0.05)” [link]
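The coin-toss arithmetic above can be checked by listing every outcome. This is only a small sketch, assuming a fair coin; the variable names are my own choice for illustration.

```python
from itertools import product

# Every equally likely outcome of two tosses of a fair coin:
# HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))

# Keep only the outcomes that are two heads in a row.
two_heads = [o for o in outcomes if o == ("H", "H")]

# Probability = favourable outcomes / all outcomes
p_two_heads = len(two_heads) / len(outcomes)
print(len(outcomes), p_two_heads)  # 4 0.25
```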
So, P is easy, but interpreting P depends on the context. This is hugely important. What I am going to concentrate on is the typical medical study that is reported. There is also a lesson for a classroom.
One kind of study reporting a P value is a trial where one group of patients is compared with another. Usually one group of patients has received an intervention (eg a new drug) and the other receives regular treatment or a placebo (eg a sugar pill). If the study is done properly, a primary outcome should have been decided beforehand. The primary outcome must measure something, perhaps the number of deaths in a one year period, or the mean change in concentration of a particular protein in the blood. The result of interest is how what is measured differs between the group getting the new intervention and the group not getting it. Associated with it is a P value, eg:
“CoQ10 treated patients had significantly lower cardiovascular mortality (p=0.02)” [link]
To interpret the P we must first understand what the study was about and, in particular, understand the “null hypothesis.” The null hypothesis is simply the idea the study was trying to test (the hypothesis) expressed in a particular way. In this case, the idea is that CoQ10 may reduce the risk of cardiovascular mortality. Expressed as a null hypothesis we don’t assume that it could only decrease rates, but we allow for the possibility that it may increase them as well (this does happen with some trials!). So, we express the hypothesis in a neutral fashion. Here that would be something like: the risk of cardiovascular death is the same in the population of patients who take CoQ10 and in the population which does not take CoQ10. If we think about it for a minute, then if the proportion of patients who died of a cardiovascular event was exactly the same in the two groups then the risk ratio (the CoQ10 group proportion divided by the non CoQ10 group proportion) would be exactly 1. The P value, then, answers the question:
If the null hypothesis were true (the risk of cardiovascular death is the same in both groups), what is the probability (ie P) that, simply by chance, the measured risk ratio would differ from 1 by at least as much as was observed?
The “by chance” is because when the patients were selected for the trial there is a chance that they don’t fairly represent the true population of every patient in the world (with whatever condition is being studied) either in their basic characteristics or their reaction to the treatment. Because not every patient in the population can be studied, a sample must be taken. We hope that it is “random” and representative, but it is not always. For teachers, you may like to do the lesson at the bottom of the page to explain this to children. Back to our example, some numbers may help.
Suppose we have 1000 patients receiving Drug X and 2000 receiving a placebo. If, say, 100 patients in the Drug X group die in 1 year, then we say the risk of dying in 1 year is 100/1000 or 0.1 (10%). If 500 patients in the placebo group die in 1 year, then the risk is 500/2000 or 0.25 (25%). The risk ratio is 0.1/0.25 = 0.4. The difference between this and 1 is 0.6. What is the probability that we arrived at a difference of at least 0.6 simply by chance? I did the calculation and got P<0.0001. This means that, if the null hypothesis were true, there is less than a 1 in 10,000 chance of seeing a difference this large. Another way of thinking of this is that if we did the study 10,000 times, and the null hypothesis were true, we’d expect to see a result like the one we saw less than once.
What is crucial to realise is that the P value depends on the number of subjects in each group. If instead of 1000 and 2000 we had 10 and 20, and instead of 100 and 500 deaths we had 1 and 5, then the risks and risk ratio would be the same, but the P value is 0.63, which is very high (a 63% chance of observing a difference at least as large as the one we observed when there is really no difference). Another way of thinking about P is as the probability that we will state there is a difference of at least the size we see, when there is really no difference at all.
If studies are reported without P values then at best take them with a grain of salt. Better, ignore them totally.
It is also important to realise that if any one study measures lots of things and compares them between two groups, then simply because of random sampling (by chance) some of the P values will be low. This leads me to my next point…
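For readers who want to try the numbers themselves, here is a rough sketch of one way to do the calculation. The post does not say which test was used; I am assuming a two-sided Fisher's exact test here, which happens to give 0.63 for the small trial. The function names are made up for this illustration.

```python
from math import comb

def risk_ratio(deaths_a, n_a, deaths_b, n_b):
    """Risk in group A divided by risk in group B."""
    return (deaths_a / n_a) / (deaths_b / n_b)

def fisher_exact_two_sided(deaths_a, n_a, deaths_b, n_b):
    """Two-sided Fisher's exact test P value (an assumed choice of test).

    Under the null hypothesis the deaths are shared out between the
    groups at random, so the number landing in group A follows a
    hypergeometric distribution. The P value adds up the probability of
    every split that is no more likely than the split actually observed.
    """
    total = n_a + n_b
    total_deaths = deaths_a + deaths_b
    denom = comb(total, n_a)

    def prob(k):
        # Probability that exactly k of the deaths fall in group A
        return comb(total_deaths, k) * comb(total - total_deaths, n_a - k) / denom

    k_min = max(0, n_a - (total - total_deaths))
    k_max = min(total_deaths, n_a)
    p_observed = prob(deaths_a)
    # Small tolerance so floating point ties with the observed split count
    return sum(p for p in map(prob, range(k_min, k_max + 1))
               if p <= p_observed * (1 + 1e-9))

# Large trial: 100/1000 deaths on Drug X versus 500/2000 on placebo
rr_large = risk_ratio(100, 1000, 500, 2000)
p_large = fisher_exact_two_sided(100, 1000, 500, 2000)

# Small trial: same risks, 1/10 versus 5/20
rr_small = risk_ratio(1, 10, 5, 20)
p_small = fisher_exact_two_sided(1, 10, 5, 20)

print(round(rr_large, 2), p_large < 0.0001)    # 0.4 True
print(round(rr_small, 2), round(p_small, 2))   # 0.4 0.63
```

The risk ratio is identical in both trials; only the number of patients changes, and that alone moves P from tiny to 0.63.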
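To put a rough number on this, the sketch below works out the chance of getting at least one P value under 0.05 when nothing real is going on. It assumes the comparisons are independent of each other, which real measurements in one study often are not, so treat it as an illustration only. The function name is hypothetical.

```python
def prob_at_least_one_low_p(num_comparisons, threshold=0.05):
    """Chance that at least one of several independent comparisons gives
    P below the threshold when every null hypothesis is true.

    Assumes independence between comparisons (an idealisation).
    """
    return 1 - (1 - threshold) ** num_comparisons

for m in (1, 5, 20):
    print(m, round(prob_at_least_one_low_p(m), 2))
# 1 0.05
# 5 0.23
# 20 0.64
```

So a study that compares 20 unrelated things between two groups has, under these assumptions, roughly a two in three chance of turning up at least one "low" P value by chance alone.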
The myth of significance
You will often see the word “significant” used with respect to studies, for example:
“Researchers found there was a significant increase in brain activity while talking on a hands-free device compared with the control condition.” [Link]
This is a wrong interpretation: “The increase in brain activity while talking on a hands-free device is important.” or “The increase in brain activity while talking on a hands-free device is meaningful.”
“Significant” does not equal “Meaningful” in this context. All it means is that the P value, calculated assuming the null hypothesis is true, is less than 0.05. If I had my way I’d ban the word significant. It is simply a lazy habit of researchers to use this shorthand for P<0.05. It has come about simply because someone somewhere started to do it (and call it “significance testing”) and the sheep have followed. As I say to my students, “Simply state the P value, that has meaning.”*
For the teachers
- The ability to count and divide
Ask the children what the chances of getting a “Heads” are. Have a discussion and try and get them to think that there are two possible outcomes each equally probable.
Get each child to toss their coin 4 times and get them to write down whether they got a head or tail each time.
Collate the number of heads in a table like this:
#heads | #children getting this number of heads
0 |
1 |
2 |
3 |
4 |
If your classroom size is 24 or larger then you may well have someone with 4 heads or 0 (4 tails).
Ask the children if they think this is amazing or accidental?
Then, get the children to continue tossing their coins until they get either 4 heads or 4 tails in a row. Perhaps make it a competition to see how fast they can get there. They need to continue to write down each head and tail.
You may then get them to add all their heads and all their tails. By now the ratio of heads to tails (get them to divide the number of heads by the number of tails) should be getting close to 1. If you like, go one step further and collate all the data. The proportion of heads (heads divided by total tosses) should be approaching 0.5.
Discuss the idea that getting 4 heads or 4 tails in a row was simply due to chance (randomness).
For more advanced classes, you may talk about statistics in medicine and in the media. You may want to use some specific examples of one-off trials that appeared to show a difference, but where later repeats found the difference to be accidental.
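Teachers who want to check the numbers before class may find this sketch useful. It works out the chance that at least one child in a class gets 4 heads or 4 tails from their first four tosses, and then simulates many classes. It assumes fair coins and independent tosses; the function names and the random seed are my own choices for illustration.

```python
import random

def prob_someone_gets_four_the_same(class_size):
    """Chance that at least one child gets 4 heads or 4 tails in 4 tosses."""
    # 2 of the 16 equally likely sequences are all heads or all tails
    p_child = 2 / 16
    return 1 - (1 - p_child) ** class_size

def simulate_class(class_size, rng):
    """Toss 4 coins per child; count children whose tosses all match."""
    count = 0
    for _ in range(class_size):
        tosses = [rng.choice("HT") for _ in range(4)]
        if len(set(tosses)) == 1:  # all heads or all tails
            count += 1
    return count

p_class_of_24 = prob_someone_gets_four_the_same(24)
print(round(p_class_of_24, 2))  # 0.96

# Simulate 2000 classes of 24 children with a fixed seed for repeatability
rng = random.Random(1)
counts = [simulate_class(24, rng) for _ in range(2000)]
average_per_class = sum(counts) / len(counts)
# Expect an average near 24 * 2/16 = 3 children per class
print(average_per_class)
```

In other words, in a class of 24 it would be more surprising if nobody got four of the same.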
*For the pedantic. In a controlled trial the number of participants is selected on the basis of pre-specifying a (hopefully) meaningful difference in the outcome between the intervention and control arms and a probability of Type I (alpha) and Type II (beta) errors. The alpha is often 0.05. In this specific situation, if P<0.05 then it may be reasonable to talk about a significant difference because the alpha was pre-specified and used to calculate the number of participants in the study.