What is a p-value?
We’ve talked about data and descriptive statistics. We’ve talked about the measures we use and about stratifying by populations and subpopulations to assess disparities. We talked about the types of studies, and last week we moved into our first statistical test of a hypothesis, the chi-square. Today we will talk about how we know whether the results of our statistical test are significant by using a p-value.
The p-value is the probability of obtaining a result at least as extreme as the one we observed if the null hypothesis is true. The cutoff at which a p-value counts as significant is actually arbitrary; it simply needs to be selected before the experiment begins. Most commonly researchers will treat a p-value below 0.05 as significant. But what does this really mean?
Remember when we draw our sample to do our experiment we are estimating a value. Every sample is a different estimate of the true value. This means we can always expect a range of numbers to occur when measuring, and those numbers may be different estimates of the same value. What we want to know with statistics is if those values are different enough that they probably represent different things. The p-value tells us how likely it is that we would see a difference this large if the values really were estimates of the same thing.
As a practical example, think about flipping a coin. If you did 100 coin flips with a nickel, and I did 100 coin flips with a quarter, neither of us is likely to get exactly 50 heads even though we use fair coins. But both the nickel and the quarter would produce a number of “heads” flips within the range accepted as normal for a fair coin. Let's say we do this experiment 100 times, so we have 100 counts of the number of heads results. We could plot those counts on a graph and we would end up with something that looked like a normal (bell) curve, with most of the counts falling within a narrow range and fewer results as we move to more “extreme” numbers of heads.
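The repeated coin flip experiment is easy to simulate. The sketch below is only an illustration, not part of the original exercise: the function names, the fixed seed and the text histogram are all assumptions chosen for the example. It runs 100 experiments of 100 fair flips each and prints how often each heads count came up.

```python
import random
from collections import Counter

def count_heads(n_flips, p_heads, rng):
    """Flip a coin n_flips times and return how many flips came up heads."""
    return sum(1 for _ in range(n_flips) if rng.random() < p_heads)

def run_experiments(n_experiments=100, n_flips=100, p_heads=0.5, seed=1):
    """Repeat the n_flips experiment n_experiments times.

    The seed is an arbitrary choice (an assumption) so the run is repeatable.
    Returns a list with one heads count per experiment.
    """
    rng = random.Random(seed)
    return [count_heads(n_flips, p_heads, rng) for _ in range(n_experiments)]

counts = run_experiments()

# Text histogram: one row per heads count, one '#' per experiment.
histogram = Counter(counts)
for heads in sorted(histogram):
    print(f"{heads:3d} heads | {'#' * histogram[heads]}")
```

Most rows should cluster near 50 heads, with shorter rows toward the extremes. Passing a different p_heads (for example 0.7) simulates an unfair coin.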
Now if a third person flipped an unfair coin 100 times, say one weighted toward heads, we would expect the number of heads results to be within the range accepted as normal for that unfair coin. This is a different value from a fair coin, so if we plotted 100 experiments of 100 flips of the unfair coin we would expect our normal (bell) curve to end up in a different place on the graph, with the mean number of heads centered at a larger number.
When we use a p-value, it tells us where our result falls on the normal (bell) curve for the null hypothesis. In our coin flip example, the bell curve of a fair coin is our null hypothesis. So when we get a p-value less than 0.05, it tells us that a heads count at least this far from 50 happens in less than 5% of the experiments in which people flip a fair coin 100 times. If this happens, we would say our result is very unlikely for a fair coin, so we reject the null hypothesis in favor of the assumption that this is not a fair coin.
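For the coin example the p-value can be computed exactly rather than read off a curve. The sketch below is an illustration under stated assumptions: it uses the exact binomial distribution instead of the bell curve approximation, it counts results in both directions (too many or too few heads), and the function names are made up for this example.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_p_value(observed_heads, n_flips=100, p_heads=0.5):
    """P-value for a heads count under the null hypothesis of a fair coin.

    Adds up the probability of every heads count that is at least as far
    from the expected count as the observed one, in either direction.
    """
    expected = n_flips * p_heads
    distance = abs(observed_heads - expected)
    return sum(
        binomial_pmf(k, n_flips, p_heads)
        for k in range(n_flips + 1)
        if abs(k - expected) >= distance
    )

# Compare a few outcomes against the usual 0.05 cutoff.
for heads in (50, 55, 60, 61, 70):
    p = two_sided_p_value(heads)
    verdict = "reject the null" if p < 0.05 else "fail to reject the null"
    print(f"{heads} heads: p = {p:.4f} -> {verdict}")
```

With 100 flips, 60 heads lands just above the 0.05 cutoff and 61 heads lands just below it, which shows how sharp an arbitrary cutoff can be.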
The objectives tell us the researcher wanted to know if previous uterine scar dehiscence increases risks. So this is what we are likely to see in the study:
Null Hypothesis: Women with a previous uterine scar dehiscence have the same rate of adverse perinatal outcomes as women without a previous uterine scar dehiscence.
Alternative Hypothesis: Women with a previous uterine scar dehiscence have a different rate of adverse perinatal outcomes than women without a previous uterine scar dehiscence.
Now look at the results and notice the p-values. Essentially, this study uses a two-by-two table like we did with chi-squares, counting the number of times each of these events happens for women in two groups: those with and those without previous uterine scar dehiscence. The statistical test then compares the group with the previous dehiscence against the group without, and the p-value tells us whether the difference between them is larger than we would expect by chance.
A p-value less than 0.001 means that if previous dehiscence made no difference, you would expect to see a gap this large between the two groups less than 0.1% of the time. This is a small enough value for the researchers to feel comfortable saying this sample gives evidence that the rates of preterm delivery, low birth weight and peripartum hysterectomy are not the same in women with and without previous dehiscence.
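To see how a two-by-two table turns into a p-value, here is a minimal sketch of a Pearson chi-square test written with only the Python standard library. The counts are invented for illustration and are NOT the study's data. The sketch also assumes no continuity correction and may not match the exact test the researchers used.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the two-by-two table [[a, b], [c, d]].

    a, b = outcome present / absent in group 1
    c, d = outcome present / absent in group 2
    No continuity correction is applied (an assumption of this sketch).
    Returns (statistic, p_value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the upper tail of the chi-square
    # distribution equals erfc(sqrt(statistic / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# HYPOTHETICAL counts, made up for this example:
# group 1 (previous dehiscence):    30 with the outcome,  70 without
# group 2 (no previous dehiscence): 100 with the outcome, 900 without
statistic, p_value = chi_square_2x2(30, 70, 100, 900)
print(f"chi-square = {statistic:.2f}, p = {p_value:.2g}")
print("p < 0.001" if p_value < 0.001 else "p >= 0.001")
```

When the two groups have identical rates, the statistic is 0 and the p-value is 1, meaning the data look exactly like what the null hypothesis predicts.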
Birth Worker Survey
In the demographic questions, we asked about the level of education. We can split the group into those who have already completed their college degree, and those who are still in process to see if there are any differences in attitudes about birth.
The first statement to analyze is: Evidence Based Practice doesn’t apply to midwifery because every birth is different. Respondents overwhelmingly disagreed with this statement (85% total). Maybe there is something about those who agreed that is different from those who disagreed?
So my null hypothesis would basically be there is no difference in belief about use of evidence based practice between birth professionals who have and have not already completed college. My alternative hypothesis is that there is a difference.
When I run the analysis I get a p-value of 0.575. This means if the null hypothesis is true, I can expect to get a result at least as extreme as what I found about 57.5% of the time. This data does not give me evidence to reject my null hypothesis, so I fail to reject.
A question asked only of those who work as doulas was whether doula work was currently a job or a hobby. Splitting the 21 doulas into those who consider it their job (14) and those who consider it a hobby or goal (7) gives us another variable to check out the data. When I analyze the same statement about evidence based practice I find a p-value of 0.147. Do you know what this means? If we assume the 0.05 cutoff generally used, again I will fail to reject the null hypothesis that there is no difference between the two groups.
What about something specific to doulas? We asked the doulas how likely they were to recommend others work as doulas. Results were split, with 70% saying yes, they were likely or very likely to recommend others work as a doula. When the doulas are split into those who doula as a job and those for whom it is still a hobby, we find 75% of hobby doulas would recommend the work, while 66% of employment doulas would recommend the job. The p-value is 0.679, so is this a significant difference or do we assume the groups are the same?