If you've ever read a wild headline like, "Study Shows Chewing Rocks Prevents Cancer," you've probably wondered how that could be possible. If you look closer at this type of article you may find that the sample size for the study was a mere handful of people. If one person in a group of five chewed rocks and didn't get cancer, does that mean chewing rocks prevented cancer?
Definitely not. The study for such a conclusion doesn't have statistical significance—though the study was performed, its conclusions don't really mean anything because the sample size was small.
So what is statistical significance, and how do you calculate it? In this article, we'll cover what it is, when it's used, and go step-by-step through the process of determining if an experiment is statistically significant on your own.
What Is Statistical Significance?
As I mentioned above, the fake study about chewing rocks isn't statistically significant. What that means is that the conclusion reached in it isn't valid, because there isn't enough evidence to rule out random chance.
A statistically significant result would be one where, after rigorous testing, you reach a certain degree of confidence in the results. We call that degree of confidence our confidence level, which demonstrates how sure we are that our data was not skewed by random chance. More specifically, the confidence level describes how often, over repeated sampling, the resulting interval would contain the true value of the parameter we're testing.
There are three major ways of determining statistical significance:
- If you run an experiment and your p-value is less than your alpha (significance) level, your test is statistically significant
- If your confidence interval doesn't contain your null hypothesis value, your test is statistically significant
- These two checks agree: if your p-value is less than your alpha, your confidence interval will not contain your null hypothesis value, and your test will therefore be statistically significant
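As a quick sketch in Python, the first rule is simple enough to write as a one-line check (the 0.05 alpha here is just the common default, not a universal rule):

```python
def is_significant(p_value, alpha=0.05):
    """Rule 1: a result is statistically significant when p < alpha."""
    return p_value < alpha

# A p-value of 0.03 clears a 0.05 alpha; a p-value of 0.20 does not
print(is_significant(0.03), is_significant(0.20))  # True False
```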
This info probably doesn't make a whole lot of sense if you're not already acquainted with the terms involved in calculating statistical significance, so let's take a look at what it means in practice.
Say, for example, that we want to determine the average typing speed of 12-year-olds in America. We'll confirm our results using the second method, our confidence interval, as it's the simplest to explain quickly.
First, we'll need to set our significance level, which is the threshold our p-value has to fall below. The p-value tells us the probability of getting results at least as extreme as those in our sample data if our null hypothesis (a statement that there is no difference between the things being tested, such as that all 12-year-old students type at the same speed) is true. A typical significance level is 5 percent, or 0.05, which is appropriate for many situations but can be lowered for more sensitive experiments, such as in building airplanes. For our experiment, 5 percent is fine.
With a significance level of 5 percent, our confidence level is 95 percent; it's always the complement of your significance level. Our confidence level expresses how reliable the procedure is: if we were to repeat our experiment many times with new samples, about 95 percent of the intervals we built would contain the true average. It is not a representation of the likelihood that the entire population will fall within this range.
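To make the repeated-sampling interpretation concrete, here's a small Python simulation with made-up numbers: it assumes a normally distributed population of typing speeds and builds a 95 percent interval from each sample's mean and standard error.

```python
import random
from statistics import mean, stdev

random.seed(42)
true_mean, true_sd, n = 45, 5, 100  # hypothetical population values

covered = 0
trials = 1000
for _ in range(trials):
    # Draw a fresh sample of 100 typing speeds from the population
    sample = [random.gauss(true_mean, true_sd) for _ in range(n)]
    m, s = mean(sample), stdev(sample)
    half_width = 1.96 * s / n ** 0.5  # 95% interval half-width
    if m - half_width <= true_mean <= m + half_width:
        covered += 1

print(covered / trials)  # close to 0.95
```

About 95 percent of the simulated intervals capture the true average, which is exactly what the confidence level promises.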
Testing the typing speed of every 12-year-old in America is unfeasible, so we'll take a sample—100 12-year-olds from a variety of places and backgrounds within the US. Once we average all that data, we determine the average typing speed of our sample is 45 words per minute, with a standard deviation of five words per minute.
From there, we can estimate that the average typing speed of 12-year-olds in America is somewhere between $45 - z \cdot \frac{5}{\sqrt{100}}$ and $45 + z \cdot \frac{5}{\sqrt{100}}$ words per minute. Because we're estimating an average, we scale by the standard error, $s/\sqrt{n}$, rather than the raw standard deviation. That's our confidence interval: a range of numbers we can be confident contains our true value, in this case the real average typing speed of 12-year-old Americans. Our z-score, $z$, is determined by our confidence level.
In our case, a 95 percent confidence level gives $z = 1.96$, so the interval runs from $45 - 1.96(0.5)$ to $45 + 1.96(0.5)$, making our confidence interval 44.02 to 45.98 words per minute.
A larger standard deviation, say 15 words per minute, would widen the interval to $45 \pm 1.96(1.5)$, or 42.06 to 47.94; the confidence level stays the same, but the estimate is less precise.
More importantly for our purposes, if your confidence interval doesn't contain your null hypothesis value, your result is statistically significant. Say, for example, an earlier claim held that the true average was 40 words per minute; since 40 falls outside our interval, our result is statistically significant.
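Here is that interval calculation as a short Python sketch. Note that it follows the textbook construction for a mean, scaling by the standard error $s/\sqrt{n}$ rather than by the raw standard deviation:

```python
from math import sqrt

sample_mean = 45   # words per minute
sample_sd = 5
n = 100
z = 1.96           # z-score for a 95% confidence level

# A confidence interval for a mean scales by the standard error, s / sqrt(n)
standard_error = sample_sd / sqrt(n)
lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error

print(round(lower, 2), round(upper, 2))  # 44.02 45.98
```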
A common cause of skewed data is sampling bias, which happens when the sample you draw isn't representative of the population you're trying to study. (This is distinct from sampling error, the random variation that comes from observing a sample rather than the whole population.)
For example, if you polled a group of people at McDonald's about their favorite foods, you'd probably get a good number of people saying hamburgers. If you polled the people at a vegan restaurant, you'd be unlikely to get the same results, so if your conclusion from the first poll is that most people's favorite food is hamburgers, you're relying on a biased sample.
It's important to remember that statistical significance is not necessarily a guarantee that something is objectively true. Statistical significance can be strong or weak, and researchers can factor in bias or variances to figure out how valid the conclusion is. Any rigorous study will have numerous phases of testing—one person chewing rocks and not getting cancer is not a rigorous study.
Essentially, statistical significance tells you that your hypothesis has basis and is worth studying further. For example, say you have a suspicion that a quarter might be weighted unevenly. If you flip it 100 times and get 75 heads and 25 tails, that might suggest that the coin is rigged. For a fair coin, a result that lopsided is vanishingly unlikely, with a probability far below a 0.05 significance level, so the result is statistically significant.
Because each coin flip has a 50/50 chance of being heads or tails, these results would tell you to look deeper into it, not that your coin is definitely rigged to flip heads over tails. The results are statistically significant in that there is a clear tendency to flip heads over tails, but that itself is not an indication that the coin is flawed.
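You can check this with an exact calculation. The sketch below computes the chance of getting 75 or more heads in 100 flips of a fair coin, which is the one-tailed p-value for our observed result:

```python
from math import comb

n, observed_heads = 100, 75

# P(X >= 75) for X ~ Binomial(100, 0.5): sum the exact probabilities
p_value = sum(comb(n, k) for k in range(observed_heads, n + 1)) / 2 ** n

print(p_value < 0.05)  # True: far below any common alpha
```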
What Is Statistical Significance Used For?
Statistical significance is important in a variety of fields—any time you need to test whether something is effective, statistical significance plays a role.
This can be very simple, like determining whether the dice produced for a tabletop role-playing game are well-balanced, or it can be very complex, like determining whether a new medicine that sometimes causes an unpleasant side effect is still worth releasing.
Statistical significance is also frequently used in business to determine whether one thing is more effective than another. This is called A/B testing—two variants, one A and one B, are tested to see which is more successful.
In school, you're most likely to learn about statistical significance in a science or statistics context, but it can be applied in a great number of fields. Any time you need to determine whether something is demonstrably true or just up to chance, you can use statistical significance!
How to Calculate Statistical Significance
Calculating statistical significance is complex—most people use calculators rather than try to solve equations by hand. Z-test calculators and t-test calculators are two ways you can drastically slim down the amount of work you have to do.
However, learning how to calculate statistical significance by hand is a great way to ensure you really understand how each piece works. Let's go through the process step by step!
Step 1: Set a Null Hypothesis
To set up calculating statistical significance, first designate your null hypothesis, or H0. Your null hypothesis should state that there is no difference between your data sets.
For example, let's say we're testing the effectiveness of a fertilizer by taking a group of 20 plants and treating half of them with fertilizer. Our null hypothesis will be something like, "This fertilizer will have no effect on the plants' growth."
Step 2: Set an Alternative Hypothesis
Next, you need an alternative hypothesis, Ha. Your alternative hypothesis is generally the opposite of your null hypothesis, so in this case it would be something like, "This fertilizer will cause the plants that are treated with it to grow faster."
Step 3: Determine Your Alpha
Third, you'll want to set the significance level, also known as alpha, or α. The alpha is the probability of rejecting a null hypothesis when that hypothesis is true. In the case of our fertilizer example, the alpha is the probability of concluding that the fertilizer does make plants treated with it grow more when the fertilizer does not actually have an effect.
An alpha of 0.05, or 5 percent, is standard, but if you're running a particularly sensitive experiment, such as testing a medicine or building an airplane, 0.01 may be more appropriate. For our fertilizer experiment, a 0.05 alpha is fine.
Your confidence level is $(1 - α) \times 100\%$, so if your alpha is 0.05, that makes your confidence level 95%. Again, your alpha can be changed depending on the sensitivity of the experiment, but most will use 0.05.
Step 4: One- or Two-Tailed Test
Fourth, you'll need to decide whether a one- or two-tailed test is more appropriate. One-tailed tests examine the relationship between two things in one direction, such as if the fertilizer makes the plant grow. A two-tailed test measures in two directions, such as if the fertilizer makes the plant grow or shrink.
Since in our example we don't want to know if the plant shrinks, we'd choose a one-tailed test. But if we were testing something more complex, like whether a particular ad placement made customers more likely to click on it or less likely to click on it, a two-tailed test would be more appropriate.
A two-tailed test is also appropriate if you're not sure which direction the results will go, just that you think there will be an effect. For example, if you wanted to test whether or not adding salt to boiling water while making pasta made a difference to taste, but weren't sure if it would have a positive or negative effect, you'd probably want to go with a two-tailed test.
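The difference shows up directly in how the p-value is computed. Here's a sketch using a normal (z) test statistic with a hypothetical value of 2.0: the one-tailed p-value counts only one direction of deviation, while the two-tailed p-value counts both.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1 + erf(z / sqrt(2)))

z = 2.0  # hypothetical test statistic

p_one_tailed = 1 - normal_cdf(z)             # "did the plants grow?"
p_two_tailed = 2 * (1 - normal_cdf(abs(z)))  # "did they grow or shrink?"

print(round(p_one_tailed, 4), round(p_two_tailed, 4))  # 0.0228 0.0455
```

The same test statistic produces a p-value twice as large under a two-tailed test, which is why the choice matters before you run the experiment.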
Step 5: Sample Size
Next, determine your sample size. To do so, you'll conduct a power analysis, which relates your sample size to the probability of detecting a real effect if one exists.
Statistical power tells us the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true. A higher statistical power lowers our probability of getting a false negative for our experiment. In the case of our fertilizer experiment, a higher statistical power means that we will be less likely to conclude that there is no effect from fertilizer when there is, in fact, an effect.
A power analysis consists of four major pieces:
- The effect size, which tells us the magnitude of a result within the population
- The sample size, which tells us how many observations we have within the sample
- The significance level, which is our alpha
- The statistical power, which is the probability that we correctly reject the null hypothesis when the alternative hypothesis is true
Many experiments are run with a typical power of 80 percent (equivalently, a Type II error rate, β, of 20 percent). Because these calculations are complex, it's not recommended to try to calculate them by hand; instead, most people will use an online power calculator to figure out their sample size.
Conducting a power analysis lets you know how big of a sample size you'll need to determine statistical significance. If you only test on a handful of samples, you may end up with a result that's inaccurate—it may give you a false positive or a false negative. Doing an accurate power analysis helps ensure that your results are legitimate.
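As a rough illustration of what a power analysis produces, here's a normal-approximation formula for the sample size per group in a two-sample comparison. The effect size of 0.5 is a made-up "medium" value; a real study would estimate it from pilot data.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sample, two-tailed test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(sample_size_per_group(0.5))  # 63 per group
```

Notice how large that is compared with the 10 plants per group in our fertilizer example; undersized studies are a common reason real effects go undetected.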
Step 6: Find Standard Deviation
Sixth, you'll be calculating the standard deviation, $s$ (the population standard deviation is written $σ$). This is where the formula gets particularly complex, as this tells you how spread out your data is. The formula for the standard deviation of a sample is: $$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$$ where $x_i$ is each individual observation, $\bar{x}$ is the sample mean, and $n$ is the sample size.
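That formula translates directly into code. Here's a sketch with made-up plant-growth measurements, checked against Python's built-in `statistics.stdev`:

```python
from math import sqrt
from statistics import stdev

# Hypothetical growth measurements (cm) for ten plants
growth = [4.1, 5.3, 3.8, 4.9, 5.0, 4.4, 4.7, 5.2, 4.0, 4.6]

n = len(growth)
sample_mean = sum(growth) / n
# Sum the squared deviations from the mean, divide by n - 1, take the root
s = sqrt(sum((x - sample_mean) ** 2 for x in growth) / (n - 1))

assert abs(s - stdev(growth)) < 1e-12  # matches the library implementation
print(round(s, 3))
```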