How to Calculate A/B Test Statistical Significance
Determine whether your A/B test results are statistically significant with our free A/B Test Calculator. Enter visitors and conversions for each variant to get the p-value and confidence level.
Steps
Enter control (A) data
Enter the number of visitors (or impressions) and the number of conversions for your control variant — the original, unchanged version. Conversions can be any goal event: purchases, sign-ups, clicks, or form completions.
Enter variant (B) data
Enter the same metrics for your test variant — the new version with the change you are testing. Both variants must run simultaneously (not sequentially) to avoid time-based confounding.
View conversion rates
The calculator shows the conversion rate for each variant (conversions / visitors × 100%) and the relative uplift (how much better or worse variant B performed as a percentage change from control).
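The arithmetic behind this step can be sketched as follows (the visitor and conversion counts below are made up for illustration, not real test data):

```python
# Conversion rate and relative uplift for two variants.
# All counts here are illustrative placeholders.

def conversion_rate(conversions, visitors):
    return conversions / visitors

def relative_uplift(rate_a, rate_b):
    # Percentage change of variant B relative to control A.
    return (rate_b - rate_a) / rate_a * 100

rate_a = conversion_rate(120, 4000)  # control A: 3.00%
rate_b = conversion_rate(150, 4000)  # variant B: 3.75%
print(f"A: {rate_a:.2%}, B: {rate_b:.2%}, uplift: {relative_uplift(rate_a, rate_b):+.1f}%")
```

Note that uplift is relative: going from 3.00% to 3.75% is a 0.75-percentage-point absolute gain but a 25% relative uplift.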
Check statistical significance
The p-value and confidence level are shown. A p-value below 0.05 (95% confidence) is the conventional threshold for declaring statistical significance: it means that if there were truly no difference between A and B, a difference at least this large would occur less than 5% of the time by chance.
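Calculators like this typically use a two-proportion z-test for the p-value; which exact method any given calculator uses is an assumption, but a minimal stdlib-only sketch looks like this (counts are the same illustrative numbers as above):

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    # Pooled two-proportion z-test; returns (z, two-sided p-value).
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = two_proportion_z_test(120, 4000, 150, 4000)
print(f"z = {z:.2f}, p = {p:.4f}, significant at 95%: {p < 0.05}")
```

Here a 3.00% vs 3.75% result on 4,000 visitors per variant gives p ≈ 0.06: a 25% relative uplift that still does not clear the 95% bar, which is exactly why sample size planning matters.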
Interpret the result and decide
If the result is significant AND the uplift is practically meaningful (consider your minimum detectable effect), implement the winning variant. If not significant, continue testing until you reach the required sample size, or abandon the hypothesis.
Understanding p-Values and Confidence Levels
A p-value is the probability of observing a difference as large as (or larger than) the one you observed, assuming the null hypothesis (no difference between A and B) is true. A p-value of 0.05 means there is a 5% chance of seeing this result if A and B are actually identical. The 95% confidence level is simply 1 - 0.05 = 95%: it means that if there were truly no difference, a test at this threshold would flag a false positive only 5% of the time. Common misconception: a 95% confidence level does NOT mean 'B is 95% better than A' or 'there is a 95% chance B will perform better in production'. It is a statement about how often the procedure mistakes chance fluctuations for real differences, not a statement about the size of the effect.
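One way to see what the 5% means in practice: simulate many A/A tests in which both 'variants' are identical, and about 5% of them will still come out 'significant' purely by chance. A sketch assuming a pooled two-proportion z-test (the rate, sample size, and trial count are arbitrary):

```python
import math
import random

def p_value(conv_a, n_a, conv_b, n_b):
    # Two-sided p-value from a pooled two-proportion z-test.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(7)
TRUE_RATE, N, TRIALS = 0.05, 1000, 500
false_positives = 0
for _ in range(TRIALS):
    # Both arms draw from the SAME true conversion rate.
    conv_a = sum(random.random() < TRUE_RATE for _ in range(N))
    conv_b = sum(random.random() < TRUE_RATE for _ in range(N))
    if p_value(conv_a, N, conv_b, N) < 0.05:
        false_positives += 1
print(f"False positive rate: {false_positives / TRIALS:.1%}")  # close to 5%
```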
Running Valid A/B Tests: Common Mistakes to Avoid
The most common A/B testing mistakes that invalidate results: running A and B sequentially rather than simultaneously (seasonal or day-of-week effects confound results); not splitting traffic randomly (if all mobile users see variant A and all desktop users see variant B, you are testing device type, not your change); changing the test while it is running (adding new traffic sources, or changing the page for unrelated reasons); testing multiple changes at once and attributing the result to one of them; stopping the test early the moment the result happens to look significant (peeking); and ignoring the novelty effect (users sometimes convert more with any change simply because it is new). If you run multiple A/B tests at the same time, keep them on different page elements and non-overlapping user segments so one test cannot contaminate another's results.
Frequently Asked Questions
Required sample size depends on: your current conversion rate, the minimum lift you want to be able to detect, your desired confidence level (typically 95%), and your desired statistical power (typically 80%). As a rough guide using the standard two-proportion formula: to detect a 10% relative lift on a 3% baseline conversion rate at 95% confidence and 80% power, you need roughly 53,000 visitors per variant. To detect a 20% lift on the same baseline, you need roughly 14,000 per variant. Use a sample size calculator before starting a test to know how long to run it.
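These rough figures come from the standard two-proportion sample size approximation, which can be sketched as follows (exact calculators may differ slightly in the approximation they use):

```python
import math

def sample_size_per_variant(base_rate, relative_lift):
    # Standard approximation for a two-sided two-proportion test at
    # 95% confidence and 80% power (z values hardcoded for those settings).
    z_alpha, z_beta = 1.96, 0.8416
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

print(sample_size_per_variant(0.03, 0.10))  # ~53,000 per variant for a 10% lift
print(sample_size_per_variant(0.03, 0.20))  # ~14,000 per variant for a 20% lift
```

The quadratic dependence on the lift is why halving the minimum detectable effect roughly quadruples the required traffic.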
Statistical significance tells you whether an observed difference is unlikely to be due to chance. Practical significance (or effect size) tells you whether the difference is large enough to matter to your business. With very large sample sizes, even tiny differences (0.01% conversion rate improvement) can be statistically significant but completely meaningless in practice. Always consider both: a result that is both statistically significant and practically meaningful (above your minimum business threshold) justifies implementing the change.
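To see the gap between the two, consider a hypothetical 0.01-percentage-point improvement (3.00% vs 3.01%) at very large scale, again assuming a pooled two-proportion z-test:

```python
import math

def p_value(conv_a, n_a, conv_b, n_b):
    # Two-sided p-value from a pooled two-proportion z-test.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 3.00% vs 3.01% on 50 million visitors per variant (hypothetical numbers).
p = p_value(1_500_000, 50_000_000, 1_505_000, 50_000_000)
print(f"p = {p:.4f}")  # statistically significant, practically negligible
```

The difference clears the 95% threshold comfortably, yet almost no business would act on a 0.01-point lift.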
Peeking means checking test results and making decisions before reaching the predetermined sample size. This inflates your false positive rate significantly — if you check results at multiple interim points and stop when you see a 'significant' result, your actual false positive rate can be much higher than 5% even if you use a 95% confidence threshold. Pre-commit to a sample size and test duration before launching, then look at results only once you have reached that sample size (or use a sequential testing method that accounts for early stopping).
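A simulation makes the inflation concrete: run A/A tests (no real difference), peek after every batch of traffic, and stop at the first 'significant' result. The pooled z-test, the 10 interim looks, and the rates below are all arbitrary assumptions; the point is only that the realized false positive rate climbs well above the nominal 5%:

```python
import math
import random

def p_value(conv_a, n_a, conv_b, n_b):
    # Two-sided p-value from a pooled two-proportion z-test.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(42)
RATE, LOOKS, BATCH, TRIALS = 0.05, 10, 200, 300
stopped_early = 0
for _ in range(TRIALS):
    conv_a = conv_b = n = 0
    for _ in range(LOOKS):
        # Another batch of identical traffic arrives in each arm.
        conv_a += sum(random.random() < RATE for _ in range(BATCH))
        conv_b += sum(random.random() < RATE for _ in range(BATCH))
        n += BATCH
        if p_value(conv_a, n, conv_b, n) < 0.05:  # peek, and stop if 'significant'
            stopped_early += 1
            break
print(f"False positive rate with peeking: {stopped_early / TRIALS:.1%}")
```

With repeated looks the chance of at least one spurious 'significant' reading compounds, which is why a pre-committed sample size (or a proper sequential method) is essential.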