How to evaluate A/B test results correctly
Why probability alone is not enough
Bayesian probability answers a narrow question: Which variant is more likely to be better, given the data so far?
It does not answer:
- Whether the effect is real
- Whether the effect is meaningful
- Whether the result is stable
- Whether it is safe to act
As shown by A/A tests, high probabilities can occur due to random noise alone. For this reason, probability must never be used in isolation to make decisions.
Use multi-dimensional decision rules
Safe decisions require multiple independent gates. A result must pass all of them.
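In code, this amounts to a conjunction: one boolean per gate, and a ship decision only when all of them are true. A minimal sketch (the gate names are illustrative; the sections below define each gate):

```python
from dataclasses import dataclass

@dataclass
class GateResults:
    # One boolean per independent gate; each is defined in the sections below.
    probability_met: bool    # Bayesian probability above the risk-tier threshold
    effect_meaningful: bool  # practical-significance thresholds met
    stable: bool             # probability held above threshold for N consecutive days
    guardrails_ok: bool      # SRM, minimum-data, and business-cycle checks passed

def ship(gates: GateResults) -> bool:
    """A result ships only if every gate passes; a single failure blocks it."""
    return all([gates.probability_met, gates.effect_meaningful,
                gates.stable, gates.guardrails_ok])
```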
Probability thresholds by risk level
Use stricter thresholds as business risk increases; a configuration sketch follows the tiers below.
Low-risk changes
(e.g. button text, copy tweaks, minor UI changes)
- Probability ≥ 95%
- ≥ 300 conversions
- Runtime ≥ 7 days
Medium-risk changes
(e.g. layout changes, CTA placement)
- Probability ≥ 97.5%
- ≥ 500 conversions
- Runtime ≥ 10–14 days
High-risk changes
(e.g. pricing, checkout flow, targeting logic)
- Probability ≥ 99%
- ≥ 1,000 conversions
- Runtime ≥ 14–21 days
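These tiers map directly onto a lookup table. A minimal sketch using the thresholds above (structure and names are illustrative; runtime ranges use their lower bound as the hard minimum):

```python
# Decision thresholds per risk tier, mirroring the values above.
RISK_TIERS = {
    "low":    {"probability": 0.95,  "min_conversions": 300,  "min_runtime_days": 7},
    "medium": {"probability": 0.975, "min_conversions": 500,  "min_runtime_days": 10},
    "high":   {"probability": 0.99,  "min_conversions": 1000, "min_runtime_days": 14},
}

def meets_tier(risk: str, probability: float,
               conversions: int, runtime_days: int) -> bool:
    """Check a result against every threshold of its risk tier."""
    tier = RISK_TIERS[risk]
    return (probability >= tier["probability"]
            and conversions >= tier["min_conversions"]
            and runtime_days >= tier["min_runtime_days"])
```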
Require practical significance
Given enough data, Bayesian probability can reach high values on very small differences. Without constraints, you can “win” on effects that have no business value.
Always require a minimum effect size.
Example for click or conversion goals:
Require both:
- Absolute uplift ≥ +0.5 percentage points (by how many points does the conversion rate change?)
- Relative uplift ≥ +5–10% (how much better is the variant relative to the baseline?)
If the effect does not meet these thresholds, treat it as noise—even if probability is high.
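A minimal sketch of this check, assuming conversion rates expressed as fractions (so 0.005 equals 0.5 percentage points); names and defaults are illustrative:

```python
def practically_significant(baseline_rate: float, variant_rate: float,
                            min_abs_uplift: float = 0.005,
                            min_rel_uplift: float = 0.05) -> bool:
    """Require both a minimum absolute and a minimum relative uplift.

    Defaults match the lower bounds above: +0.5 percentage points
    absolute and +5% relative.
    """
    abs_uplift = variant_rate - baseline_rate  # 0.005 == 0.5 percentage points
    rel_uplift = abs_uplift / baseline_rate    # relative to the baseline
    return abs_uplift >= min_abs_uplift and rel_uplift >= min_rel_uplift

# Example: +0.3 pp is a +7.5% relative lift, but fails the absolute bound.
practically_significant(0.040, 0.043)  # -> False
```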
Enforce stability over time
Single-day spikes are unreliable. Real effects persist.
Add a stability rule:
Probability must remain above the threshold for N consecutive days
Suggested values:
- Low risk: 2–3 days
- Medium risk: 3–5 days
- High risk: 5–7 days
If probability drops below the threshold, the clock resets.
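A minimal sketch of the stability rule, assuming one posterior-probability reading per day (names are illustrative):

```python
def stable_for(daily_probabilities: list[float],
               threshold: float, required_days: int) -> bool:
    """True if the trailing streak of days at or above the threshold
    is long enough; any dip below the threshold resets the streak."""
    streak = 0
    for p in daily_probabilities:
        streak = streak + 1 if p >= threshold else 0  # dip -> reset to zero
    return streak >= required_days
```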
Guardrails: validity before confidence
Guardrails are not optimization targets. They are safety checks that determine whether a test result is valid at all.
If a guardrail is violated, the test is blocked or invalidated, regardless of probability.
Core guardrails to use
Traffic integrity guardrail
- Detect sample ratio mismatch (SRM)
- Significant allocation imbalance → invalidate the test
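A standard way to detect SRM is a chi-squared goodness-of-fit test on observed traffic per variant. A minimal sketch, assuming scipy, an intended 50/50 split, and a conventional 0.001 cutoff (the function name and defaults are illustrative):

```python
from scipy.stats import chisquare

def srm_detected(visitors_a: int, visitors_b: int,
                 expected_split: float = 0.5, alpha: float = 0.001) -> bool:
    """Chi-squared goodness-of-fit test for sample ratio mismatch.

    Compares observed traffic per variant against the intended allocation;
    a tiny p-value means the imbalance is too large to be chance, and the
    test should be invalidated.
    """
    total = visitors_a + visitors_b
    expected = [total * expected_split, total * (1 - expected_split)]
    _, p_value = chisquare([visitors_a, visitors_b], f_exp=expected)
    return p_value < alpha
```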
Minimum data guardrail
- Require a minimum number of conversions per variant
- Prevents early, noise-driven decisions
Runtime / business-cycle guardrail
- Test must span at least one full business cycle
- Avoids weekday and seasonality bias
Metric behavior guardrails
These catch failure modes that probability alone cannot:
- Baseline sanity: baseline conversion rate must stay within historical norms (e.g. ±10%)
- Directional consistency: variant beats baseline in ≥ 60–70% of daily snapshots
- Stability: probability ≥ threshold for N consecutive days
- Absolute lift: absolute uplift ≥ predefined minimum
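As one concrete example, the directional-consistency guardrail can be computed from daily snapshots. A minimal sketch, assuming equal-length lists of daily conversion rates and the 60% lower bound from above:

```python
def directionally_consistent(daily_baseline: list[float],
                             daily_variant: list[float],
                             min_share: float = 0.6) -> bool:
    """Share of daily snapshots in which the variant beats the baseline.

    Passes only if the variant wins on at least `min_share` of days
    (60% here, the lower bound of the guardrail above).
    """
    wins = sum(v > b for b, v in zip(daily_baseline, daily_variant))
    return wins / len(daily_baseline) >= min_share
```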
Summary
- Bayesian probability alone is insufficient
- False positives are unavoidable
- Safe decisions require discipline, not confidence
Bayesian testing is powerful—but only when paired with structured decision rules and guardrails.
For an explanation of why false positives occur in the first place, see False positives in A/A tests.