How to evaluate A/B test results correctly

Why probability alone is not enough

Bayesian probability answers a narrow question: Which variant is more likely to be better, given the data so far?

It does not answer:

  • Whether the effect is real
  • Whether the effect is meaningful
  • Whether the result is stable
  • Whether it is safe to act

As shown by A/A tests, high probabilities can occur due to random noise alone. For this reason, probability must never be used in isolation to make decisions.
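To make this concrete: for conversion goals, the probability is typically computed from Beta posteriors over each variant's conversion rate. The sketch below is illustrative rather than any specific tool's implementation; it assumes a uniform Beta(1, 1) prior, made-up counts, and NumPy, and shows both the calculation and how an A/A setup can still clear 95% purely by chance.

```python
import numpy as np

rng = np.random.default_rng(42)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=200_000):
    """Estimate P(rate_B > rate_A) from Beta(1, 1) posteriors via Monte Carlo."""
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, samples)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, samples)
    return (post_b > post_a).mean()

# Hypothetical counts: 3.0% vs 3.4% observed conversion rate.
print(prob_b_beats_a(conv_a=300, n_a=10_000, conv_b=340, n_b=10_000))  # around 0.95

# A/A check: both arms share the SAME true rate, yet noise alone
# occasionally pushes the "probability to be better" above 95%.
high = 0
for _ in range(200):
    a = rng.binomial(5_000, 0.03)
    b = rng.binomial(5_000, 0.03)
    if prob_b_beats_a(a, 5_000, b, 5_000, samples=20_000) >= 0.95:
        high += 1
print(f"A/A runs exceeding 95%: {high} of 200")
```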

Use multi-dimensional decision rules

Safe decisions require multiple independent gates. A result must pass all of them.
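In code terms, this is an AND across independent gates. The gate names below are illustrative placeholders for the checks described in the rest of this article.

```python
def safe_to_ship(gates: dict[str, bool]) -> bool:
    """Every gate must pass; a single failure blocks the decision."""
    return all(gates.values())

print(safe_to_ship({
    "probability_above_threshold": True,
    "minimum_conversions_reached": True,
    "minimum_runtime_elapsed": True,
    "practically_significant": True,
    "stable_over_time": False,      # one failing gate -> do not ship
    "guardrails_clean": True,
}))   # False
```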

Probability thresholds by risk level

Use stricter thresholds as business risk increases.

Low-risk changes

(e.g. button text, copy tweaks, minor UI changes)

  • Probability ≥ 95%
  • Conversions ≥ 300 per variant
  • Runtime ≥ 7 days

Medium-risk changes

(e.g. layout changes, CTA placement)

  • Probability ≥ 97.5%
  • Conversions ≥ 500 per variant
  • Runtime ≥ 10–14 days

High-risk changes

(e.g. pricing, checkout flow, targeting logic)

  • Probability ≥ 99%
  • Conversions ≥ 1,000 per variant
  • Runtime ≥ 14–21 days
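One way to keep these tiers consistent across tests is to encode them as configuration and evaluate every result against its tier. A minimal sketch, using the thresholds above and hypothetical field names:

```python
from dataclasses import dataclass

# Thresholds taken from the tiers above; names and structure are illustrative.
RISK_TIERS = {
    "low":    {"min_probability": 0.95,  "min_conversions": 300,  "min_runtime_days": 7},
    "medium": {"min_probability": 0.975, "min_conversions": 500,  "min_runtime_days": 10},
    "high":   {"min_probability": 0.99,  "min_conversions": 1000, "min_runtime_days": 14},
}

@dataclass
class TestResult:
    probability: float    # P(variant beats baseline)
    conversions: int      # conversions in the smaller variant
    runtime_days: int

def passes_tier(result: TestResult, risk: str) -> bool:
    tier = RISK_TIERS[risk]
    return (
        result.probability >= tier["min_probability"]
        and result.conversions >= tier["min_conversions"]
        and result.runtime_days >= tier["min_runtime_days"]
    )

print(passes_tier(TestResult(0.97, 420, 9), "medium"))  # False: fails all three medium-risk gates
```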

Require practical significance

Given enough data, Bayesian models assign high probability to even very small differences. Without constraints, you can “win” on effects that have no business value.

Always require a minimum effect size.

Example for click or conversion goals:

Require both:

  • Absolute uplift ≥ +0.5 percentage points
    (By how many points does the conversion rate change?)

  • Relative uplift ≥ +5–10%
    (How much better is the variation relative to the baseline?)

If the effect does not meet these thresholds, treat it as noise—even if probability is high.
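A sketch of that dual check, assuming the baseline and variant conversion rates are already measured and using the example thresholds above (+0.5 percentage points absolute, +5% relative):

```python
def practically_significant(rate_baseline: float, rate_variant: float,
                            min_abs_pp: float = 0.5, min_rel: float = 0.05) -> bool:
    """Require BOTH a minimum absolute and a minimum relative uplift."""
    abs_uplift_pp = (rate_variant - rate_baseline) * 100   # percentage points
    rel_uplift = (rate_variant - rate_baseline) / rate_baseline
    return abs_uplift_pp >= min_abs_pp and rel_uplift >= min_rel

# 3.0% -> 3.2%: +0.2 pp and +6.7% relative -> fails the absolute gate, treat as noise.
print(practically_significant(0.030, 0.032))   # False
# 3.0% -> 3.6%: +0.6 pp and +20% relative -> passes both gates.
print(practically_significant(0.030, 0.036))   # True
```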

Enforce stability over time

Single-day spikes are unreliable. Real effects persist.

Add a stability rule:

Probability must remain above the threshold for N consecutive days.

Suggested values:

  • Low risk: 2–3 days
  • Medium risk: 3–5 days
  • High risk: 5–7 days

If probability drops below the threshold, the clock resets.
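A minimal sketch of this rule: count consecutive days at or above the threshold and reset the streak on any dip. The daily probabilities below are made up.

```python
def is_stable(daily_probabilities: list[float], threshold: float, required_days: int) -> bool:
    """True only if the most recent `required_days` entries are all at or above `threshold`."""
    streak = 0
    for p in daily_probabilities:
        streak = streak + 1 if p >= threshold else 0   # dip below threshold -> clock resets
    return streak >= required_days

daily = [0.93, 0.96, 0.97, 0.94, 0.96, 0.97, 0.98]   # dip on day 4 resets the count
print(is_stable(daily, threshold=0.95, required_days=3))   # True: last 3 days ≥ 0.95
print(is_stable(daily, threshold=0.95, required_days=5))   # False: streak is only 3
```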

Guardrails: validity before confidence

Guardrails are not optimization targets. They are safety checks that determine whether a test result is valid at all.

If a guardrail is violated, the test is blocked or invalidated, regardless of probability.

Core guardrails to use

Traffic integrity guardrail

  • Detect sample ratio mismatch (SRM)
  • Significant allocation imbalance → invalidate the test
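A common way to detect SRM is a chi-square goodness-of-fit test of the observed allocation against the planned split. The sketch below assumes a 50/50 split and uses SciPy for the p-value; the very small alpha is a typical choice so that only clearly broken allocation triggers an invalidation.

```python
from scipy.stats import chisquare

def srm_detected(visitors_a: int, visitors_b: int,
                 planned_split=(0.5, 0.5), alpha: float = 0.001) -> bool:
    """Flag a sample ratio mismatch: the observed allocation deviates from the
    planned split by more than chance allows."""
    total = visitors_a + visitors_b
    expected = [total * planned_split[0], total * planned_split[1]]
    _, p_value = chisquare(f_obs=[visitors_a, visitors_b], f_exp=expected)
    return p_value < alpha

print(srm_detected(10_000, 10_150))   # False: small imbalance, consistent with chance
print(srm_detected(10_000, 11_000))   # True: invalidate the test
```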

Minimum data guardrail

  • Require a minimum number of conversions per variant
  • Prevents early, noise-driven decisions

Runtime / business-cycle guardrail

  • Test must span at least one full business cycle
  • Avoids weekday and seasonality bias

Metric behavior guardrails

These catch failure modes probability alone cannot:

  • Baseline sanity: Baseline conversion rate must stay within historical norms (e.g. ±10%)

  • Directional consistency: Variant beats baseline in ≥ 60–70% of daily snapshots

  • Stability: Probability ≥ threshold for N consecutive days

  • Absolute lift: Absolute uplift ≥ predefined minimum
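The stability and absolute-lift checks are sketched earlier; the other two operate on daily snapshots rather than on the final aggregate. A minimal sketch of baseline sanity and directional consistency, with made-up daily rates:

```python
def baseline_sane(daily_baseline_rates: list[float],
                  historical_rate: float, tolerance: float = 0.10) -> bool:
    """Baseline conversion rate stays within ±tolerance of its historical norm."""
    lo, hi = historical_rate * (1 - tolerance), historical_rate * (1 + tolerance)
    return all(lo <= r <= hi for r in daily_baseline_rates)

def directionally_consistent(daily_baseline_rates: list[float],
                             daily_variant_rates: list[float],
                             min_share: float = 0.6) -> bool:
    """Variant beats baseline in at least `min_share` of daily snapshots."""
    wins = sum(v > b for b, v in zip(daily_baseline_rates, daily_variant_rates))
    return wins / len(daily_baseline_rates) >= min_share

baseline = [0.030, 0.031, 0.029, 0.030, 0.032, 0.031, 0.030]
variant  = [0.034, 0.030, 0.033, 0.035, 0.033, 0.034, 0.036]
print(baseline_sane(baseline, historical_rate=0.030))   # True: all days within ±10% of 3.0%
print(directionally_consistent(baseline, variant))      # True: variant wins 6 of 7 days
```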

Summary

  • Bayesian probability alone is insufficient
  • False positives are unavoidable
  • Safe decisions require discipline, not confidence

Bayesian testing is powerful—but only when paired with structured decision rules and guardrails.

For an explanation of why false positives occur in the first place, see False positives in A/A tests.