How to evaluate A/B test results correctly

Why probability alone is not enough

Bayesian probability answers a narrow question: Which variant is more likely to be better, given the data so far?

It does not answer:

  • Whether the effect is real
  • Whether the effect is meaningful
  • Whether the result is stable
  • Whether it is safe to act

As shown by A/A tests, high probabilities can occur due to random noise alone. For this reason, probability must never be used in isolation to make decisions.
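To make this concrete: for conversion goals, the probability is typically computed from Beta posteriors over each variant's conversion rate. The sketch below is illustrative rather than any specific tool's implementation; it assumes a uniform Beta(1, 1) prior, made-up counts, and NumPy, and shows both the calculation and how an A/A setup can still clear 95% purely by chance.

```python
import numpy as np

rng = np.random.default_rng(42)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=200_000):
    """Estimate P(rate_B > rate_A) from Beta(1, 1) posteriors via Monte Carlo."""
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, samples)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, samples)
    return (post_b > post_a).mean()

# Hypothetical counts: 3.0% vs 3.4% observed conversion rate.
print(prob_b_beats_a(conv_a=300, n_a=10_000, conv_b=340, n_b=10_000))  # around 0.95

# A/A check: both arms share the SAME true rate, yet noise alone
# occasionally pushes the "probability to be better" above 95%.
high = 0
for _ in range(200):
    a = rng.binomial(5_000, 0.03)
    b = rng.binomial(5_000, 0.03)
    if prob_b_beats_a(a, 5_000, b, 5_000, samples=20_000) >= 0.95:
        high += 1
print(f"A/A runs exceeding 95%: {high} of 200")
```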

Use multi-dimensional decision rules

Safe decisions require multiple independent gates. A result must pass all of them.
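In code terms, this is an AND across independent gates. The gate names below are illustrative placeholders for the checks described in the rest of this article.

```python
def safe_to_ship(gates: dict[str, bool]) -> bool:
    """Every gate must pass; a single failure blocks the decision."""
    return all(gates.values())

print(safe_to_ship({
    "probability_above_threshold": True,
    "minimum_conversions_reached": True,
    "minimum_runtime_elapsed": True,
    "practically_significant": True,
    "stable_over_time": False,      # one failing gate -> do not ship
    "guardrails_clean": True,
}))   # False
```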

Probability thresholds by risk level

Use stricter thresholds as business risk increases.

Low-risk changes

(e.g. button text, copy tweaks, minor UI changes)

  • Probability ≥ 95%
  • Conversions ≥ 300 per variant
  • Runtime ≥ 7 days

Medium-risk changes

(e.g. layout changes, CTA placement)

  • Probability ≥ 97.5%
  • Conversions ≥ 500 per variant
  • Runtime ≥ 10–14 days

High-risk changes

(e.g. pricing, checkout flow, targeting logic)

  • Probability ≥ 99%
  • Conversions ≥ 1,000 per variant
  • Runtime ≥ 14–21 days
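One way to keep these tiers consistent across tests is to encode them as configuration and evaluate every result against its tier. A minimal sketch, using the thresholds above and hypothetical field names:

```python
from dataclasses import dataclass

# Thresholds taken from the tiers above; names and structure are illustrative.
RISK_TIERS = {
    "low":    {"min_probability": 0.95,  "min_conversions": 300,  "min_runtime_days": 7},
    "medium": {"min_probability": 0.975, "min_conversions": 500,  "min_runtime_days": 10},
    "high":   {"min_probability": 0.99,  "min_conversions": 1000, "min_runtime_days": 14},
}

@dataclass
class TestResult:
    probability: float    # P(variant beats baseline)
    conversions: int      # conversions in the smaller variant
    runtime_days: int

def passes_tier(result: TestResult, risk: str) -> bool:
    tier = RISK_TIERS[risk]
    return (
        result.probability >= tier["min_probability"]
        and result.conversions >= tier["min_conversions"]
        and result.runtime_days >= tier["min_runtime_days"]
    )

print(passes_tier(TestResult(0.97, 420, 9), "medium"))  # False: fails all three medium-risk gates
```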

Require practical significance

Given enough data, Bayesian models assign high probability to even very small differences. Without constraints, you can “win” on effects that have no business value.

Always require a minimum effect size.

Example for click or conversion goals:

Require both:

  • Absolute uplift ≥ +0.5 percentage points
    (By how many points does the conversion rate change?)

  • Relative uplift ≥ +5–10%
    (How much better is the variation relative to the baseline?)

If the effect does not meet these thresholds, treat it as noise—even if probability is high.
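A sketch of that dual check, assuming the baseline and variant conversion rates are already measured and using the example thresholds above (+0.5 percentage points absolute, +5% relative):

```python
def practically_significant(rate_baseline: float, rate_variant: float,
                            min_abs_pp: float = 0.5, min_rel: float = 0.05) -> bool:
    """Require BOTH a minimum absolute and a minimum relative uplift."""
    abs_uplift_pp = (rate_variant - rate_baseline) * 100   # percentage points
    rel_uplift = (rate_variant - rate_baseline) / rate_baseline
    return abs_uplift_pp >= min_abs_pp and rel_uplift >= min_rel

# 3.0% -> 3.2%: +0.2 pp and +6.7% relative -> fails the absolute gate, treat as noise.
print(practically_significant(0.030, 0.032))   # False
# 3.0% -> 3.6%: +0.6 pp and +20% relative -> passes both gates.
print(practically_significant(0.030, 0.036))   # True
```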

Enforce stability over time

Single-day spikes are unreliable. Real effects persist.

Add a stability rule:

Probability must remain above the threshold for N consecutive days.

Suggested values:

  • Low risk: 2–3 days
  • Medium risk: 3–5 days
  • High risk: 5–7 days

If probability drops below the threshold, the clock resets.
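A minimal sketch of this rule: count consecutive days at or above the threshold and reset the streak on any dip. The daily probabilities below are made up.

```python
def is_stable(daily_probabilities: list[float], threshold: float, required_days: int) -> bool:
    """True only if the most recent `required_days` entries are all at or above `threshold`."""
    streak = 0
    for p in daily_probabilities:
        streak = streak + 1 if p >= threshold else 0   # dip below threshold -> clock resets
    return streak >= required_days

daily = [0.93, 0.96, 0.97, 0.94, 0.96, 0.97, 0.98]   # dip on day 4 resets the count
print(is_stable(daily, threshold=0.95, required_days=3))   # True: last 3 days ≥ 0.95
print(is_stable(daily, threshold=0.95, required_days=5))   # False: streak is only 3
```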

Guardrails: validity before confidence

Guardrails are not optimization targets. They are safety checks that determine whether a test result is valid at all.

If a guardrail is violated, the test is blocked or invalidated, regardless of probability.

Core guardrails to use

Traffic integrity guardrail

  • Detect sample ratio mismatch (SRM)
  • Significant allocation imbalance → invalidate the test
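A common way to detect SRM is a chi-square goodness-of-fit test of the observed allocation against the planned split. The sketch below assumes a 50/50 split and uses SciPy for the p-value; the very small alpha is a typical choice so that only clearly broken allocation triggers an invalidation.

```python
from scipy.stats import chisquare

def srm_detected(visitors_a: int, visitors_b: int,
                 planned_split=(0.5, 0.5), alpha: float = 0.001) -> bool:
    """Flag a sample ratio mismatch: the observed allocation deviates from the
    planned split by more than chance allows."""
    total = visitors_a + visitors_b
    expected = [total * planned_split[0], total * planned_split[1]]
    _, p_value = chisquare(f_obs=[visitors_a, visitors_b], f_exp=expected)
    return p_value < alpha

print(srm_detected(10_000, 10_150))   # False: small imbalance, consistent with chance
print(srm_detected(10_000, 11_000))   # True: invalidate the test
```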

Minimum data guardrail

  • Require a minimum number of conversions per variant
  • Prevents early, noise-driven decisions

Runtime / business-cycle guardrail

  • Test must span at least one full business cycle
  • Avoids weekday and seasonality bias

Metric behavior guardrails

These catch failure modes probability alone cannot:

  • Baseline sanity: Baseline conversion rate must stay within historical norms (e.g. ±10%)

  • Directional consistency: Variant beats baseline in ≥ 60–70% of daily snapshots

  • Stability: Probability ≥ threshold for N consecutive days

  • Absolute lift: Absolute uplift ≥ predefined minimum
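The stability and absolute-lift checks are sketched earlier; the other two operate on daily snapshots rather than on the final aggregate. A minimal sketch of baseline sanity and directional consistency, with made-up daily rates:

```python
def baseline_sane(daily_baseline_rates: list[float],
                  historical_rate: float, tolerance: float = 0.10) -> bool:
    """Baseline conversion rate stays within ±tolerance of its historical norm."""
    lo, hi = historical_rate * (1 - tolerance), historical_rate * (1 + tolerance)
    return all(lo <= r <= hi for r in daily_baseline_rates)

def directionally_consistent(daily_baseline_rates: list[float],
                             daily_variant_rates: list[float],
                             min_share: float = 0.6) -> bool:
    """Variant beats baseline in at least `min_share` of daily snapshots."""
    wins = sum(v > b for b, v in zip(daily_baseline_rates, daily_variant_rates))
    return wins / len(daily_baseline_rates) >= min_share

baseline = [0.030, 0.031, 0.029, 0.030, 0.032, 0.031, 0.030]
variant  = [0.034, 0.030, 0.033, 0.035, 0.033, 0.034, 0.036]
print(baseline_sane(baseline, historical_rate=0.030))   # True: all days within ±10% of 3.0%
print(directionally_consistent(baseline, variant))      # True: variant wins 6 of 7 days
```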

Summary

  • Bayesian probability alone is insufficient
  • False positives are unavoidable
  • Safe decisions require discipline, not confidence

Bayesian testing is powerful—but only when paired with structured decision rules and guardrails.

For an explanation of why false positives occur in the first place, see False positives in A/A tests.