Shopify A/B Testing: How to Run Experiments That Actually Tell You Something

Most Shopify A/B tests fail not because the ideas are wrong, but because the methodology is. Low sample sizes, stopping tests early when they look good, measuring the wrong metric - these produce confident-looking results that don't replicate. Here's how to run tests that actually tell you something.

The Shopify ecosystem sells A/B testing as easier than it is. Apps promise "no-code testing" and show dashboards full of percentage lifts. What they don't show is whether those lifts are statistically valid, whether the test ran long enough to account for weekly traffic patterns, or whether you've been measuring the metric that actually maps to revenue.

Bad A/B testing is worse than no A/B testing. It produces false confidence in changes that don't work and, in some cases, causes merchants to ship variants that actively harm conversion rate while the data said they were winners.

The statistics you need to understand before running any test

You don't need a statistics degree. You need to understand three concepts:

Statistical significance

A test result is statistically significant when its p-value falls below your threshold, typically 0.05 (usually reported as "95% confidence"). The p-value is the probability of seeing a difference at least this large if the variant truly made no difference. At a 0.05 threshold, roughly 1 in 20 tests of a change that does nothing will still come back looking like a winner.

Most A/B testing tools report this automatically. The mistake is interpreting "95% confidence" as "definitely a winner." It means a difference this large would be unusual if the variant had no real effect - not that the effect will persist at the same magnitude at scale.
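If you want to see what the tool is doing under the hood, here is a minimal sketch of a two-proportion z-test, one common way to get a p-value for a conversion rate difference. The function name and the visitor counts are made up for illustration, and your testing tool may use a different method (many use Bayesian engines instead).

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate: the best estimate of the shared rate if there is no real difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no conversions (or all conversions) in both arms: nothing to compare
    z = (rate_b - rate_a) / std_err
    # Chance of a gap at least this large, in either direction, under "no difference".
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 200 orders from 10,000 visitors vs 250 from 10,000.
p = two_proportion_p_value(200, 10_000, 250, 10_000)
print(f"p-value: {p:.4f}")
```

With these example counts the p-value comes out below 0.05, so the tool would call it significant. That still says nothing about whether the lift will hold at the same size next month.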

Sample size

The required sample size for a valid test depends on four inputs: your baseline conversion rate, the minimum effect you want to detect, your significance threshold, and the statistical power you want (conventionally 80%). For the same relative lift, a store converting at 2% needs more visitors per variant than a store converting at 5% does, because the absolute difference it has to detect is smaller.

Use a sample size calculator before starting any test. Evan Miller's calculator (evanmiller.org/ab-testing) is the standard. For a store converting at 2–3%, detecting a 15–20% relative lift in conversion rate at 95% confidence and 80% power takes roughly 14,000–37,000 visitors per variant. For smaller lifts, you need more. Many Shopify stores don't have sufficient traffic to run product page tests - the honest answer is that they should improve the page based on heuristic analysis rather than running underpowered tests that will produce noise.
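To get a feel for how the inputs interact, here is a rough sketch of the textbook sample size formula for comparing two conversion rates. It assumes a two-sided test and 80% power by default; the function name is made up, and online calculators use slightly different formulas, so expect the same ballpark rather than identical numbers.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variant(baseline: float, relative_lift: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed in EACH variant to detect the given relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold, two-sided
    z_beta = NormalDist().inv_cdf(power)           # desired chance of catching a real effect
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Same 20% relative lift, two different baseline conversion rates.
for baseline in (0.02, 0.05):
    needed = visitors_per_variant(baseline, 0.20)
    print(f"baseline {baseline:.0%}: about {needed:,} visitors per variant")
```

The lower baseline needs far more traffic for the same relative lift, which is why low-converting, low-traffic stores struggle to run valid tests.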

Minimum detectable effect

Before you run a test, decide: what's the smallest improvement that would be commercially meaningful? A 0.1% lift in conversion rate on a store doing $10k/month isn't worth the investment in testing infrastructure. A 15% lift is. Setting your minimum detectable effect before the test prevents you from fishing for significance after seeing the data - a practice called p-hacking that invalidates results.
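A quick back-of-the-envelope version of that decision, as a sketch. It assumes the lifts are relative (0.1% means the conversion rate goes from, say, 2.000% to 2.002%) and that traffic and average order value stay flat, so revenue scales directly with conversion rate. The function name is illustrative.

```python
def extra_monthly_revenue(monthly_revenue: float, relative_lift: float) -> float:
    """Extra revenue per month from a relative conversion rate lift,
    assuming traffic and average order value do not change."""
    return monthly_revenue * relative_lift


# The two lifts from the paragraph above, on a store doing $10k/month.
for lift in (0.001, 0.15):
    gain = extra_monthly_revenue(10_000, lift)
    print(f"{lift:.1%} relative lift -> ${gain:,.0f}/month")
```

Under those assumptions the first lift is worth about $10 a month and the second about $1,500. Compare that figure with what the testing tool and the implementation time cost before committing to the test.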

How to write a hypothesis

A good hypothesis has three parts:

"If [change], then [measurable outcome], because [reason based on user behaviour]."

Bad hypothesis: "If I change the button colour to orange, conversion rate will increase."
Good hypothesis: "If I make the ATC button full-width on mobile and increase its height to 52px, mobile add-to-cart rate will increase, because current tap target analysis shows the button is below minimum touch target size on 320px viewports and is being missed on first attempt."

The difference: the good hypothesis identifies a specific user behaviour problem, proposes a specific change, and predicts a specific metric. If the test shows no effect, you know the assumption about tap target size was wrong. If it shows a positive effect, you've confirmed the mechanism, not just the outcome.
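If you keep a backlog of test ideas, it can help to store each hypothesis as a small record so none of the three parts gets skipped. This is only a sketch; the class and field names are made up, and a spreadsheet with the same columns does the same job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    change: str              # the specific change being made
    primary_metric: str      # the one metric the test will be judged on
    expected_direction: str  # "increase" or "decrease"
    reason: str              # the user behaviour that motivates the change

    def is_complete(self) -> bool:
        """True only if every part of the hypothesis has been filled in."""
        parts = (self.change, self.primary_metric, self.expected_direction, self.reason)
        return all(part.strip() for part in parts)

    def statement(self) -> str:
        """Render the hypothesis in the "If ..., then ..., because ..." form."""
        return (f"If {self.change}, then {self.primary_metric} will "
                f"{self.expected_direction}, because {self.reason}.")


atc_test = Hypothesis(
    change="the ATC button is full-width and 52px tall on mobile",
    primary_metric="mobile add-to-cart rate",
    expected_direction="increase",
    reason="the current button is below minimum touch target size on 320px viewports",
)
print(atc_test.statement())
```

The button colour example fails `is_complete()` as soon as you try to fill in the reason, which is the point.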

What to test first: the hierarchy of impact

Test high-traffic pages with high-impact elements first. The testing hierarchy for most Shopify stores:

  1. Product pages - highest traffic, directly tied to add-to-cart. The buy box layout, ATC button design, social proof placement, and variant selector type are all high-leverage tests.
  2. Collection pages - product card layout, quick-add functionality, number of products per page, sort order.
  3. Homepage - hero message, CTA copy, above-fold content. Note: homepage traffic is mixed-intent and harder to draw conclusions from than product page traffic.
  4. Cart page / drawer - free shipping threshold messaging, trust signals, upsell placement.
  5. Checkout - limited on standard Shopify; full control on Plus via Checkout Extensibility.

Don't test low-traffic pages. A contact page test or a blog post test will rarely reach significance on a normal Shopify store's traffic volume.

Tools for Shopify A/B testing

Convert.com

The best option for most serious Shopify A/B testing programmes. Proper frequentist and Bayesian statistical engines, solid Shopify compatibility, good documentation, and a sensible pricing model. More setup than no-code tools but produces trustworthy results. Starts around $99/month.

VWO (Visual Website Optimizer)

Enterprise-grade. More powerful than Convert for complex multi-page experiments and personalisation, but significantly more expensive and complex to set up correctly. Appropriate for stores doing $5M+ revenue where the testing programme justifies the overhead.

AB Tasty

Mid-market option with a strong visual editor. Good Shopify compatibility. Pricing is quote-based and tends toward enterprise territory. Worth evaluating if you need strong personalisation alongside A/B testing.

Shopify-native "theme A/B testing"

Shopify allows you to publish multiple theme versions and split traffic between them. This is useful for full-theme comparisons - testing a redesigned theme against the current one - but not for granular element-level tests. The reporting is minimal; you'd need to use GA4 or a separate analytics tool to measure outcomes properly.

Google Optimize

Deprecated and shut down in September 2023. If you're reading a guide that recommends it, the guide is out of date.

A note on app-based "A/B testing": Several Shopify apps promise A/B testing functionality but implement it as a redirect or a cookie-based content swap that isn't properly isolated. These implementations can contaminate results through bot traffic, session stitching errors, and improperly excluded internal traffic. Use a purpose-built testing tool.

Test duration

Two rules:

Run for a minimum of two weeks. Traffic patterns vary by day of the week. A test that runs Monday through Thursday shows weekday behaviour, not your store's full traffic distribution. Two complete weekly cycles is the minimum needed to average out day-of-week effects.

Run for full business cycles, not until you hit significance. If your store has a known promotional cycle - a sale on the last weekend of the month, a weekly email send that spikes traffic on Thursdays - your test needs to run through at least one complete cycle of each. Stopping when the tool says "95% confident" on day 4 of a two-week test is the single most common A/B testing mistake.
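To see why stopping at the first "95% confident" reading is a problem, here is a small simulation sketch of A/A tests, where both arms are identical and any "winner" is a false positive by construction. The traffic numbers, the 5% conversion rate, and the function names are all assumptions chosen to keep the example fast.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-proportion z-test: True if the two-sided p-value is below alpha."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return False
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def simulate_aa_tests(runs=300, days=14, daily_visitors=200, rate=0.05, seed=1):
    """Return (false-positive rate if you stop on the first significant day,
    false-positive rate if you only read the result on the final day)."""
    rng = random.Random(seed)
    stopped_early = significant_at_end = 0
    for _ in range(runs):
        conv_a = conv_b = visitors = 0
        crossed = False
        for _day in range(days):
            visitors += daily_visitors
            # Both arms convert at the same true rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(daily_visitors))
            conv_b += sum(rng.random() < rate for _ in range(daily_visitors))
            if is_significant(conv_a, visitors, conv_b, visitors):
                crossed = True  # a daily peeker would have stopped here
        stopped_early += crossed
        # After the loop the counts are the full-duration totals.
        significant_at_end += is_significant(conv_a, visitors, conv_b, visitors)
    return stopped_early / runs, significant_at_end / runs


peeking, fixed = simulate_aa_tests()
print(f"false positives when peeking daily: {peeking:.0%}")
print(f"false positives at planned end:     {fixed:.0%}")
```

The second number should sit near the 5% you signed up for. The first should come out well above it, because every extra look at the dashboard is another chance for random noise to cross the line.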

What invalidates a test

  • Running a promotion during the test. A discount code, a flash sale, or a free shipping threshold change affects all variants simultaneously but may affect them differently. Pause the test or discard the results from the promotional period.
  • Making site changes during the test. Deploying a new theme update, adding a new app, or changing product pricing mid-test contaminates the results.
  • Seasonal traffic shifts. A test that crosses a major retail event (Black Friday, Valentine's Day) will have different traffic characteristics at the start and end. The intent and conversion behaviour of seasonal shoppers is different from baseline shoppers.
  • Insufficient traffic separation. If the same visitor sees both variants (because cookies are being cleared or they're switching devices), your sample is contaminated. Ensure your testing tool's visitor identification is properly configured.
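On that last point, one common way to keep a visitor in the same variant is to derive the assignment from a hash of a stable identifier instead of a random draw on each visit. The sketch below is illustrative only: the function name and IDs are made up, it is not how any particular testing tool works, and it only helps when the identifier itself survives (a logged-in customer ID does, a cleared cookie does not).

```python
import hashlib


def assign_variant(visitor_id: str, experiment: str,
                   variants: tuple[str, ...] = ("control", "variant")) -> str:
    """Deterministically map a visitor to a variant for a given experiment."""
    # Including the experiment name gives each test its own independent split.
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# The same visitor gets the same answer on every visit and every device,
# as long as the ID is the same.
first_visit = assign_variant("customer-1042", "atc-button-mobile")
return_visit = assign_variant("customer-1042", "atc-button-mobile")
print(first_visit, return_visit)
```

Because the assignment depends only on the ID and the experiment name, there is nothing to store and nothing to get out of sync between sessions.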

Reading results correctly

When a test completes:

  • Look at the primary metric first. If you were testing add-to-cart rate, look at add-to-cart rate. Not revenue per visitor, not time on page - the metric your hypothesis predicted.
  • Check secondary metrics for context. A variant with a higher add-to-cart rate but lower order completion rate may be attracting less committed cart additions. Revenue per visitor is the sanity check.
  • Consider practical significance alongside statistical significance. A 0.05% lift at 99% confidence isn't worth shipping if the implementation cost is high. A 20% lift at 90% confidence on a high-traffic page may be worth shipping and running a confirmatory test.
  • Don't declare losers without sufficient power. A variant that shows no statistically significant difference isn't necessarily equal to the control - it may mean you didn't have enough traffic to detect the difference. Check your power against your original sample size calculation.
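Two of these checks are easy to do by hand. The sketch below uses standard normal-approximation formulas to put a confidence interval around the observed difference and to estimate how much power a test had at a given traffic level. Function names and the example counts are made up, and the power figure is an approximation.

```python
from math import sqrt
from statistics import NormalDist


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Confidence interval for the absolute difference (variant minus control)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    std_err = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = rate_b - rate_a
    return diff - z * std_err, diff + z * std_err


def achieved_power(baseline, relative_lift, n_per_variant, alpha=0.05):
    """Approximate chance of detecting the given lift with this much traffic."""
    p1, p2 = baseline, baseline * (1 + relative_lift)
    std_err = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p2 - p1) / std_err - z_alpha)


# Hypothetical result: 200/10,000 (control) vs 250/10,000 (variant).
low, high = lift_confidence_interval(200, 10_000, 250, 10_000)
print(f"difference is plausibly between {low:.2%} and {high:.2%} (absolute)")

# Hypothetical "no difference" test that only reached 3,000 visitors per variant.
power = achieved_power(baseline=0.02, relative_lift=0.20, n_per_variant=3_000)
print(f"power to detect a 20% relative lift: {power:.0%}")
```

A wide interval tells you the true effect could be much smaller than the headline lift. A low power figure tells you a "no difference" result is closer to "no information" than to "no effect".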

Shipping the winner

The most underestimated part of a testing programme is the implementation pipeline. A test produces a winner - now what? If the winner was a CSS change, it needs to be implemented directly in the theme rather than living forever as a JS overlay via the testing tool. If it was a Liquid layout change, it needs a proper theme deployment.

Winning variants left running indefinitely through the testing tool are a performance problem. Each test adds JavaScript that executes on every page load, introduces the possibility of flickering (where the control briefly shows before the variant loads), and adds a dependency on a third-party script for something that should be permanent code. Build a process for promoting winners out of the testing tool and into the theme.

When to hire versus do it yourself

Running a basic A/B test on button copy or a single layout change is within reach of most Shopify store owners with a testing tool and a patient approach. Running a sustained programme - 2–4 tests per month, with proper hypothesis design, implementation for complex variants, statistical rigour, and a winner deployment pipeline - is a different proposition.

The value of a structured programme over ad-hoc testing is compound: each test informs the next, a backlog of prioritised hypotheses prevents testing arbitrary things, and proper tracking means you're building a knowledge base about your specific customers rather than just chasing individual lifts.

Filip Rastovic
Shopify Developer & CRO Specialist · Stargazer Studio
