When a client comes to me and says "we want off Shopify", the conversation usually splits into two paths. The trivial path: their store is a few dozen SKUs, a few hundred orders, content marketing on top. Any of the off-the-shelf migration tools (Cart2Cart, LitExtension, the official S2W importer) will do an acceptable job, and the move ships in a couple of days.
The non-trivial path is the one this post is about. The client had:
- 75 active products with 144 variations and 391 product images
- 6,441 customers, going back several years
- 7,476 historical orders representing £406,695.36 in trading
- 10,117 line items across those orders
- 1,096 discount codes, some of them one-time-use codes already issued to past customers
- 45 blog posts under
/blogs/news(Shopify does not export blog content)
When the historical financials matter to a client (accounting reconciliation, customer support continuity, ad-platform attribution windows), "approximately correct" is not the standard. Every line had to land. So I wrote a custom pipeline. This is what I built and what I learned.
Why custom and not a plugin
The off-the-shelf importers are perfectly fine when none of the following are true:
- You need every order to reconcile to the penny against your Shopify finance reports.
- You have one-time-use discount codes already sent to customers that must still work post-cutover.
- Your blog content has internal links, embedded images, and SEO juice you cannot afford to lose.
- Your product option names exceed 28 characters (a real WooCommerce slug limit that bites people).
- You want a sandboxed dry run before any write hits the production database.
The third-party tools either silently lose information against these constraints or refuse to map data they cannot interpret. When the gap matters, you build a pipeline that matches your data shape exactly.
The architecture, in one diagram
Shopify CSVs + crawled HTML
|
v
[1] Extract Python normalisers per entity write JSON
|
v
[2] Transform Fuzzy match line items to product / variation IDs
|
v
[3] Load PHP loaders invoked through wp eval-file,
running inside the WP / WC context
|
v
[4] Validate Five validators compare source aggregates to DB state
|
v
[5] Fix loop Re-run failed records until green or three
stale iterations, then write manual_review.json
Five stages, executed in order, each writing its output to disk so any step can be re-run independently.
Two passes, one codebase
The first thing I did before writing any loader code was set up a local sim environment. Docker Compose, mariadb:11.4, wordpress:latest, a wp-cli sidecar. I pulled a mysqldump of the live Kinsta database and imported it into the local mariadb. I installed the same plugin set as live.
Then I wired the pipeline so that MIGRATION_TARGET=sim and MIGRATION_TARGET=live swap only the DB host and the wp-cli invocation. Nothing else.
Every bug I found surfaced in sim. The live pass ran the identical code with one env var flipped, and produced identical results. This is not novel, but it is non-negotiable when the live target has real revenue history attached.
The idempotent contract
Every loader respects the same rule. Each imported record gets a meta key that links it back to its Shopify source ID:
- Products get
_shopify_product_id - Variations get
_shopify_variation_id - Customers get
_shopify_customer_id(on the user record) - Orders get
_shopify_order_id(on the WC order) - Coupons get
_shopify_discount_id - Blog posts get
_shopify_blog_slug - Attachments get
_shopify_image_url
Before inserting, every loader checks for an existing record with the matching meta value. If found, it updates in place. If not, it inserts.
The practical effect: if the pipeline crashes halfway, you re-run it. If a single order failed validation, the fix loop re-runs only that order. Nothing duplicates, nothing strands. Re-running a clean pipeline on a clean DB produces zero changes, and that "zero changes" check is itself a green-light signal before flipping to live.
The technical calls that earned their keep
Bulk PHP loaders via wp eval-file
My first attempt used the WooCommerce REST API from Python. I measured it. Spawning 7,476 REST calls would have taken hours of HTTP overhead, plus I had to handle auth, retries, and rate limiting myself. I threw that branch away and rewrote the loaders as PHP scripts invoked through wp eval-file. Each loader runs once, stays in the WP / WC PHP memory the whole time, and gets WC's internal hooks and HPOS sync for free. End-to-end speedup: roughly 50x.
The loader runner in Python is a thin shim that does ssh ... wp eval-file ... and parses the output. That is the whole interface.
HPOS-aware order writes
WooCommerce's High-Performance Order Storage moves orders out of wp_posts into wp_wc_orders and a small set of related tables. The data is still there if you write raw SQL, but the public WC_Order API handles the dual-write to lookup tables and meta correctly. I went through WC_Order and $order->save(), never raw inserts to wp_wc_orders, even when the table was right there.
If HPOS is enabled and you write raw SQL to wp_wc_orders, you will silently lose data in the lookup tables and the order search index will lie to you for weeks. Use the API.
Local product attributes, not global taxonomies
Shopify lets product option names be anything. Free-form strings. We had option names that exceeded WooCommerce's 28-character slug limit, which breaks global product attribute taxonomies. WC's solution is "use a shorter name", which is not actionable on imported data.
The workaround: use local (per-product) attributes instead of global taxonomies. The downside is they are not aggregated across products, so a "filter by 500mg" facet across the catalog has to be rebuilt with a different mechanism. The upside is the import actually works. For this client, the trade was worth it.
Email suppression during order import
7,476 orders means 7,476 potential customer notifications, plus admin notifications, plus payment-method-specific emails. Sending those during a migration is a great way to get customers asking "why am I being told my order from 2024 is now Processing".
The cleanest approach I found is a one-line filter:
add_filter( 'pre_wp_mail', '__return_false' );
Drop this in the order loader, do the work, remove the filter at the end. WC's notification dispatch will fire normally but every individual mail attempt returns false at the lowest level.
Content-hash image dedup
Shopify's product images are referenced by URL. The same physical image might appear on multiple products with different URLs (variants, alt sizes). I did not want to upload the same JPEG 17 times.
The dedup key was _shopify_image_url on the WP attachment. Before downloading and uploading, check the meta. If the URL has been imported, attach the existing attachment ID to the new product. The 391 product images on disk represent 391 unique URLs, not 391 unique uploads.
Blog crawled, not imported
Shopify exports do not include blog content. There is no blogs.csv. I wrote a small crawler that walks /blogs/news, parses the article HTML, pulls out title / date / author / body / hero image, and writes it to JSON. The loader is the same shape as the others. 45 posts captured, including their featured images.
Fuzzy line-item matching
The thorniest data problem in any commerce migration is historical line items. Shopify stores line items as name + price + quantity + product_id where the product_id is a Shopify ID that does not exist in WC. Worse, the name field on a Shopify line item is whatever was on the product at order time, which may have been edited, renamed, or deleted since.
I used RapidFuzz token_set_ratio with a 90 threshold to match line item names to product / variation IDs. 64.9% auto-matched. The remaining 35.1% saved as name + price + qty line items with no product linkage, which WooCommerce supports natively. The customer-facing order pages and admin views all render correctly. Revenue reconciliation works because line item totals come from the saved total field, not from the product price lookup.
For accounting purposes, an unlinked line item is fine. For analytics purposes ("what are my best-selling SKUs over the last three years"), it is a slight gap. I documented the gap.
The validation strategy
Five validators ran after every load. Each compared source CSV aggregates to DB state and emitted a diff_*.json report.
diff_products.json: source product count, names, prices vs WC statediff_customers.json: email address sets, billing country distributiondiff_orders.json: order count, status distribution, revenue totalsdiff_coupons.json: coupon code set, expiry datesdiff_blog.json: post count, slug set, featured-image presence
Any validator emitting a non-empty diff blocked the pipeline. The fix loop then re-ran the relevant loader for the missing or mismatched records, up to three iterations of zero progress before flagging for manual review.
The acceptance bar was zero diffs across all five validators, plus one final re-run of the pipeline producing zero changes (the idempotency check). Only then did I flip MIGRATION_TARGET=live.
The final numbers
The acceptance criteria were met across every entity:
- Products: 75 / 75 (100%)
- Variations: 144 / 144 (100%)
- Product images: 391 / 391 (100%)
- Customers: 6,441 / 6,441 (100%)
- Coupons: 1,096 / 1,097 (99.9%, 1 case collision, unavoidable)
- Orders: 7,476 / 7,476 (100%)
- Order line items: 10,117 / 10,117 (100%)
- Revenue: £406,695.36 / £406,695.36 (exact)
- Blog posts: 45 / 45 (100%)
The single coupon lost was a case-collision: Shopify is case-insensitive on coupon codes but allowed two coupons with codes differing only in case to exist simultaneously. WooCommerce normalises to lowercase. One had to win. I documented it; the client agreed.
What I would do again
Plan mode for the architecture spike. Before any code, get the full plan in front of the human. Mistakes at the architecture level cost days, not hours, to unwind.
Local Docker sim that clones live exactly. Every bug I shipped to sim, I caught in sim. None of them made it to live. The cost of the sim setup is roughly half a day; the cost of finding bugs on live is whatever the client's recovery patience allows, which is usually less than half a day.
Idempotent loaders. The number of times I re-ran a partially failed loader during development is in the dozens. Each re-run cost nothing because the loaders converge. The number of times I re-ran on live: twice. Both safe.
One PHP file per entity, invoked through wp eval-file. This is the architectural choice that made the timing work. Anyone proposing per-record REST calls for a migration of this size should be politely asked to measure first.
HPOS-aware code from day one. If your target site has HPOS enabled and you write to the order tables directly, you will create a bug that takes a week to find. Always go through WC_Order and $order->save().
What is next for this site
The migration was the technical lift. The remaining work is the operational lift: tax setup with the client's accountant, DNS cutover from the current Shopify pointer to the Kinsta IPs, transactional email through a proper SMTP provider, consent management for analytics, ownership handover of every account into the client's name. That stuff is not interesting engineering, but it is what makes the difference between "site moved" and "site live".
If you are about to migrate a real eCommerce store off (or onto) Shopify, the takeaway is small: pick the right tool for the data shape, build the safety net before you write the loaders, and reconcile every number before you flip the DNS. The rest is execution.
Tools used: Python 3.11, RapidFuzz, WordPress 6.9, WooCommerce 10.7 (HPOS enabled), MariaDB 11.4, Docker Compose, wp-cli, Kinsta hosting. The pipeline itself is roughly 2,000 lines of Python and 1,500 lines of PHP. The architectural decisions matter more than the line count.