Coordinating Multi-Shard Schema Migrations

A schema change that is a one-line ALTER TABLE on a monolith becomes a distributed coordination problem the moment the table lives in a sharded keyspace. The same logical migration must run on every shard’s primary, each with its own replication-lag profile, hardware, and failure domain — and the moment any shard begins serving the new column shape, in-flight queries can observe two different schemas at once. This page resolves one specific operational challenge: how to drive a single schema change across dozens of shards so that the row copy runs within safe backpressure limits and the traffic-visible cutover happens atomically across the whole keyspace, with a deterministic recovery path when one shard stalls. It sits under Online DDL orchestration and migration coordination, which defines the migration state machine and control surface this page builds on.

Prerequisites

Before coordinating a fleet-wide migration, confirm the following are in place:

Vitess 14+ (managed Online DDL with --postpone-completion and per-shard SHOW VITESS_MIGRATIONS reporting; 16+ recommended for the improved throttler and force-cutover semantics).
A vitess-strategy DDL path. The keyspace must submit through managed Online DDL rather than direct. If you are still deciding, weigh the trade-offs in Vitess native Online DDL vs external tools; the coordination model here assumes the vitess (VReplication) strategy, though it applies equally to gh-ost runs wrapped by the same orchestrator.
The lag throttler enabled on every keyspace so the row copy has a backpressure signal.
A persisted control-plane store (etcd, the topology server, or a control database) that survives orchestrator restarts — the coordination logic is only safe if per-shard state outlives the process driving it, as detailed in tracking migration progress and state machines.
Working knowledge of the routing layer. You should understand how the stateless VTGate routing layer fans a query out to shards, because the cutover switches what every VTGate sees at once.
vtctldclient and vtadmin access to the live cluster, plus a Python DB-API driver (PyMySQL or mysqlclient) pointed at VTGate for polling migration state over ordinary SQL.

How coordination works across shards

The core insight is that a multi-shard migration is not one operation — it is N independent per-shard migrations plus a global barrier that holds all of them at the edge of cutover until every one is ready. Vitess runs the row copy on each shard’s primary independently through a VReplication stream that writes into a shadow table and keeps it in sync by tailing the binary log. Left to their own devices, each shard would cut over the instant its copy finished, and the keyspace would spend an unbounded, uncontrolled window in a mixed-schema state. Coordination replaces that with a two-phase rhythm: let every shard reach a postponed-complete state at its own pace, then fire the cutover across the whole fleet in a tight sequence.

Three invariants make this safe, and every step below exists to enforce one of them:

Idempotency — re-submitting or resuming a migration must never start a second row copy on a shard that already has one.
Schema version pinning — every shard must converge on the identical target schema hash, so a partially applied change is detectable and reversible.
Cross-shard atomicity of cutover — the instant the application begins to see the new schema must be coordinated across shards, or an in-flight cross-shard transaction can read two different table shapes in a single logical statement.
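
The schema-pinning invariant can be sketched as a fingerprint comparison. This is only an illustration, assuming the orchestrator has already collected each shard's SHOW CREATE TABLE output. The helper names schema_fingerprint and divergent_shards, and the normalization rules, are hypothetical and not part of Vitess.

```python
import hashlib
import re
from collections import Counter


def schema_fingerprint(create_stmt: str) -> str:
    """Hash a CREATE TABLE statement after light normalization (assumed rules)."""
    # Drop the AUTO_INCREMENT counter, which legitimately differs per shard.
    stmt = re.sub(r"\bAUTO_INCREMENT=\d+\s*", "", create_stmt, flags=re.IGNORECASE)
    # Collapse whitespace so formatting differences do not change the hash.
    stmt = " ".join(stmt.split())
    return hashlib.sha256(stmt.encode("utf-8")).hexdigest()


def divergent_shards(ddl_by_shard: dict[str, str]) -> list[str]:
    """Return the shards whose fingerprint differs from the fleet majority."""
    prints = {shard: schema_fingerprint(ddl) for shard, ddl in ddl_by_shard.items()}
    if not prints:
        return []
    majority, _ = Counter(prints.values()).most_common(1)[0]
    return sorted(shard for shard, p in prints.items() if p != majority)
```

An empty result means every shard is pinned to the same definition. A non-empty result names the shards to investigate before the change is treated as converged.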

The coordination mechanism in detail

Two levers drive coordinated DDL: the postponement flag that decouples copy from cutover, and the concurrency budget that bounds how much of the fleet copies at once.

--postpone-completion is the load-bearing directive. Without it, each shard’s VReplication stream, once caught up, immediately performs the atomic table swap and the migration on that shard reports complete. With it, the shard reaches the copy-done state, keeps the shadow table in sync via ongoing binlog tailing, and parks in a running/postponed status waiting for an explicit completion signal. That parked state is the global barrier: the orchestrator holds every shard there, verifies the entire fleet is caught up and within lag thresholds, and only then issues the completion that triggers the synchronized cutover.

The concurrency budget governs the copy phase. Running the row copy on all shards simultaneously maximizes throughput but can saturate replica I/O across the fleet at once — every shard’s replicas fall behind together, and the lag throttler pauses everything. Running strictly one shard at a time is safe but can stretch a large migration across days. Most teams pick a middle policy: cap the number of shards in the active copy phase to a small N (commonly 2–4), let each finish and park at the barrier, then admit the next. The Vitess lag throttler operates underneath this budget — it watches replica lag on each shard and pauses that shard’s copy whenever lag exceeds --throttle-threshold, resuming automatically when replicas recover. The concurrency budget is coarse fleet-level admission control; the throttler is fine per-shard backpressure.
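
The admission policy described above can be sketched as a pure function. This is a rough sketch under stated assumptions: the shape of the states dictionary and the name next_wave are hypothetical, and it assumes a shard already parked at the barrier no longer counts against the budget because its bulk row copy has finished.

```python
ACTIVE = {"running"}   # shards that may be consuming copy I/O
WAITING = {"queued"}   # shards held back and eligible for admission


def next_wave(states: dict[str, dict], budget: int) -> list[str]:
    """Pick the shards to release so that at most `budget` shards copy at once.

    `states` maps shard -> {"status": str, "ready_to_complete": bool}.
    """
    copying = [
        shard for shard, st in states.items()
        if st["status"] in ACTIVE and not st["ready_to_complete"]
    ]
    free_slots = max(0, budget - len(copying))
    waiting = sorted(s for s, st in states.items() if st["status"] in WAITING)
    return waiting[:free_slots]
```

Calling next_wave on every poll and releasing only the shards it returns keeps the fleet-level budget in the orchestrator, while the throttler continues to handle per-shard backpressure.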

Because the heavy row copy is where lag and I/O pressure concentrate, teams commonly time it against per-region traffic troughs — see scheduling DDL windows across multiple timezones for aligning the copy with each keyspace’s quietest hours.

Step-by-step: running a coordinated migration

Each step below is independently verifiable — you can confirm its effect before proceeding.

1. Submit the migration with postponement. Submit through VTGate (ordinary SQL) or the control plane. The --postpone-completion flag guarantees no shard cuts over on its own.

-- Submitted over the VTGate MySQL protocol; Vitess fans this out to every shard
SET @@ddl_strategy = 'vitess --postpone-completion';
ALTER TABLE orders ADD COLUMN fulfilment_center_id BIGINT UNSIGNED NULL;

Equivalently through vtctldclient, which returns the migration_uuid you will track for the rest of the run:

vtctldclient ApplySchema \
  --ddl-strategy "vitess --postpone-completion" \
  --sql "ALTER TABLE orders ADD COLUMN fulfilment_center_id BIGINT UNSIGNED NULL" \
  commerce

Verify: SHOW VITESS_MIGRATIONS LIKE '<uuid>' returns one row per shard, all in queued or running.

2. Enforce the concurrency budget. The orchestrator, not Vitess, decides how many shards copy at once. A simple admission loop holds the number of shards in the running state at or below N, either by holding the remaining shards back at submission and releasing them in waves, or by submitting per shard. Note that the --allow-concurrent strategy flag is not a cap: it only permits this migration to run alongside other Online DDL migrations on a tablet instead of queueing behind them, so the budget itself has to live in the orchestrator.

# --allow-concurrent lets this migration run alongside other migrations on a tablet.
# It does not limit anything, so the orchestrator must still enforce the shard budget.
vtctldclient ApplySchema \
  --ddl-strategy "vitess --postpone-completion --allow-concurrent" \
  --sql "..." commerce

Verify: count rows with migration_status = 'running' in SHOW VITESS_MIGRATIONS — it should never exceed your budget.

3. Poll per-shard state until the barrier is reached. Treat VTGate as a normal MySQL server and poll SHOW VITESS_MIGRATIONS. The barrier is reached only when every shard reports the copy-done/postponed state and no shard has failed.

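import time
import pymysql

FAILED = {"failed", "cancelled"}

def shard_rows(conn, uuid):
    """Return one SHOW VITESS_MIGRATIONS row (as a dict) per shard."""
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SHOW VITESS_MIGRATIONS LIKE %s", (uuid,))
        return cur.fetchall()

def shard_states(conn, uuid):
    """Return {shard: migration_status} for one migration across every shard."""
    return {r["shard"]: r["migration_status"] for r in shard_rows(conn, uuid)}

def await_barrier(conn, uuid, poll=15):
    """Block until every shard is caught up and postponed at the cutover gate."""
    while True:
        rows = shard_rows(conn, uuid)
        states = {r["shard"]: r["migration_status"] for r in rows}
        if any(s in FAILED for s in states.values()):
            raise RuntimeError(f"migration {uuid} failed on a shard: {states}")
        # A postponed migration stays 'running'; ready_to_complete flips to 1
        # once the shadow table has caught up and only the cutover remains.
        if rows and all(r["migration_status"] == "running"
                        and int(r["ready_to_complete"] or 0) == 1 for r in rows):
            return states           # global barrier reached: safe to cut over
        time.sleep(poll)

# Credentials omitted; pass user/password as your VTGate requires
conn = pymysql.connect(host="vtgate.internal", port=15306, database="commerce")

Verify: the function returns a dict where every shard maps to running, and it returns only once every polled row had ready_to_complete = 1.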

4. Fire the coordinated cutover. Only after the barrier is reached, complete the migration on the keyspace. Vitess sequences the atomic table swaps across shards in a tight window, minimizing the mixed-schema exposure to the cutover duration rather than the whole copy duration.

# Trigger the synchronized cutover across all shards at once
vtctldclient OnlineDDL complete commerce <migration_uuid>

Verify: every shard advances from postponed to complete within seconds of each other in SHOW VITESS_MIGRATIONS.
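
The "within seconds of each other" check can be made explicit by measuring the spread of completion times. This is a small sketch that assumes the orchestrator has read a per-shard completion timestamp (for example from the completed_timestamp column of SHOW VITESS_MIGRATIONS) into datetime values. The function names and the window limit are hypothetical.

```python
from datetime import datetime, timedelta


def cutover_skew(completed_at: dict[str, datetime]) -> timedelta:
    """Return the gap between the first and the last shard to finish cutover."""
    if not completed_at:
        raise ValueError("no completion timestamps supplied")
    stamps = list(completed_at.values())
    return max(stamps) - min(stamps)


def within_window(completed_at: dict[str, datetime], limit: timedelta) -> bool:
    """True when every shard cut over inside the allowed mixed-schema window."""
    return cutover_skew(completed_at) <= limit
```

Recording the skew for each run gives a concrete measure of how long the keyspace actually spent in a mixed-schema state.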

5. Warm the new state before declaring victory. The freshly built table starts with a cold InnoDB buffer pool and the query optimizer must recompile plans against the new definition, so p99 latency tends to spike right after cutover. Replay a sampled, representative read workload through VTGate against the migrated table and wait for p99 latency to settle back to baseline before transitioning the migration to a terminal state in your control store.
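
One way to decide that p99 has settled is to require several consecutive measurement windows within a tolerance of the pre-migration baseline. The sketch below is illustrative only: the nearest-rank percentile, the 10% tolerance, and the three-window rule are assumptions, not values taken from Vitess.

```python
def p99(samples_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty latency sample."""
    if not samples_ms:
        raise ValueError("empty sample")
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100   # ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def settled(windows: list[list[float]], baseline_ms: float,
            tolerance: float = 1.10, consecutive: int = 3) -> bool:
    """True once the last `consecutive` windows all have p99 within tolerance."""
    if len(windows) < consecutive:
        return False
    limit = baseline_ms * tolerance
    return all(p99(w) <= limit for w in windows[-consecutive:])
```

Requiring consecutive windows avoids declaring the table warm on a single lucky sample.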

Configuration reference

Flag / setting	Type	Default	Recommended (production)
`--ddl-strategy`	string	`direct`	`vitess --postpone-completion` for coordinated multi-shard cutover
`--allow-concurrent` (strategy flag)	flag	off	enable to run non-conflicting migrations in parallel under a budget
`--singleton-context` (strategy flag)	flag	off	enable to reject a second concurrent migration on the same table
Orchestrator concurrency budget (shards in copy)	int	n/a	`2`–`4`; raise only if replicas are provisioned for fleet-wide copy I/O
`--throttle-threshold` (throttler lag)	duration	`1s`	`1s`–`5s`; the per-shard backpressure signal — do not disable
`--migration-check-interval` (VTTablet)	duration	`1m`	`10s`–`30s` for tighter per-shard progress polling
`--cutover-threshold`	duration	`10s`	keep low; a high value lengthens the write-lock window at swap
`--retain-online-ddl-tables`	duration	`24h`	`24h`–`72h` so the original table survives a same-day rollback

The misconfigurations that cause the most pain are predictable: submitting with direct on a sharded keyspace runs a blocking ALTER on every primary simultaneously (a fleet-wide stall); omitting --postpone-completion forfeits the barrier and produces a mixed-schema window bounded only by copy skew between shards; and setting --retain-online-ddl-tables too low lets the table garbage collector drop the original table before a rollback can re-swap it, turning a reversible change into a full re-copy. A column change also frequently needs a matching update to the VSchema routing contract — for example when a new column must be referenced by a lookup vindex — or routing stays correct for the old shape and breaks for queries assuming the new one.

Failure modes specific to multi-shard coordination

Copy skew stalls the barrier. Symptom: most shards sit postponed-ready for hours while one or two remain running with throttler metric mysql_lag above threshold. Root cause: the lag throttler is correctly pausing the slow shards’ copies because their replicas cannot keep up. Mitigation: let the throttler self-heal; if a shard never catches up, lower its copy concurrency or reschedule its copy into an off-peak window. Never force the cutover before the barrier — completing while a shard is still copying produces exactly the mixed-schema state the barrier exists to prevent.

Partial cutover across shards. Symptom: after OnlineDDL complete, some shards report complete and one reports failed; the keyspace is serving two schemas. Root cause: a shard-local failure (lock contention, primary failover) fired during the cutover fan-out. Mitigation: the orchestrator must treat the fan-out as all-or-nothing. On any shard’s cutover failure, immediately re-swap the shards that already cut over back to their retained original table, then re-queue the failed shard — only possible because --retain-online-ddl-tables kept the originals alive. Automate this as a compensating action, not a manual page.
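
The all-or-nothing rule can be sketched as a planner that turns per-shard status into compensating actions. This is a minimal sketch: plan_recovery and its return shape are hypothetical, and how the revert and re-queue are actually issued is deliberately left to the orchestrator.

```python
def plan_recovery(states: dict[str, str]) -> dict[str, list[str]]:
    """Decide compensating actions after a cutover fan-out.

    `states` maps shard -> migration_status. If any shard failed, every shard
    that already cut over is scheduled for revert and the failed shards are
    scheduled for re-queue. Otherwise both lists are empty.
    """
    failed = sorted(s for s, st in states.items() if st in {"failed", "cancelled"})
    done = sorted(s for s, st in states.items() if st == "complete")
    if not failed:
        return {"revert": [], "requeue": []}
    return {"revert": done, "requeue": failed}
```

Keeping the decision separate from its execution makes the plan easy to log, review, and re-derive after an orchestrator restart.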

Lock contention at the swap. Symptom: one shard’s cutover times out; metadata-lock waits spike in performance_schema.metadata_locks. Root cause: a long-running transaction or open cursor holds the table’s metadata lock, so the atomic RENAME cannot acquire it inside --cutover-threshold. Mitigation: kill or wait out the blocking transaction and retry the cutover on that shard. For external-tool runs this is the classic pattern detailed in resolving gh-ost lock contention in sharded MySQL.

Orphaned shadow tables after a cancel. Symptom: _vt-prefixed artifact tables accumulate and disk usage climbs on a subset of shards. Root cause: a cancelled or crashed migration left its shadow table behind on the shards it had reached. Mitigation: Vitess’s table garbage collector reclaims these on the retention schedule, but the orchestrator should also run an explicit vtctldclient OnlineDDL cleanup commerce <uuid> sweep and alert if the _vt table count on any shard exceeds a baseline.
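
The baseline alert can be sketched as a count over each shard's table list. This assumes the orchestrator has already listed table names per shard (for instance from information_schema). The function name, the baseline value, and the table names used with it are hypothetical.

```python
def orphan_alerts(tables_by_shard: dict[str, list[str]],
                  baseline: int, prefix: str = "_vt") -> dict[str, int]:
    """Return {shard: artifact_count} for shards whose count exceeds the baseline."""
    counts = {
        shard: sum(1 for name in tables if name.startswith(prefix))
        for shard, tables in tables_by_shard.items()
    }
    return {shard: n for shard, n in counts.items() if n > baseline}
```

An empty result means no shard is above the baseline. Anything else names the shards where a cleanup sweep is overdue.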

Every recovery path must be idempotent: read persisted per-shard state before acting, confirm which shards have cut over, and either roll the whole fleet forward or roll it entirely back to one consistent schema. A restarted orchestrator that re-reads state before issuing any action never double-submits or double-cleans.
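
The read-before-act rule can be sketched as a ledger keyed by migration, shard, and action. This is a simplified illustration: ActionLedger is a hypothetical in-memory stand-in for the persisted control-plane store, and a real implementation would write each record durably.

```python
class ActionLedger:
    """In-memory stand-in for a persisted control-plane store (assumption)."""

    def __init__(self):
        self._done = set()

    def run_once(self, uuid: str, shard: str, action: str, fn) -> bool:
        """Run fn() only if (uuid, shard, action) has not been recorded yet."""
        key = (uuid, shard, action)
        if key in self._done:
            return False      # already applied: a restart must not repeat it
        fn()
        self._done.add(key)   # record only after the action succeeded
        return True
```

Because the record is written after fn() returns, a crash between the two would replay the action on restart, so the action itself should also be safe to repeat.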

Verifying a coordinated cutover

Confirm the migration converged the whole keyspace on the new schema, not just a subset:

-- Every row should show the same migration_status and identical schema hash
SHOW VITESS_MIGRATIONS LIKE '<migration_uuid>';

A clean run shows one row per shard, all complete, all sharing the same ddl_action and target schema, with no shard left in running or cutover. Cross-check three signals: no shard reports failed or cancelled; the throttler lag metric on every shard is back under threshold; and a representative query fanned through VTGate resolves against the new column shape on every shard. For a scripted gate, assert that len(shard_states(conn, uuid)) == expected_shard_count and that every value equals complete before marking the migration terminal in your control store. Only when all three hold is the coordinated migration truly done.
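
The scripted gate described above might look like the following. It is a small sketch that takes the {shard: migration_status} mapping as input; the function name cutover_converged is hypothetical.

```python
def cutover_converged(states: dict[str, str], expected_shard_count: int) -> bool:
    """True only when every expected shard is present and reports complete."""
    if expected_shard_count <= 0:
        return False          # refuse to pass the gate on an empty expectation
    return (
        len(states) == expected_shard_count
        and all(status == "complete" for status in states.values())
    )
```

Checking the shard count as well as the statuses guards against a shard missing from the result set being silently read as success.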

Related

Vitess Native Online DDL vs External Tools: choosing the engine that drives each shard’s copy and cutover.
Tracking Migration Progress and State Machines: the authoritative per-shard state model and how controllers survive restarts.
Resolving gh-ost Lock Contention in Sharded MySQL: diagnosing and clearing the metadata-lock stalls that break a shard’s cutover.
Scheduling DDL Windows Across Multiple Timezones: timing the heavy row copy against per-region traffic troughs.
