Online DDL Orchestration & Migration Coordination in Sharded Vitess/MySQL Architectures

Schema evolution in a distributed relational database is one of the most operationally sensitive workflows a platform team owns: a single ALTER TABLE that would be routine on a monolith becomes a fan-out of long-running row copies, replica catch-up windows, and routing cutovers spread across every shard in a keyspace. This reference is written for MySQL SREs, Python orchestration builders, and distributed systems teams who need to execute those changes without downtime, without corrupting query routing, and without leaving orphaned artifacts behind when something stalls. It defines the core abstractions, walks the subsystems that make coordinated DDL safe, tabulates the tuning knobs that matter in production, and shows how automation engineers drive the whole loop through the vtctldclient and vtadmin control surface.

Online DDL orchestration resolves the tension between horizontal scaling and schema change by transforming an isolated ALTER into a distributed, stateful workflow. Instead of issuing DDL directly against a MySQL primary, you submit a migration to Vitess, which owns a per-shard job that copies rows into a shadow table, keeps it in sync via binary-log tailing, and atomically swaps it in during a brief cutover. The orchestration layer above that coordinates when each shard runs, enforces a global barrier before any traffic switches, and guarantees that a failure on shard 7 does not leave shards 0 through 6 on a new schema the application can no longer route to consistently.

The Python control loop drives the Vitess control plane, which dispatches a VReplication row copy to every shard; the routing cutover fires only after all shards clear the global barrier.

Architectural Decoupling & Execution Models

The foundation of distributed schema management is decoupling the logical schema definition from the physical execution path. In a sharded keyspace, each shard is an independent MySQL replica set — often on heterogeneous hardware, in different failure domains, with its own replication lag profile. Vitess hides routing complexity behind the stateless VTGate routing layer and the per-shard VTTablet that manages each mysqld, but schema propagation still demands deliberate choices about how the change executes on each instance.

The first decision is the execution engine, and comparing Vitess native Online DDL against external tools surfaces distinct trade-offs. The native path submits the migration through Vitess’s own scheduler, which drives either MySQL’s in-server ALGORITHM=INPLACE/INSTANT mechanics or the Vitess-managed vitess strategy, which performs a VReplication-based row copy. External toolchains such as gh-ost (trigger-less) and pt-online-schema-change (trigger-based) instead run their own copy loop and must be adapted to the topology so the orchestrator can observe their progress. The vitess strategy is the usual choice for sharded keyspaces because it reuses the same VReplication stream primitive that powers resharding, so a single mechanism handles both row movement and schema change. It is not the default, however: the default strategy is direct, so vitess has to be requested explicitly.

You request a strategy per migration with the DDL strategy directive:

-- Submit through VTGate; Vitess fans the migration out to every shard in the keyspace
SET @@ddl_strategy = 'vitess --postpone-completion';
ALTER TABLE orders ADD COLUMN fulfilment_center_id BIGINT UNSIGNED NULL;

# Or submit and track explicitly through the control plane
vtctldclient ApplySchema \
  --ddl-strategy "vitess --postpone-completion" \
  --sql "ALTER TABLE orders ADD COLUMN fulfilment_center_id BIGINT UNSIGNED NULL" \
  commerce

Regardless of engine, the architecture must enforce three invariants. Idempotency: re-submitting the same migration must not start a second row copy. Schema version pinning: every shard must converge on the exact same target schema hash, so a partially applied change is detectable and reversible. Cross-shard atomicity of the cutover: the moment when the application begins to see the new schema must be coordinated, or in-flight cross-shard transactions can observe two different table shapes at once. Understanding the storage-engine behaviour during a rebuild — which operations are truly in-place versus which force a full table copy — is essential for predicting lock contention and I/O, and the official MySQL Online DDL documentation is the authoritative reference for that matrix.
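One way to make the version-pinning invariant checkable is to fingerprint each shard's `SHOW CREATE TABLE` output and compare the results. The sketch below is illustrative only: `schema_fingerprint` and `divergent_shards` are hypothetical helper names, and the normalization (collapse whitespace, drop the table-level `AUTO_INCREMENT=` counter) is an assumption about what should count as the same schema.

```python
import hashlib
import re
from collections import Counter

def schema_fingerprint(create_table_sql: str) -> str:
    """Hash a CREATE TABLE statement after removing per-shard noise."""
    # The AUTO_INCREMENT counter differs between shards without the schema differing.
    sql = re.sub(r"\bAUTO_INCREMENT=\d+\s*", "", create_table_sql)
    sql = re.sub(r"\s+", " ", sql).strip()
    return hashlib.sha256(sql.encode("utf-8")).hexdigest()

def divergent_shards(per_shard_sql: dict[str, str]) -> list[str]:
    """Return the shards whose fingerprint differs from the most common one."""
    prints = {shard: schema_fingerprint(sql) for shard, sql in per_shard_sql.items()}
    if not prints:
        return []
    majority, _ = Counter(prints.values()).most_common(1)[0]
    return sorted(shard for shard, p in prints.items() if p != majority)
```

An empty result means every shard converged on one schema; a non-empty result names the shards to inspect before declaring the migration done.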

The Migration State Machine: Core Abstractions

Every coordinated migration is modelled as a state machine, and treating that machine as the authoritative control plane is what makes schema change observable, auditable, and reversible. A migration is not an event; it is a long-running entity with an identity (its migration_uuid), a target schema hash, and a current phase. Vitess exposes this directly — SHOW VITESS_MIGRATIONS returns one row per shard per migration, each carrying a status the orchestrator polls.

The canonical phases a production migration advances through are:

Phase	Meaning	Precondition to advance
`queued`	Accepted, waiting for a scheduler slot	Concurrency budget available on the shard
`ready`	Scheduled, about to start	Prior migration on the table cleared
`running`	Shadow table created, rows copying via `VReplication`	Copy throttled under lag threshold
`complete` (postponed)	Row copy done, kept in sync, awaiting cutover (Vitess itself reports this as `running` with `ready_to_complete = 1`)	Replica lag and binlog position aligned
`cutover`	Atomic table swap on the primary	Global barrier: all shards `complete`
`complete`	Cutover finished, traffic on new schema	—
`failed` / `cancelled`	Stalled or aborted; cleanup pending	Compensating cleanup must run

Because network partitions, mysqld restarts, and topology rebalancing are inevitable, each state transition must validate its own preconditions and every mutation must be strictly idempotent. The deep mechanics of persisting and advancing this machine — including how controllers survive restarts by reading persisted state before issuing resume, retry, or abort — are covered in tracking migration progress and state machines. The critical design property is that --postpone-completion decouples the row-copy phase from the cutover: every shard can independently reach complete (postponed) at its own pace, and the orchestrator only triggers the synchronized cutover once the slowest shard has caught up.
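The phase table can be encoded as an explicit transition map so that a controller rejects illegal jumps and treats a repeated call as a no-op. This is a minimal sketch, not Vitess's internal model: the `Phase` enum and `advance` function are hypothetical, the phase names are the ones used in the table above, and the re-queue edge from a failed or cancelled run is an assumption about how cleanup hands the run back.

```python
from enum import Enum

class Phase(str, Enum):
    QUEUED = "queued"
    READY = "ready"
    RUNNING = "running"
    POSTPONED = "complete (postponed)"
    CUTOVER = "cutover"
    COMPLETE = "complete"
    FAILED = "failed"
    CANCELLED = "cancelled"

# Forward edges taken from the phase table. A failed or cancelled run may be
# re-queued after cleanup (assumption for this sketch).
_FORWARD = {
    Phase.QUEUED: {Phase.READY},
    Phase.READY: {Phase.RUNNING},
    Phase.RUNNING: {Phase.POSTPONED},
    Phase.POSTPONED: {Phase.CUTOVER},
    Phase.CUTOVER: {Phase.COMPLETE},
    Phase.COMPLETE: set(),
    Phase.FAILED: {Phase.QUEUED},
    Phase.CANCELLED: {Phase.QUEUED},
}
_ABORTS = {Phase.FAILED, Phase.CANCELLED}

def advance(current: Phase, target: Phase) -> Phase:
    """Validate one transition; re-applying the current phase is a no-op."""
    if target == current:
        return current  # idempotent: a retried controller call changes nothing
    allowed = set(_FORWARD[current])
    if current not in _ABORTS and current is not Phase.COMPLETE:
        allowed |= _ABORTS  # any in-flight phase can fail or be cancelled
    if target not in allowed:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Keeping the map as data rather than scattered `if` statements makes it easy to audit against the table and to extend when a new phase is introduced.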

A shard progresses left to right; the accented global-barrier gate holds it at complete (postponed) until the whole fleet is caught up, and any failure diverts to an idempotent cleanup that can re-queue the run.

Concurrency Control & Multi-Shard Coordination

Coordinating a change across many shards introduces concurrency constraints that simply do not exist on a single node. Migrations must be serialized, parallelized, or phased according to shard topology, replication configuration, and traffic-routing policy. Running the row copy on all shards at once maximizes throughput but can saturate replica I/O simultaneously across the fleet; running strictly one shard at a time is safe but can stretch a large migration across days. The mechanics of picking and enforcing that policy are the subject of coordinating multi-shard schema migrations.

The coordinator maintains a global view of shard health, replication lag, and VTGate routing rules, and implements backpressure so the row copy never overwhelms replicas. Vitess’s own throttler is the primary control here: it watches replica lag and pauses the VReplication copy whenever lag exceeds a threshold, resuming automatically when replicas recover. The orchestrator layers a concurrency budget on top — a cap on how many shards may be in running at once.
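The concurrency budget can be sketched as a pure function that decides which queued shards may start next. Everything here is an assumption for illustration: `shards_to_start` is a hypothetical name, the lag figures are expected to come from whatever monitoring the orchestrator already has, and preferring the least-lagged shards first is one possible policy rather than anything Vitess prescribes.

```python
def shards_to_start(states: dict[str, str], lag_seconds: dict[str, float],
                    budget: int, max_lag: float = 1.0) -> list[str]:
    """Pick queued shards to start without exceeding the running budget."""
    running = sum(1 for status in states.values() if status == "running")
    free_slots = max(0, budget - running)

    def lag(shard: str) -> float:
        # A shard with no lag reading is treated as unhealthy and never picked.
        return lag_seconds.get(shard, float("inf"))

    # Skip shards already over the lag threshold, then prefer the healthiest.
    candidates = sorted(
        (shard for shard, status in states.items()
         if status == "queued" and lag(shard) <= max_lag),
        key=lambda shard: (lag(shard), shard),
    )
    return candidates[:free_slots]
```

Because the function only reads state and returns a decision, it can be called on every poll without side effects, which keeps the scheduling loop idempotent.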

End to end, the orchestration loop looks like the flow below: a declarative manifest is validated and pinned, each shard advances through the prepare → copy → cutover → cleanup phases, and traffic only switches once every shard clears a global barrier — otherwise the fallback chain reverts the change.

The full orchestration loop: a pinned manifest fans out to per-shard prepare, copy, cutover and cleanup; traffic switches only when every shard clears the barrier, otherwise the fallback chain reverts and hands control back to the orchestrator.

The barrier is the load-bearing concept. Without --postpone-completion, each shard cuts over the instant its copy finishes, so the keyspace spends an unbounded window in a mixed-schema state. With postponement, the orchestrator holds every shard at complete (postponed), verifies the whole fleet is caught up, then fires the cutover across all shards in a tight sequence:

# Fan the cutover out only after every shard reports complete + throttled-lag clear
vtctldclient OnlineDDL complete commerce <migration_uuid>

Operational Considerations: Tuning Knobs & Misconfigurations

Coordinated DDL is governed by a small set of flags whose defaults are tuned for safety, not speed. The table below lists the parameters that most affect the behaviour and duration of a migration, with production-oriented guidance.

Flag / setting	Type	Default	Recommended (production)
`--ddl-strategy`	string	`direct`	`vitess` for sharded keyspaces; add `--postpone-completion` for coordinated cutover
`--throttle-threshold` (throttler lag)	duration	`1s`	`1s`–`5s`; raise only if replicas are provisioned for it
`--migration-check-interval` (VTTablet)	duration	`1m`	`10s`–`30s` for tighter progress polling
`--retain-online-ddl-tables`	duration	`24h`	`24h`–`72h` so the artifact table survives a same-day rollback
`--cut-over-threshold` (strategy flag)	duration	`10s`	keep low; a high value lengthens the write-lock window at swap
`--singleton` / `--singleton-context` (strategy flag)	flag	off	enable to reject concurrent migrations on the same table
Throttler `--enable-lag-throttler`	bool	`true`	keep `true`; disabling removes the primary backpressure signal

The misconfigurations that cause the most production pain are predictable. Submitting with the direct strategy on a sharded keyspace bypasses managed Online DDL entirely and runs a blocking ALTER on every primary — a fleet-wide stall. Omitting --postpone-completion on a multi-shard change forfeits the global barrier and produces a mixed-schema window. Setting --retain-online-ddl-tables too low causes the garbage collector to drop the shadow/artifact tables before a same-day rollback can reuse them, turning a reversible change into a re-copy. And forgetting to update the VSchema routing contract after a column change — for instance adding a column that a lookup vindex needs to reference — leaves routing correct for the old shape but broken for queries that assume the new one.
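Several of these misconfigurations can be caught before submission with a small pre-flight check. The sketch below assumes the orchestrator already knows the keyspace's shard count and the configured retention; `lint_submission` is a hypothetical helper, and the 24-hour floor simply mirrors the guidance in the table above.

```python
import shlex
from datetime import timedelta

def lint_submission(ddl_strategy: str, shard_count: int,
                    retain_tables: timedelta) -> list[str]:
    """Return human-readable warnings for a migration about to be submitted."""
    tokens = shlex.split(ddl_strategy)
    # Assumption: an empty strategy string behaves like the default, direct.
    engine = tokens[0] if tokens else "direct"
    flags = set(tokens[1:])
    warnings = []
    if shard_count > 1 and engine == "direct":
        warnings.append("direct strategy on a sharded keyspace runs a blocking ALTER")
    if shard_count > 1 and "--postpone-completion" not in flags:
        warnings.append("no --postpone-completion: shards will cut over independently")
    if retain_tables < timedelta(hours=24):
        warnings.append("artifact retention under 24h may block a same-day rollback")
    return warnings
```

A CI gate could refuse any submission for which this list is non-empty. The VSchema routing check is deliberately left out because it depends on the specific vindex definitions.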

Failure Modes & Recovery Patterns

Distributed DDL has non-trivial failure surfaces, and resilient orchestration means every one of them has a pre-defined, automated recovery path rather than a pager escalation. The following are the named scenarios worth encoding into runbooks and orchestration logic.

Stalled row copy from replica lag. Symptom: migration stuck in running, throttler metric mysql_lag above threshold, copy throughput near zero. Root cause: the lag throttler is doing its job, because replicas cannot keep up with the copy's write volume. Mitigation: let the throttler self-heal; if lag never recovers, reduce copy concurrency or move the row copy to an off-peak window. Bounding the copy window this way is the reason teams plan DDL windows across multiple timezones, so the heavy copy runs when each region’s traffic is lowest.

Lock contention at cutover. Symptom: the final swap times out; metadata-lock waits spike in performance_schema. Root cause: a long-running transaction or open cursor holds the table’s metadata lock, so the atomic RENAME cannot acquire it within --cut-over-threshold. Mitigation: kill or wait out the blocking transaction, then retry the cutover; for external-tool migrations this is the classic contention pattern detailed in resolving gh-ost lock contention in sharded MySQL.

Partial cutover across shards. Symptom: some shards complete, others failed, keyspace serving mixed schemas. Root cause: a shard-local failure fired after the barrier released and the cutover fan-out began. Mitigation: the orchestrator must treat the barrier fan-out as all-or-nothing — on any shard’s cutover failure, immediately reverse the shards that already cut over by re-swapping the retained original table, then re-queue the failed shard. This is only possible because --retain-online-ddl-tables kept the old tables alive.

Orphaned shadow/artifact tables. Symptom: _vt-prefixed tables accumulate; disk usage climbs. Root cause: a cancelled or crashed migration left its shadow table behind. Mitigation: Vitess’s table garbage collector reclaims these on the retention schedule, but orchestration should also run an explicit OnlineDDL cleanup sweep and alert if _vt table count exceeds a baseline.

Every recovery path must be idempotent so the orchestrator can safely resume or roll back without creating a second copy or leaving inconsistent metadata. The recovery checklist for any stalled migration is: (1) read persisted per-shard state before acting; (2) confirm which shards, if any, have cut over; (3) if the barrier had not released, cancel and clean up all shards; (4) if it had, roll forward or roll every shard back to a single consistent schema; (5) verify with SHOW VITESS_MIGRATIONS that no shard is left in running or cutover. The Vitess managed Online DDL reference documents how the internal queue manages job lifecycles, which is the baseline these fallback routines integrate with.
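The checklist reduces to a decision over the persisted per-shard states. The following sketch is one hedged reading of steps 2 through 5: `recovery_action` and `unsettled_shards` are hypothetical names, the state strings are the phase names used in this document, and the returned labels are placeholders for whatever actions the orchestrator actually implements.

```python
def recovery_action(states: dict[str, str]) -> str:
    """Map persisted per-shard states to one fleet-wide recovery action."""
    cut_over = {shard for shard, st in states.items() if st in ("cutover", "complete")}
    broken = {shard for shard, st in states.items() if st in ("failed", "cancelled")}
    if not broken:
        return "none"                      # nothing stalled; keep polling
    if not cut_over:
        return "cancel_and_cleanup_all"    # barrier never released (step 3)
    return "roll_forward_or_back_all"      # mixed fleet must converge (step 4)

def unsettled_shards(states: dict[str, str]) -> list[str]:
    """Step 5: shards that must not be left behind in running or cutover."""
    return sorted(shard for shard, st in states.items() if st in ("running", "cutover"))
```

Both functions read state without mutating anything, so a restarted controller can call them repeatedly before deciding what to do.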

Post-Migration Stabilization & Cache Warming

A successful cutover does not guarantee immediate performance stability. Query optimizers must recompile execution plans against the new table definition, the InnoDB buffer pool for the freshly built table is cold, and application-layer caches may hold stale metadata or outdated query fingerprints. The result is a p99 latency spike in the minutes after cutover that looks like a regression but is just cold cache.

Orchestration pipelines close this gap by warming the new state before declaring victory: replaying a sampled, representative query workload through VTGate against the new schema, validating that routing plans resolve to single-shard execution where expected, and watching p99 until it settles back to baseline. Only then does the migration transition to a truly terminal complete. This stabilization window is short but load-bearing — skipping it converts a clean migration into a visible latency incident for downstream services.
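The "watch p99 until it settles" step needs a concrete exit condition. A minimal sketch, assuming the orchestrator already collects periodic p99 samples from its metrics system: `p99_settled` is a hypothetical helper, and the 10% tolerance and five-sample window are illustrative defaults, not recommendations from the Vitess documentation.

```python
def p99_settled(samples_ms: list[float], baseline_ms: float,
                tolerance: float = 0.10, window: int = 5) -> bool:
    """True once the last `window` p99 samples are all within tolerance of baseline."""
    if len(samples_ms) < window:
        return False  # not enough post-cutover data to judge yet
    ceiling = baseline_ms * (1.0 + tolerance)
    return all(sample <= ceiling for sample in samples_ms[-window:])
```

Requiring a full window of consecutive good samples avoids declaring the migration complete on a single lucky reading.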

Python Orchestration Integration

The audience for this layer is explicitly automation engineers, and the entire loop is drivable from Python. There are two integration surfaces: the gRPC/CLI control plane (vtctldclient, wrapped or shelled) and the standard MySQL protocol through VTGate (any Python DB-API driver such as PyMySQL or mysqlclient, since SHOW VITESS_MIGRATIONS and DDL submission are ordinary SQL over the VTGate connection). Most teams poll migration state over the SQL surface and reserve vtctldclient/vtadmin for lifecycle actions.

A minimal, idempotent poll-and-advance controller looks like this:

import time
import pymysql

TERMINAL = {"complete", "failed", "cancelled"}

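def shard_states(conn, uuid):
    """Return {shard: row} for one migration, one row per shard."""
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SHOW VITESS_MIGRATIONS LIKE %s", (uuid,))
        return {r["shard"]: r for r in cur.fetchall()}

def at_gate(row):
    """True when a postponed shard has finished its copy and is waiting for cutover."""
    # Assumption: Vitess reports a postponed migration as `running` with
    # ready_to_complete = 1, not as `complete`; verify against your Vitess version.
    return row["migration_status"] == "running" and int(row.get("ready_to_complete") or 0) == 1

def await_barrier(conn, uuid, poll=15):
    """Block until every shard is caught up and postponed at the cutover gate."""
    while True:
        rows = shard_states(conn, uuid)
        statuses = {shard: r["migration_status"] for shard, r in rows.items()}
        if any(s in ("failed", "cancelled") for s in statuses.values()):
            raise RuntimeError(f"migration {uuid} failed on a shard: {statuses}")
        if rows and all(at_gate(r) for r in rows.values()):
            return statuses         # global barrier reached, safe to cut over
        time.sleep(poll)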

# VTGate speaks the MySQL protocol; connect exactly like a normal MySQL server
conn = pymysql.connect(host="vtgate.internal", port=15306, database="commerce")
with conn.cursor() as cur:
    cur.execute("SET @@ddl_strategy = 'vitess --postpone-completion'")
    cur.execute("ALTER TABLE orders ADD COLUMN fulfilment_center_id BIGINT UNSIGNED NULL")
    # An Online DDL ALTER returns a one-row result set holding the migration uuid
    uuid = cur.fetchone()[0]

The controller wraps each control-plane call with exponential backoff and idempotent retries, and it always reads persisted state before issuing an action so a restarted controller never double-submits. Lifecycle transitions — completing the postponed cutover, cancelling, retrying, or cleaning up — are issued through the control surface:

vtctldclient OnlineDDL complete commerce <uuid>   # fire the coordinated cutover
vtctldclient OnlineDDL cancel   commerce <uuid>   # abort and clean up a stalled run
vtctldclient OnlineDDL retry    commerce <uuid>   # re-queue a failed shard
vtctldclient OnlineDDL cleanup  commerce <uuid>   # reclaim artifact tables early
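The exponential backoff mentioned above can be wrapped around any control-plane call. This is a sketch under stated assumptions: `with_backoff` is a hypothetical helper, the attempt count and delays are illustrative, and the injectable `sleep` parameter exists only so the function can be exercised without real waiting.

```python
import random
import time

def with_backoff(action, attempts=5, base=1.0, cap=30.0, sleep=time.sleep):
    """Call `action()` until it succeeds, sleeping with capped exponential backoff."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error to the caller
            delay = min(cap, base * (2 ** attempt))
            # Jitter keeps many controllers from retrying in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
```

A lifecycle command could then be issued as `with_backoff(lambda: subprocess.run(["vtctldclient", "OnlineDDL", "complete", "commerce", uuid], check=True))`, with whatever server-address flags your deployment requires. Retrying is only safe when the wrapped action is itself idempotent, which is why the controller reads persisted state first.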

For teams building dashboards or approval gates, vtadmin exposes the same migration inventory over an HTTP/gRPC API, which fits naturally behind a CI/CD pipeline that scores schema-change risk, enforces deployment windows, and records an auditable trail. Wiring pre-flight validation and mandatory post-cutover observability review into that pipeline is what turns schema evolution from a high-risk manual event into a repeatable, governed workflow that respects SRE error budgets.

Governance & Operational Maturity

As a sharded fleet grows, ad-hoc migration practice becomes untenable. Mature operations standardize on declarative migration manifests checked into version control, automated risk scoring (schema complexity, whether the change forces a full copy, backward-compatibility of the column change against live application code), and enforced deployment windows. These guardrails belong in the same CI/CD pipeline that ships application code, so that a schema change is peer-reviewed, dry-run-validated against the current topology, and gated on the same observability signals as any other production change. The payoff is that coordinated DDL stops being an event the on-call engineer dreads and becomes a predictable, automated capability the whole platform can rely on.
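Automated risk scoring can start as a few weighted checks over a declarative manifest. The sketch below is purely illustrative: the `MigrationManifest` fields, the weights, the row-count cutoff, and the `gate` threshold are all assumptions chosen to mirror the factors named above, and a real pipeline would calibrate them against its own incident history.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MigrationManifest:
    table_rows: int             # approximate row count of the target table
    forces_full_copy: bool      # True when the change cannot run in place
    backward_compatible: bool   # True when live application code tolerates the new shape
    in_deployment_window: bool  # True when submitted inside an approved window

def risk_score(m: MigrationManifest) -> int:
    """Sum illustrative penalty points; higher means riskier."""
    score = 0
    if m.forces_full_copy:
        score += 3 if m.table_rows > 100_000_000 else 1
    if not m.backward_compatible:
        score += 4
    if not m.in_deployment_window:
        score += 2
    return score

def gate(m: MigrationManifest, max_score: int = 3) -> bool:
    """True when the manifest may proceed without extra human approval."""
    return risk_score(m) <= max_score
```

Keeping the score a plain integer makes it easy to log alongside the migration uuid and to audit later.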

Related

Vitess Native Online DDL vs External Tools: choosing between the managed vitess strategy, gh-ost, and pt-online-schema-change.
Coordinating Multi-Shard Schema Migrations: serialization, phasing, and the global cutover barrier across shards.
Tracking Migration Progress and State Machines: the authoritative per-shard state model and how controllers survive restarts.
Resolving gh-ost Lock Contention in Sharded MySQL: diagnosing and clearing metadata-lock stalls at cutover.
Scheduling DDL Windows Across Multiple Timezones: timing the heavy row copy against per-region traffic troughs.
