How to Deploy VSchema Changes Without Downtime

Applying a VSchema change to a live sharded keyspace is safe only when the physical schema, the routing document, and every proxy’s cached plans stay mutually consistent through an asynchronous, per-proxy reload. This page gives the sequence that keeps them consistent.

Where This Fits

A VSchema edit is not a database migration; it is a change to the routing contract that the VTGate routing layer compiles into a query plan for every statement. If you have not yet internalized the object hierarchy — keyspaces, primary vs. secondary vindexes, sequences, and routing rules — read Mastering VSchema Syntax and Structure first; this procedure assumes that grammar and focuses only on rolling a change out under production traffic. It sits inside VSchema Configuration & Routing Rule Management and leans on Online DDL orchestration whenever the routing change is paired with a physical schema change.

The audience is database platform engineers, MySQL SREs, and Python orchestration builders who own the deploy pipeline and cannot take a maintenance window to change how queries route.

Root Cause: Why a Naive Apply Drops Queries

vtctldclient ApplyVSchema does not push the new document to proxies. It writes it to the topology server (etcd/Consul), and each VTGate independently notices the change and reloads, governed by --srv_topo_cache_ttl and --srv_topo_cache_refresh. Propagation is therefore asynchronous and per-proxy: for a window of one or two seconds, some proxies plan against the new VSchema while others still hold the old one. During that window a query that is valid under exactly one of the two versions — a table bound to a vindex that the change adds or removes — fails to plan on whichever proxies disagree with it.
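To make the window concrete, the minimal sketch below models each proxy as reloading at its own delay after the topology write and reports how long the fleet is split across the two versions. The proxy names and delays are invented for illustration. The sketch does not measure anything; real reload timing depends on the cache settings named above.

```python
def mixed_version_window(reload_offsets_s: dict[str, float]) -> float:
    """Seconds during which some proxies hold the old VSchema and others the new.

    reload_offsets_s maps a (hypothetical) proxy name to the delay, in seconds,
    between the topology write and that proxy's reload.
    """
    if not reload_offsets_s:
        return 0.0
    offsets = reload_offsets_s.values()
    # The fleet is mixed from the first reload until the last one.
    return max(offsets) - min(offsets)


def proxies_on_new(reload_offsets_s: dict[str, float], t: float) -> list[str]:
    """Proxies that have already reloaded t seconds after the topology write."""
    return sorted(name for name, off in reload_offsets_s.items() if off <= t)


# Illustrative numbers only.
offsets = {"vtgate-a": 0.25, "vtgate-b": 1.0, "vtgate-c": 1.75}
print(mixed_version_window(offsets))   # 1.5
print(proxies_on_new(offsets, t=1.0))  # ['vtgate-a', 'vtgate-b']
```

Any query that is valid under only one of the two documents can fail for the whole of that interval, which is why the procedure below keeps each apply valid under both.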

The second hazard is ordering against physical schema. A VSchema that references a column, sequence, or lookup table that the underlying MySQL instances do not yet have will plan a query the VTTablet cannot execute. The rule that eliminates both hazards is to make every step backward-compatible in isolation — expand before contract — so that at no instant is the live fleet holding a routing document that contradicts the physical schema or an in-flight query.

Solution: The Ordered Deploy Procedure

The safe order is fixed: land the physical change first, apply the additive VSchema, let the fleet converge, then remove anything old in a separate deploy. The one exception is a change that moves data by altering the primary (first) column_vindexes entry — that is a resharding operation, not a VSchema edit, and must go through a workflow, never this path.
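One way to enforce that exception before anything is applied is to compare the first column_vindexes entry of every table in the old and new documents. The sketch below assumes both documents are already parsed into dictionaries with the usual "tables" and "column_vindexes" keys. The function names are hypothetical, and the check only looks at the binding, not at whether the vindex definition behind the same name changed.

```python
def primary_vindex(table_def: dict):
    """Return (columns, vindex name) of the first column_vindexes entry, or None."""
    entries = table_def.get("column_vindexes") or []
    if not entries:
        return None
    first = entries[0]
    # An entry names its column(s) with either "column" or "columns".
    cols = first.get("columns") or [first.get("column")]
    return (tuple(cols), first.get("name"))


def primary_vindex_changes(old: dict, new: dict) -> list[str]:
    """Tables present in both documents whose primary vindex binding differs."""
    old_tables = old.get("tables", {})
    new_tables = new.get("tables", {})
    return sorted(
        name
        for name in old_tables.keys() & new_tables.keys()
        if primary_vindex(old_tables[name]) != primary_vindex(new_tables[name])
    )
```

A non-empty result is a signal to stop and treat the change as data movement rather than continue with this deploy path.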

1. Apply the physical schema change with Online DDL

If the routing change depends on a new column or lookup table, submit the DDL first and let it complete on every shard before touching the VSchema. Choosing between Vitess-native and external executors is covered in Vitess native Online DDL vs. external tools.

vtctldclient ApplySchema \
  --ddl-strategy='vitess' \
  --sql="ALTER TABLE customer ADD COLUMN email VARBINARY(255)" \
  commerce

Poll until the migration reaches "complete" on all targeted shards. Applying the VSchema before this converges is the classic misordering that produces "cannot execute" errors on lagging tablets.

vtctldclient GetSchemaMigrations commerce --format=json \
  | jq -r '.[] | "\(.shard)\t\(.status)"'
# every shard must read "complete" before proceeding

Tracking that convergence robustly across a large fleet is its own topic, detailed in tracking migration progress and state machines.
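When the poll does not converge, it helps to know which shards are holding it up. The sketch below groups non-complete rows by shard. It assumes each row is a dictionary with "shard" and "status" keys, which is the shape the jq filter above implies rather than a documented contract, and the function name is hypothetical.

```python
from collections import defaultdict


def lagging_shards(rows: list[dict]) -> dict[str, list[str]]:
    """Map each shard to the statuses of its migrations that are not 'complete'."""
    pending = defaultdict(list)
    for row in rows:
        if row["status"] != "complete":
            pending[row["shard"]].append(row["status"])
    return dict(pending)


# Illustrative rows only.
rows = [
    {"shard": "-80", "status": "complete"},
    {"shard": "80-", "status": "running"},
]
print(lagging_shards(rows))  # {'80-': ['running']}
```

An empty dictionary means no listed migration is outstanding. It says nothing about migrations that were never submitted.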

2. Dry-run the additive VSchema against the control plane

The new document must be a superset: it adds the vindex definition and its binding, and removes nothing a live query still uses. Prove it plans cleanly before persisting.

vtctldclient ApplyVSchema \
  --vschema="$(cat vschema/commerce.json)" \
  --dry-run \
  commerce

A non-zero exit, or a diff touching tables outside the intended change set, stops the deploy here — before any proxy sees it.
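The superset requirement can also be checked offline before the dry run. The sketch below lists every vindex, table, and column binding that the old document has and the new one lacks. It assumes both documents are parsed JSON with "vindexes" and "tables" keys, the function names are hypothetical, and it complements the control-plane dry run rather than replacing it.

```python
def _binding_key(binding: dict) -> tuple:
    # An entry names its column(s) with either "column" or "columns".
    cols = binding.get("columns") or [binding.get("column")]
    return (tuple(cols), binding.get("name"))


def removed_by(old: dict, new: dict) -> list[str]:
    """List everything present in `old` that `new` no longer carries."""
    removed = []
    new_vindexes = new.get("vindexes", {})
    new_tables = new.get("tables", {})
    for name in old.get("vindexes", {}):
        if name not in new_vindexes:
            removed.append(f"vindex {name}")
    for table, old_def in old.get("tables", {}).items():
        if table not in new_tables:
            removed.append(f"table {table}")
            continue
        kept = {_binding_key(b) for b in new_tables[table].get("column_vindexes", [])}
        for b in old_def.get("column_vindexes", []):
            if _binding_key(b) not in kept:
                removed.append(f"binding {table} -> {b.get('name')}")
    return removed
```

An additive change yields an empty list. Anything else belongs in the later contract deploy.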

3. Apply, then read the document back

Only after a clean dry run, apply for real and confirm the topology server holds exactly the reviewed bytes.

vtctldclient ApplyVSchema \
  --vschema="$(cat vschema/commerce.json)" \
  commerce

vtctldclient GetVSchema commerce > /tmp/live.json
diff <(jq -S . vschema/commerce.json) <(jq -S . /tmp/live.json)

An empty diff means the control plane is authoritative for your artifact. The per-proxy reload begins immediately; the Verification section below confirms it landed.
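The same read-back comparison can be done in Python when jq is not available on the deploy host. This is a minimal sketch that sorts keys the way jq -S does and returns a unified diff. The function names are hypothetical, and it assumes the read-back output is plain JSON with no extra fields added by the server.

```python
import difflib
import json


def canonical(doc: str) -> list[str]:
    """Parse a JSON document and re-serialize it with sorted keys, one item per line."""
    return json.dumps(json.loads(doc), indent=2, sort_keys=True).splitlines()


def vschema_diff(reviewed: str, live: str) -> str:
    """Unified diff between the reviewed artifact and the live document; '' if equal."""
    return "\n".join(
        difflib.unified_diff(
            canonical(reviewed), canonical(live),
            fromfile="reviewed", tofile="live", lineterm="",
        )
    )
```

Key order and whitespace do not affect the result, so only a real difference in content produces output.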

4. Drive the whole sequence from one Python orchestrator

Python orchestration builders should wrap the same vtctldclient verbs so the automated path is byte-identical to the manual one, with the ordering enforced in code rather than a runbook. This mirrors the harness described in async VSchema validation workflows.

import json
import subprocess
import time


def _run(cmd: list[str]) -> str:
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"{cmd[1]} failed: {result.stderr.strip()}")
    return result.stdout


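def migrations_complete(keyspace: str) -> bool:
    """True only if at least one migration is listed and every listed one is complete."""
    out = _run(["vtctldclient", "GetSchemaMigrations", keyspace, "--format=json"])
    rows = json.loads(out)
    # An empty list counts as "not complete", so a keyspace with no recorded
    # migrations keeps the caller waiting until its timeout.
    return bool(rows) and all(r["status"] == "complete" for r in rows)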


def deploy_vschema(keyspace: str, vschema_path: str, timeout_s: int = 600) -> None:
    with open(vschema_path) as f:
        vschema = f.read()
    json.loads(vschema)  # fail fast on malformed JSON before any RPC

    # 1. wait for the physical DDL to converge on every shard
    deadline = time.monotonic() + timeout_s
    while not migrations_complete(keyspace):
        if time.monotonic() > deadline:
            raise TimeoutError("DDL did not reach 'complete' on all shards")
        time.sleep(5)

    # 2. dry-run the additive VSchema against the control plane
    _run(["vtctldclient", "ApplyVSchema",
          f"--vschema={vschema}", "--dry-run", keyspace])

    # 3. apply, then assert the topology server holds our exact artifact
    _run(["vtctldclient", "ApplyVSchema", f"--vschema={vschema}", keyspace])
    live = _run(["vtctldclient", "GetVSchema", keyspace])
    if json.loads(live) != json.loads(vschema):
        raise RuntimeError("persisted VSchema does not match applied artifact")

Gate this behind a dry-run flag in CI, and only let it apply for real after every offline and control-plane check has passed. The orchestrator should export DDL progress and VSchema version as metrics so an operator sees convergence in real time.
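What to export is a design choice. One possible shape, sketched below, is a short content fingerprint of the applied document standing in for a "VSchema version", plus the fraction of listed migrations that are complete. Both helpers and their names are assumptions made for illustration. They are not Vitess features, and wiring the values into a metrics client is left out.

```python
import hashlib
import json


def vschema_fingerprint(doc: str) -> str:
    """Short, order-independent hash of a VSchema JSON document."""
    canonical = json.dumps(json.loads(doc), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def ddl_progress(rows: list[dict]) -> float:
    """Fraction of listed migrations whose status is 'complete' (0.0 if none listed)."""
    if not rows:
        return 0.0
    done = sum(1 for r in rows if r["status"] == "complete")
    return done / len(rows)
```

Because the fingerprint is computed from sorted keys, the reviewed artifact and the document read back from the topology server hash to the same value whenever their content matches.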

5. Remove the old binding in a separate, later deploy

Once every proxy serves the additive version and no query references the old vindex, a follow-up ApplyVSchema drops it. Splitting expand and contract across two deploys is what keeps the fleet consistent through the reload window — never combine “add the new” and “remove the old” in one apply while traffic depends on both.
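The contract document can be derived mechanically from the live one, which keeps the second deploy as reviewable as the first. The sketch below unbinds one vindex from one table and drops the definition only if no other table still binds it. The function name is hypothetical, and it assumes a parsed VSchema with "vindexes" and "tables" keys. It cannot tell whether live queries still depend on the old vindex, so that check stays with the operator.

```python
import copy


def contract_vschema(vschema: dict, table: str, vindex: str) -> dict:
    """Return a copy with `vindex` unbound from `table`.

    The vindex definition is dropped too, but only when no other table binds it.
    """
    doc = copy.deepcopy(vschema)  # never mutate the caller's document
    bindings = doc["tables"][table].get("column_vindexes", [])
    if bindings and bindings[0]["name"] == vindex:
        raise ValueError("refusing to remove the primary vindex: that is data movement")
    doc["tables"][table]["column_vindexes"] = [
        b for b in bindings if b["name"] != vindex
    ]
    still_bound = any(
        b["name"] == vindex
        for t in doc["tables"].values()
        for b in t.get("column_vindexes", [])
    )
    if not still_bound:
        doc.get("vindexes", {}).pop(vindex, None)
    return doc
```

The output then goes through the same dry run, apply, and read-back steps as the additive document.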

Edge Cases and Gotchas

Contract folded into expand. Removing an old vindex in the same apply that adds the new one means that, during the reload window, proxies plan against mutually exclusive documents and queries fail on whichever proxies disagree. Always split into two deploys.
VSchema applied before DDL converges. If even one shard’s migration is still running, a query the new routing plans will hit a tablet without the column and error with "cannot execute". Gate on all shards reading "complete", not just the first.
Reordered primary vindex. Moving the first column_vindexes entry changes the placement function and strands already-written rows on the wrong shard. This is data movement — route it through a resharding workflow, not this deploy.
Lookup table not backfilled. Binding a lookup vindex whose backing table has not been populated for existing rows makes secondary-column reads miss silently. Backfill and reconcile the lookup before the binding goes live.
In-band ALTER VSCHEMA left enabled. A non-empty --vschema_ddl_authorized_users lets a client mutate routing outside this reviewed path, bypassing the dry-run and version-controlled artifact. Leave it empty in production.
Plan-cache re-warm mistaken for an incident. A large VSchema reload invalidates each VTGate plan cache, so the first hit of every query shape re-plans — a brief CPU and latency bump that is expected, not a regression. Stage large changes off-peak.
Routing-rule collisions. If the change also touches routing rules, they resolve before the tables block and can shadow it; validate them against dynamic routing rules and query rewriting so a stale rule does not override the new vindex.
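The routing-rule collision above can be screened for mechanically. The sketch below scans a routing-rules document for rules that match a table and send it somewhere other than its own keyspace. It assumes the rules are parsed JSON of the form {"rules": [{"from_table": ..., "to_tables": [...]}]}. The function name is hypothetical, and the matching covers only plain and keyspace-qualified names with an optional "@tablet_type" suffix.

```python
def shadowing_rules(routing_rules: dict, keyspace: str, table: str) -> list[dict]:
    """Rules that match `table` and redirect it away from `keyspace`.`table`."""
    hits = []
    qualified = f"{keyspace}.{table}"
    for rule in routing_rules.get("rules", []):
        # from_table may be "t" or "ks.t", optionally followed by "@tablet_type".
        name = rule.get("from_table", "").split("@", 1)[0]
        if name in (table, qualified) and qualified not in rule.get("to_tables", []):
            hits.append(rule)
    return hits
```

Any rule it returns would take effect before the new vindex binding is consulted, so it deserves a review before the VSchema change ships.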

Verification

Confirm the fleet converged and the new routing behaves before calling the deploy done. Watch the scatter-plan rate on the keyspace across the reload window — it must not step up:

# Prometheus: scatter rate must stay flat through and after the reload
rate(vtgate_queries_processed{plan="ScatterGather",keyspace="commerce"}[5m])

Then prove the intended path resolves to a bounded shard set with VEXPLAIN on the newly routed column:

VEXPLAIN PLAN SELECT * FROM customer WHERE email = 'a@example.com';
-- expect a lookup-vindex route, not a broadcast/ScatterGather plan

A flat scatter rate plus a single-shard or lookup route on the target query confirms all proxies reloaded, the physical schema matches the routing document, and no query shape silently broadened.
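"Must not step up" can be turned into an automated gate by comparing rate samples taken before the apply with samples taken after the reload window. The sketch below is one simple way to do that. The function name and the 10% tolerance are assumptions for illustration, and fetching the samples from Prometheus is out of scope.

```python
from statistics import mean


def scatter_rate_stepped_up(
    before: list[float], after: list[float], tolerance: float = 0.10
) -> bool:
    """True if the mean scatter rate after the deploy exceeds the baseline mean
    by more than `tolerance` (a fraction of the baseline)."""
    if not before or not after:
        raise ValueError("need at least one sample on each side of the deploy")
    baseline = mean(before)
    return mean(after) > baseline * (1 + tolerance)
```

With a baseline of zero, any positive rate afterwards counts as a step up, which matches the intent that a newly routed query should never introduce scatter plans.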

Related

Mastering VSchema Syntax and Structure: the object hierarchy and grammar this deploy procedure assumes.
Async VSchema Validation Workflows: validating a candidate document out of band before it reaches this pipeline.
Coordinating Multi-Shard Schema Migrations: keeping the DDL half of a paired change aligned across every shard.
