Automating VSchema Sync with Python Scripts

Applying a routing change by hand does not scale past a handful of keyspaces: the problem is not writing the JSON, it is guaranteeing that every ApplyVSchema lands exactly once, in the right order relative to schema migrations, without ever leaving the routing graph half-updated.

Where This Fits

This page owns the apply-and-reconcile half of an out-of-band routing pipeline. Its companion, Async VSchema Validation Workflows, verifies a candidate routing definition offline before it is ever dispatched; here we take a definition that has already cleared validation and drive it safely onto live infrastructure. Both build on the abstractions in VSchema Configuration & Routing Rule Management — keyspaces, primary and lookup vindexes, sequence tables, and routing rules — and both assume you are comfortable with how the VTGate routing layer turns a stored VSchema into a query plan. The scripts below interface with the control plane through vtctldclient (gRPC) or the vtadmin REST API and submit ApplyVSchema directives against a running vtctld.

The Reconcile Loop

A production sync is not a single API call; it is a reconciliation loop that only mutates state when the desired routing definition diverges from what the control plane currently serves. The orchestrator computes a hash of the target payload, fetches the live VSchema for the keyspace, and short-circuits when the two agree. This idempotency is what makes the sync safe to run on every merge to main, on a cron, and on operator retry — three triggers that would otherwise stack duplicate ApplyVSchema calls and churn VTGate plan caches for no reason.

The loop tracks each keyspace through an explicit lifecycle — pending, applying, validating, committed — and a failed validation rolls back to pending rather than committing a partial graph. A sync only leaves pending when the current and target hashes diverge, so a converged keyspace never advances.
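The lifecycle can be modelled as a small state machine. The sketch below is illustrative only: `SyncState`, `ALLOWED`, `advance`, and `next_state` are hypothetical names, and `committed` is treated as terminal for a single run.

```python
from enum import Enum


class SyncState(Enum):
    PENDING = "pending"
    APPLYING = "applying"
    VALIDATING = "validating"
    COMMITTED = "committed"


# Legal transitions. A failed validation returns to PENDING instead of committing.
ALLOWED = {
    SyncState.PENDING: {SyncState.APPLYING},
    SyncState.APPLYING: {SyncState.VALIDATING},
    SyncState.VALIDATING: {SyncState.COMMITTED, SyncState.PENDING},
    SyncState.COMMITTED: set(),  # terminal for one run (assumption of this sketch)
}


def advance(state: SyncState, target: SyncState) -> SyncState:
    """Move to `target`, refusing any transition the lifecycle does not allow."""
    if target not in ALLOWED[state]:
        raise ValueError(f"illegal transition {state.value} -> {target.value}")
    return target


def next_state(state: SyncState, current_hash: str, target_hash: str) -> SyncState:
    """A keyspace only leaves PENDING when the live and target hashes diverge."""
    if state is SyncState.PENDING and current_hash != target_hash:
        return advance(state, SyncState.APPLYING)
    return state  # converged, or already past PENDING
```

Encoding the transitions as data makes an accidental jump from pending straight to committed fail loudly rather than silently.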

Coordinating the Apply with Online DDL

The single most damaging race in a sharded environment is applying a routing change before the schema it references exists on the shards. Vitess runs non-blocking migrations through Online DDL orchestration, using the strategy selected by --ddl_strategy (the native vitess strategy, or gh-ost or pt-online-schema-change where those are still supported), and those migrations run independently of routing metadata. If a sync registers a new lookup vindex before the backing table or column exists on every shard, VTGate will plan point queries against objects that are not there yet and return 1146 (table doesn't exist) or 1054 (unknown column) errors. Apply the routing change too late, and VTGate keeps planning against the stale definition and rejects queries that are already valid.

The orchestrator therefore gates the apply on migration completion. It polls vtctldclient GetSchemaMigrations <keyspace> (or the vtadmin migrations endpoint) until the migration reports complete on every shard, then dispatches ApplyVSchema, then waits for routing convergence. Because each shard's primary tablet records the authoritative state of its own migrations, the poll reads the same signal used by migration progress tracking; the sync subscribes to it rather than guessing at timing.
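A minimal sketch of that gate, assuming the caller supplies a `fetch_statuses` callable (hypothetical) that returns a shard-to-status mapping, for example by parsing the output of GetSchemaMigrations. The status strings `complete`, `failed`, and `cancelled` follow the Online DDL migration states; everything else here is illustrative.

```python
import time
from typing import Callable, Dict

TERMINAL_FAILURES = {"failed", "cancelled"}


def wait_for_migration(
    fetch_statuses: Callable[[], Dict[str, str]],
    poll_interval: float = 5.0,
    max_wait: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until every shard reports 'complete'; raise on failure or timeout."""
    waited = 0.0
    while True:
        statuses = fetch_statuses()  # e.g. {"-80": "running", "80-": "complete"}
        failed = {shard: st for shard, st in statuses.items() if st in TERMINAL_FAILURES}
        if failed:
            raise RuntimeError(f"migration failed on shards: {failed}")
        # An empty mapping means "nothing reported yet", not "done".
        if statuses and all(st == "complete" for st in statuses.values()):
            return
        if waited >= max_wait:
            raise TimeoutError(f"migration not complete after {max_wait}s: {statuses}")
        sleep(poll_interval)
        waited += poll_interval
```

Injecting `sleep` keeps the loop testable without real delays, and the explicit `max_wait` stops a stuck migration from holding the pipeline open indefinitely.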

To keep concurrent runs from racing each other, wrap the critical section in a keyspace lock. vtctldclient LockKeyspace <keyspace> takes a topology-server lock so that two pipelines cannot dispatch overlapping ApplyVSchema calls and interleave a corrupt routing graph.
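One way to make the release unconditional is a context manager around the critical section. In this sketch `acquire` and `release` are hypothetical callables supplied by the caller (for instance thin wrappers around whatever lock and unlock operations your tooling exposes); the sketch does not assume any particular Vitess API for them.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def keyspace_lock(
    keyspace: str,
    acquire: Callable[[str], None],
    release: Callable[[str], None],
) -> Iterator[None]:
    """Hold a keyspace lock for the duration of the block and always release it."""
    acquire(keyspace)
    try:
        yield
    finally:
        # Runs on success, on exceptions, and on early returns inside the block.
        release(keyspace)
```

The apply then sits inside `with keyspace_lock("commerce", acquire=..., release=...):`, so an exception during the apply cannot leave the lock held by a live process.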

Implementation

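The orchestrator below is built on requests for the vtadmin REST surface and tenacity for bounded retries. The apply method is a compare-then-write guard: it fetches the live definition, hashes both sides, and only writes when they differ. The check and the write are separate requests, so the guard is safe against concurrent writers only when it runs under the keyspace lock.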

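import hashlib
import json

import requests
from tenacity import retry, stop_after_attempt, wait_exponential


class VSchemaSyncOrchestrator:
    def __init__(self, vtadmin_url: str, timeout: int = 30):
        self.vtadmin_url = vtadmin_url.rstrip("/")
        self.timeout = timeout

    def _compute_payload_hash(self, payload: str) -> str:
        """Hash the canonical form of a JSON document, ignoring key order and whitespace."""
        canonical = json.dumps(json.loads(payload), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode()).hexdigest()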

    def _fetch_current_vschema(self, keyspace: str) -> str:
        """Fetch the live VSchema JSON string for a keyspace via the vtadmin REST API."""
        # Assumed vtadmin REST path: GET /api/keyspaces/{keyspace}/vschema
        # (confirm against the routes your vtadmin version actually exposes)
        response = requests.get(
            f"{self.vtadmin_url}/api/keyspaces/{keyspace}/vschema",
            timeout=self.timeout,
        )
        response.raise_for_status()
        return response.text  # raw JSON for hashing

    @retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=2, min=4, max=30))
    def apply_vschema(self, keyspace: str, vschema_payload: str):
        """Idempotent VSchema application with a compare-and-swap guard."""
        current = self._fetch_current_vschema(keyspace)
        if self._compute_payload_hash(current) == self._compute_payload_hash(vschema_payload):
            return {"status": "skipped", "reason": "schema_already_synced"}

        # Assumed vtadmin REST path: POST /api/keyspaces/{keyspace}/vschema
        # (confirm against the routes your vtadmin version actually exposes)
        response = requests.post(
            f"{self.vtadmin_url}/api/keyspaces/{keyspace}/vschema",
            json={"vschema": vschema_payload},
            timeout=self.timeout,
        )
        response.raise_for_status()
        return response.json()

    def coordinate_online_ddl(self, keyspace: str, ddl_uuid: str, updated_vschema: str):
        """Check the DDL status once and sync the VSchema only if it is complete.

        Raises RuntimeError otherwise, so the caller decides how to poll.
        """
        # vtadmin REST path: GET /api/keyspaces/{keyspace}/migrations/{uuid}
        resp = requests.get(
            f"{self.vtadmin_url}/api/keyspaces/{keyspace}/migrations/{ddl_uuid}",
            timeout=self.timeout,
        )
        resp.raise_for_status()
        status = resp.json().get("status")
        if status == "complete":
            return self.apply_vschema(keyspace, updated_vschema)
        raise RuntimeError(f"DDL {ddl_uuid} not ready. Status: {status}")

Two design choices carry the reliability of this pattern. First, tenacity.retry with exponential backoff absorbs transient control-plane unavailability, such as a vtctld restart or a topology-server leader election, without amplifying load. The decorator above is capped at five attempts and adds no jitter, so switch to wait_random_exponential if many runners may retry in lockstep. Second, migration state must be persisted outside the process. Keeping it in a distributed key-value store (etcd or Redis) rather than in script memory lets a crashed run resume mid-pipeline and lets multiple regions coordinate against one source of truth, so a restart never re-applies a change that already committed.
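The persistence point can be sketched with a tiny key-value interface. Everything here is hypothetical: `StateStore` is an assumed minimal interface, `InMemoryStore` stands in for etcd or Redis so the example runs without a server, and the key layout is invented for illustration.

```python
from typing import Callable, Dict, Optional, Protocol


class StateStore(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def put(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """Stand-in for etcd or Redis; a real store would survive process restarts."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value


def apply_once(
    store: StateStore,
    keyspace: str,
    payload_hash: str,
    apply: Callable[[], None],
) -> bool:
    """Run `apply` unless this exact payload is already recorded as committed."""
    key = f"vschema-sync/{keyspace}/committed"  # hypothetical key layout
    if store.get(key) == payload_hash:
        return False  # an earlier run, possibly in another process, committed it
    apply()
    store.put(key, payload_hash)  # record only after the apply succeeded
    return True
```

A crash between `apply()` and `store.put()` would cause one repeat attempt on restart, which is where the hash guard in apply_vschema turns the repeat into a no-op.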

When a sync coincides with a shift in traffic — legacy queries that must keep resolving while a new vindex propagates — the orchestrator can stage temporary overrides in the routing_rules object described under dynamic routing rules and query rewriting, then remove them once the primary vindex is live. Validate the assembled payload against the Vitess VSchema JSON schema before dispatch, exactly as VSchema syntax and structure prescribes; a malformed routing_rules entry is a common cause of silent routing degradation.

Edge Cases and Gotchas

Non-canonical JSON breaks the hash guard. vtadmin may return keys in a different order or with different whitespace than your source file, so a byte-level sha256 will report divergence on identical semantics and re-apply forever. Normalise both sides with json.dumps(obj, sort_keys=True, separators=(",", ":")) before hashing.
Partial DDL completion. coordinate_online_ddl must confirm complete on every shard, not just the first one the API returns. A migration that finished on three of four shards will still send every query routed to the fourth shard into 1146 errors.
Lock leakage. If the process dies holding a LockKeyspace lock, subsequent runs block until the lease expires. Always release in a finally block and set a bounded lock TTL.
Unbounded retries during a real outage. stop_after_attempt(5) is deliberate — pair it with a circuit breaker that halts propagation across keyspaces once more than ~15% of shards report DDL or apply failures, so one bad region does not fan a corrupt change out fleet-wide.
Skipping validation on “trivial” changes. Every payload, including a one-line routing-rule tweak, must clear the offline checks first. The apply path assumes its input is already valid; it is not a second line of defence.
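Reload lag mistaken for failure. A successful ApplyVSchema does not mean every VTGate is already routing on the new definition: each VTGate picks up the change through its topology watch, and VTTablet separately refreshes its in-memory table schema on the interval set by --queryserver-config-schema-reload-time. Poll for convergence rather than asserting success the instant the API returns 200.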
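The circuit breaker from the retries item above could look roughly like this. `PropagationBreaker` is a hypothetical helper, and the 0.15 default mirrors the ~15% figure in the text rather than any Vitess setting.

```python
class PropagationBreaker:
    """Trip once the failed fraction of shards exceeds a threshold."""

    def __init__(self, total_shards: int, max_failure_ratio: float = 0.15):
        if total_shards <= 0:
            raise ValueError("total_shards must be positive")
        self.total_shards = total_shards
        self.max_failure_ratio = max_failure_ratio
        self.failed_shards = set()

    def record_failure(self, shard: str) -> None:
        # A set, so repeated retries against the same shard count once.
        self.failed_shards.add(shard)

    @property
    def tripped(self) -> bool:
        return len(self.failed_shards) / self.total_shards > self.max_failure_ratio
```

The orchestrator would check `breaker.tripped` before moving on to the next keyspace and stop fanning out as soon as it returns True.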

Verification

Confirm the applied definition matches your source of truth by diffing the live VSchema against the artifact you dispatched — the sync is only complete when this returns nothing:

diff <(vtctldclient GetVSchema commerce | jq -S .) <(jq -S . vschema/commerce.json)

A clean diff proves the control plane serves exactly the reviewed payload. For convergence at the routing layer, watch VTGate’s VSchemaReloads counter climb to match the number of vtgate pods, then run a representative point query through VEXPLAIN PLAN <sql> and confirm it resolves to a single shard rather than a scatter — the definitive signal that routing metadata and physical schema now agree.
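When the check has to run inside the Python pipeline rather than a shell, the same key-sorted comparison can be sketched with the standard library. `vschema_diff` is a hypothetical helper that takes the two JSON documents as strings; how you obtain the live one (for example from GetVSchema output) is left to the caller.

```python
import difflib
import json
from typing import List


def vschema_diff(live_json: str, artifact_json: str) -> List[str]:
    """Return unified-diff lines between two VSchema documents after key-sorting both."""

    def canonical(text: str) -> List[str]:
        # Mirrors `jq -S .`: parse, sort keys, pretty-print one field per line.
        return json.dumps(json.loads(text), sort_keys=True, indent=2).splitlines()

    return list(
        difflib.unified_diff(
            canonical(live_json),
            canonical(artifact_json),
            fromfile="live",
            tofile="artifact",
            lineterm="",
        )
    )
```

An empty list corresponds to the clean diff described above; any returned lines can be logged verbatim as the reason the sync is not yet complete.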

Related

Async VSchema Validation Workflows: the offline verification stage that gates every payload this orchestrator applies.
How to Deploy VSchema Changes Without Downtime: the cutover sequencing this sync automates.
Tracking Migration Progress and State Machines: the completion signal the DDL-coordination step subscribes to.
