VSchema Configuration & Routing Rule Management in Distributed MySQL Topologies

VSchema is the declarative contract that turns a fleet of independent MySQL instances into a single logical database with deterministic query routing. This reference is written for database platform engineers, MySQL SREs, and Python orchestration builders who own that contract in production: it defines the core routing abstractions, works through each routing subsystem with runnable configuration, catalogues the tuning knobs and failure modes that break routing under load, and shows how automation pipelines apply and validate VSchema safely across hundreds of shards.

Control plane vs. data plane: vtctldclient/vtadmin write VSchema and routing rules into the topology server; every VTGate watches it and reloads asynchronously, then routes each query either single-shard or scatter-gather to the shards.

Physical Topology vs. Logical Routing

Vitess decouples application query execution from the underlying MySQL infrastructure through a stateless proxy layer and a tablet-managed data plane. Distinguishing the physical layer from the logical layer is the first prerequisite for reasoning about routing. The physical layer comprises keyspaces, shards, and MySQL instances, with cluster state persisted in a distributed coordination backend such as etcd or Consul. The logical layer is governed entirely by VSchema — a configuration artifact that dictates query parsing, routing topology, and execution planning. The two layers meet inside the VTGate routing architecture: the proxy owns the logical view, while VTTablet and MySQL own the physical bytes.

When a client connection reaches VTGate, the proxy parses the incoming SQL, resolves the target keyspace, and consults the active VSchema to generate an execution plan. Routing decisions are driven by vindexes (virtual indexes), which map logical column values to specific physical shards. This architecture eliminates application-level sharding logic but imposes strict operational requirements: a misaligned routing configuration silently degrades into scatter-gather execution, amplifies cross-shard transaction overhead, and saturates connection pools on the busiest MySQL instances.

The Logical Model: Keyspace, VSchema, Vindex, Routing Rule

Four abstractions govern every routing decision. Understanding how they compose is a prerequisite for every later section.

Keyspace — a logical database that maps to one or more physical shards. A keyspace is either unsharded (a single shard, 0) or sharded (a set of key-range shards such as -80 and 80-). The keyspace-level VSchema declares which mode is in force via the sharded boolean.
VSchema — the JSON document, stored per keyspace in the topology server, that binds logical tables to vindexes and declares any global routing directives. It is the authoritative routing contract; VTGate holds no routing knowledge that does not originate here.
Vindex — the function that maps a logical column value to a keyspace ID, which in turn falls inside exactly one shard’s key range. A table’s primary vindex (the first entry in its column_vindexes list) determines shard placement on INSERT; secondary vindexes let VTGate prune the shard set for WHERE predicates on non-primary columns.
Routing rule — a keyspace/table-level redirect evaluated before vindex resolution. Routing rules rewrite the target of a query (for example, pointing commerce.orders at commerce_new.orders mid-migration) without any application change.
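The last step of the vindex chain is key-range containment, from keyspace ID to shard. The sketch below assumes Vitess's half-open key ranges over raw bytes, written as hex boundaries: -80 means everything below 0x80 and 80- means everything from 0x80 up. It does not reproduce any real vindex function; the keyspace IDs in the usage are arbitrary values, and shard_for is a hypothetical helper.

```python
def parse_shard(name: str) -> tuple[bytes, bytes]:
    """Split a shard name like '40-80' into (start, end) byte boundaries.

    An empty side means unbounded. The single shard of an unsharded keyspace
    ('0' or '-') covers the whole range. Assumes even-length hex boundaries.
    """
    if name in ("0", "-"):
        return b"", b""
    start_hex, end_hex = name.split("-")
    return bytes.fromhex(start_hex), bytes.fromhex(end_hex)


def in_range(keyspace_id: bytes, start: bytes, end: bytes) -> bool:
    """Half-open [start, end) comparison on raw bytes; an empty end has no upper bound."""
    return keyspace_id >= start and (end == b"" or keyspace_id < end)


def shard_for(keyspace_id: bytes, shards: list[str]) -> str:
    """Return the one shard whose key range contains keyspace_id."""
    matches = [s for s in shards if in_range(keyspace_id, *parse_shard(s))]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one shard for {keyspace_id.hex()}, got {matches}")
    return matches[0]


print(shard_for(bytes.fromhex("166b40b44aba4bd6"), ["-80", "80-"]))  # -80
```

Byte-wise comparison is what makes the boundaries exact. A keyspace ID beginning with 0x80 is not below the bound b"\x80", so it lands in 80-. A set of shard names that overlaps or leaves gaps raises an error instead of silently picking one shard.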

A minimal sharded VSchema ties these together. The vindexes block defines reusable functions; each table’s column_vindexes binds a column to one of them:

{
  "sharded": true,
  "vindexes": {
    "hash": { "type": "hash" },
    "customer_lookup": {
      "type": "consistent_lookup_unique",
      "params": {
        "table": "commerce.customer_lookup",
        "from": "email",
        "to": "keyspace_id"
      },
      "owner": "customer"
    }
  },
  "tables": {
    "customer": {
      "column_vindexes": [
        { "column": "id", "name": "hash" },
        { "column": "email", "name": "customer_lookup" }
      ]
    }
  }
}

Here id is the primary vindex (hash) that places every customer row; email is a secondary lookup vindex that lets a query filtering on email resolve to a single shard instead of fanning out. The exact grammar — token types, sequence tables, unindexed behaviour, and reference tables — is covered in depth under mastering VSchema syntax and structure.
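The sketch below models the effect of the secondary vindex in a deliberately simplified way. It asks whether a set of equality-predicate columns can hit a unique vindex on the table. VTGate's planner does more: it ranks candidate vindexes by cost rather than by list order, and it handles IN lists and multi-column vindexes. This sketch only answers "single-shard possible or scatter?" for the single-column form. UNIQUE_VINDEX_TYPES is a partial list assumed for illustration.

```python
from typing import Optional

# Partial list of vindex types that map one value to at most one keyspace ID.
# Assumption: extend it to match the vindex types your Vitess version ships.
UNIQUE_VINDEX_TYPES = {
    "hash", "xxhash", "unicode_loose_md5", "unicode_loose_xxhash",
    "binary_md5", "numeric", "lookup_unique", "consistent_lookup_unique",
}

# The customer VSchema from the text, as a Python dict.
CUSTOMER_VSCHEMA = {
    "sharded": True,
    "vindexes": {
        "hash": {"type": "hash"},
        "customer_lookup": {
            "type": "consistent_lookup_unique",
            "params": {"table": "commerce.customer_lookup", "from": "email", "to": "keyspace_id"},
            "owner": "customer",
        },
    },
    "tables": {
        "customer": {
            "column_vindexes": [
                {"column": "id", "name": "hash"},
                {"column": "email", "name": "customer_lookup"},
            ]
        }
    },
}


def routing_vindex(vschema: dict, table: str, equality_columns: set[str]) -> Optional[str]:
    """Name of the first unique vindex an equality predicate can use, or None (scatter)."""
    defs = vschema.get("vindexes", {})
    for binding in vschema["tables"][table].get("column_vindexes", []):
        unique = defs[binding["name"]]["type"] in UNIQUE_VINDEX_TYPES
        if unique and binding.get("column") in equality_columns:
            return binding["name"]
    return None


print(routing_vindex(CUSTOMER_VSCHEMA, "customer", {"email"}))  # customer_lookup
print(routing_vindex(CUSTOMER_VSCHEMA, "customer", {"name"}))   # None
```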

VSchema as the Routing Contract

VSchema is the authoritative interface between application data models and the distributed topology. Because VTGate derives every plan from it, three properties must hold for routing to stay deterministic.

Every routable table needs a primary vindex, and its queries need predicates the vindexes can use. A query with no usable vindex forces VTGate into scatter execution. This happens whether no vindex was declared or the query never constrains a vindex column. The query then fans out to every shard, each shard does work even when it holds no matching rows, and latency scales with shard count rather than result size. VEXPLAIN PLAN <sql> surfaces this before it reaches production. The canonical symptom is a Route operator whose Variant is Scatter on a query that should be single-shard.
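Plan inspection can run as a CI check, which catches scatter plans before deploy. The minimal sketch below walks a plan tree and collects every scatter Route. It assumes the JSON from VEXPLAIN PLAN uses OperatorType, Variant and Inputs keys, as the operator names above suggest. The sample plan is hand-written, not captured output. Fetching the plan through a MySQL driver is left out, and scatter_routes is a hypothetical helper.

```python
import json


def scatter_routes(plan: dict) -> list[dict]:
    """Collect every Route operator whose Variant is Scatter, at any depth."""
    found = []
    stack = [plan]
    while stack:
        node = stack.pop()
        if node.get("OperatorType") == "Route" and node.get("Variant") == "Scatter":
            found.append(node)
        stack.extend(node.get("Inputs", []))  # descend into child operators
    return found


# Hand-written illustration: a join whose right side fans out to every shard.
sample = json.loads("""{"OperatorType": "Join", "Variant": "Join", "Inputs": [
  {"OperatorType": "Route", "Variant": "EqualUnique", "Keyspace": {"Name": "commerce", "Sharded": true}},
  {"OperatorType": "Route", "Variant": "Scatter", "Keyspace": {"Name": "commerce", "Sharded": true}}]}""")

print(len(scatter_routes(sample)))  # 1
```

In a pipeline, run this over a corpus of queries that are expected to be single-shard. Fail the build whenever the returned list is non-empty.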

VSchema is infrastructure-as-code. Changes must be version-controlled, peer-reviewed, and applied through automated pipelines rather than by hand. The apply path is vtctldclient ApplyVSchema, which writes the new document into the topology server:

vtctldclient ApplyVSchema \
  --vschema-file commerce.vschema.json \
  commerce

Propagation is eventually consistent. Every VTGate watches the topology server and reloads the keyspace VSchema asynchronously. During the reload window, different VTGate instances may plan the same query against different VSchema versions. Any orchestration that applies VSchema must therefore treat the write as the start of a rollout, poll each VTGate for the new version, and verify health before declaring success — never assume the write is globally visible the moment ApplyVSchema returns. Applying schema and VSchema in the correct order without a routing gap is the subject of deploying VSchema changes without downtime.
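"Poll for the new version" assumes a version exists, but the keyspace VSchema document has no revision field of its own. A pipeline therefore needs its own version notion. One hedged option is a canonical digest of the document it applied, recorded in the change record. That digest can then be compared against what vtctldclient GetVSchema returns. The topology server may normalise the document, for example by dropping empty fields, so normalise both sides the same way before comparing. vschema_digest is a hypothetical helper.

```python
import hashlib
import json


def vschema_digest(doc: dict) -> str:
    """Stable SHA-256 of a VSchema document, independent of key order and whitespace."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two documents with the same content but different key order produce the same digest. Any change to a vindex or table binding produces a different one.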

Vindex Strategy and Cross-Shard Execution

The vindex layer is the mechanism that keeps queries single-shard. Choosing the wrong vindex type is the most common root cause of scatter amplification, so the taxonomy matters.

Choosing a vindex is choosing which access patterns get single-shard latency: a functional vindex computes the shard in one hop, while a lookup vindex trades a second hop and a consistency obligation for the ability to route on a secondary key.

Functional vindexes compute the keyspace ID directly from the column value: no extra table, no extra hop. hash suits high-cardinality integer keys. xxhash accepts arbitrary byte strings, which makes it the usual choice for UUIDs and other string keys. unicode_loose_md5 suits case-insensitive string keys. These are the fast path: an O(1) computation resolves the shard.

Lookup vindexes maintain an external mapping table that correlates a secondary column value to a keyspace ID, so VTGate can route on a column that is not the primary sharding key. They are essential when the data model demands access by more than one key, for example fetching an order by order_id when the table is sharded by customer_id. The trade-off is a second query hop plus a consistency obligation:

- VTGate maintains an owned consistent_lookup vindex on every write to its owner table. It keeps the mapping consistent without requiring two-phase commit, and is designed to tolerate orphaned lookup rows rather than missing ones.
- An unowned lookup is not updated by VTGate on writes. It depends on whatever external process populates it and can drift from the base table.

Configuring lookup vindexes for cross-shard joins covers ownership, backfill, and shard-affinity requirements. The high-QPS tuning of these structures (plan caching, buffer-pool sizing, avoiding multi-column lookups) is handled in optimizing vindex performance for high QPS.
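The sketch below is a toy in-memory model of the two-hop route a unique lookup vindex adds. It has two shards split at 0x80, and the lookup table is sharded by the lookup column itself, so hop one is a single-shard read. toy_keyspace_id is a stand-in hash (the first 8 bytes of MD5), not the output of Vitess's hash or xxhash vindex. All names are hypothetical.

```python
import hashlib


def toy_keyspace_id(value: str) -> bytes:
    """Stand-in functional vindex: first 8 bytes of MD5. NOT Vitess's real hash output."""
    return hashlib.md5(value.encode("utf-8")).digest()[:8]


def shard_of(keyspace_id: bytes) -> str:
    """Two-shard keyspace split at 0x80."""
    return "-80" if keyspace_id < b"\x80" else "80-"


class ToyLookupVindex:
    """Unique lookup: email -> keyspace_id rows, stored in a table sharded by email."""

    def __init__(self) -> None:
        # shard name -> {email: keyspace_id}; each row lives on the shard of hash(email)
        self.lookup_shards: dict[str, dict[str, bytes]] = {"-80": {}, "80-": {}}

    def insert(self, email: str, keyspace_id: bytes) -> None:
        self.lookup_shards[shard_of(toy_keyspace_id(email))][email] = keyspace_id

    def route(self, email: str) -> tuple[str, str]:
        """Hop 1: read the mapping from one lookup shard. Hop 2: the base-table shard."""
        lookup_shard = shard_of(toy_keyspace_id(email))
        keyspace_id = self.lookup_shards[lookup_shard][email]  # KeyError models "no mapping"
        return lookup_shard, shard_of(keyspace_id)


vindex = ToyLookupVindex()
vindex.insert("ada@example.com", toy_keyspace_id("42"))  # customer row placed by id 42
print(vindex.route("ada@example.com"))  # (lookup shard, base-table shard)
```

Both hops touch exactly one shard. If the lookup rows were placed by any other key, hop one would have to search every lookup shard.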

Under sustained write load, SREs must treat lookup tables as first-class operational objects. Monitor their row growth. Shard each lookup table by its own lookup column (the from column), so that resolving a mapping is itself a single-shard read rather than a scatter. The choice between a functional and a lookup vindex is ultimately a choice about which access patterns get single-shard latency and which pay the cross-shard tax. That decision flows directly from the keyspace partitioning model selected upstream.

Routing Rules, Query Rewriting, and Traffic Shifting

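Beyond static vindex mappings, Vitess supports dynamic routing rules that operate at the keyspace and table level. A routing rule intercepts a query’s target and redirects it before vindex resolution runs. Rules are keyed by from_table rather than evaluated as an ordered list:

- A tablet-type suffix such as @replica applies only to queries served by that tablet type, and for those queries it takes precedence over the bare entry.
- A from_table that appears twice is treated as an error rather than resolved by position.

Precedence between entries, not just the correctness of each one, is therefore a property you must design for.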

{
  "rules": [
    {
      "from_table": "commerce.orders",
      "to_tables": ["commerce_new.orders"]
    },
    {
      "from_table": "commerce.orders@replica",
      "to_tables": ["commerce.orders"]
    }
  ]
}

This pattern underpins blue-green cutovers, gradual resharding, and legacy migrations. Replica reads can stay pinned to the source keyspace while primary traffic moves, tables can move between keyspaces transparently, and a botched target can be reverted by rewriting one rule. A stray or overlapping rule can silently blackhole or misroute traffic. For that reason, dynamic routing rules and query rewriting documents the full evaluation model, and debugging VSchema routing rule conflicts walks through diagnosing precedence collisions with VEXPLAIN. Rules are applied with vtctldclient ApplyRoutingRules. Like VSchema, they propagate to every VTGate asynchronously, so a rule deployment carries the same version-and-verify obligation as any other routing change.
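The sketch below is a minimal Python model of how a rule set like the one above maps a table and tablet type to a target. Its assumptions:

- Rules are keyed by from_table.
- For a query served by a non-primary tablet type, a matching @<type> entry wins over the bare entry.
- Rules are not chained.
- As a deliberately strict pre-flight, duplicate or multi-target entries reject the whole set.

build_rules and resolve are hypothetical helpers, not VTGate code.

```python
def build_rules(doc: dict) -> dict[str, str]:
    """Index routing rules by from_table, rejecting ambiguity instead of guessing."""
    rules: dict[str, str] = {}
    for rule in doc.get("rules", []):
        src, targets = rule["from_table"], rule["to_tables"]
        if src in rules:
            raise ValueError(f"duplicate rule for {src}")
        if len(targets) != 1:
            raise ValueError(f"{src} must have exactly one target, got {targets}")
        rules[src] = targets[0]
    return rules


def resolve(rules: dict[str, str], table: str, tablet_type: str = "primary") -> str:
    """Table a query actually hits: tablet-type rule, then bare rule, then the table itself."""
    typed = f"{table}@{tablet_type}"
    if tablet_type != "primary" and typed in rules:
        return rules[typed]
    return rules.get(table, table)  # no chaining: a rule's target is not looked up again


RULES = build_rules({"rules": [
    {"from_table": "commerce.orders", "to_tables": ["commerce_new.orders"]},
    {"from_table": "commerce.orders@replica", "to_tables": ["commerce.orders"]},
]})
print(resolve(RULES, "commerce.orders"))             # commerce_new.orders
print(resolve(RULES, "commerce.orders", "replica"))  # commerce.orders
```

Replaying production query shapes through a model like this before ApplyRoutingRules makes precedence visible. In particular, it shows which tablet types still reach the old keyspace.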

Async Validation and Online DDL Coordination

VSchema does not evolve in isolation. When a table’s structure changes, its routing definition frequently must change with it — a new column may become a secondary vindex, an altered key may change shard placement, a dropped column may invalidate a lookup. Applying a schema change and a routing change out of order across a sharded keyspace is a classic trigger for cascading routing failures, which is why VSchema management is inseparable from Online DDL orchestration.

The safe pattern is to gate every VSchema change behind asynchronous validation. An async VSchema validation workflow parses the proposed document, simulates routing plans against a corpus of historical queries, and checks compatibility with existing lookup tables before the change touches the topology server. Wiring that validation into CI/CD turns configuration drift into a build-time failure instead of a production incident, and produces an auditable change record for every routing mutation. When schema and VSchema must move together, the DDL and the ApplyVSchema call are sequenced so that no VTGate ever plans a query against a table whose physical shape and logical routing disagree.
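The cheapest validation stage runs before any plan simulation. The sketch below is a hedged Python pre-flight that enforces the rules stated earlier: every table in a sharded keyspace has a primary vindex, and that first entry is a unique type. It also adds two referential checks: referenced vindexes exist, and lookup owners are declared tables. It is not exhaustive, and VTGate's own VSchema build checks more. UNIQUE_TYPES is a partial, assumed list, and validate_vschema is a hypothetical helper.

```python
# Partial list of unique vindex types (assumption; extend for your Vitess version).
UNIQUE_TYPES = {
    "hash", "xxhash", "unicode_loose_md5", "unicode_loose_xxhash",
    "binary_md5", "numeric", "lookup_unique", "consistent_lookup_unique",
}


def validate_vschema(doc: dict) -> list[str]:
    """Return human-readable problems; an empty list means the static checks passed."""
    errors = []
    vindexes = doc.get("vindexes", {})
    tables = doc.get("tables", {})
    for tname, table in tables.items():
        if table.get("type") in ("reference", "sequence"):
            continue  # these table types are not placed by a primary vindex
        bindings = table.get("column_vindexes", [])
        if doc.get("sharded", False) and not bindings:
            errors.append(f"{tname}: no primary vindex")
        for i, binding in enumerate(bindings):
            name = binding.get("name")
            if name not in vindexes:
                errors.append(f"{tname}: unknown vindex {name!r}")
            elif i == 0 and vindexes[name].get("type") not in UNIQUE_TYPES:
                errors.append(f"{tname}: primary vindex {name!r} is not unique")
    for vname, spec in vindexes.items():
        owner = spec.get("owner")
        if owner and owner not in tables:
            errors.append(f"vindex {vname}: owner table {owner!r} not declared")
    return errors
```

A non-empty result should fail the CI job before the document ever reaches ApplyVSchema.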

Operational Considerations: Tuning Knobs and Misconfigurations

VTGate keeps the parsed VSchema and routing rules in memory, together with a cache of compiled query plans, so that no query pays for a topology lookup. Caching accelerates routing but opens a consistency window during updates. The flags below bound how far a bad plan or a slow transaction can spread. They must be calibrated against the workload rather than left at defaults. The Binary column notes which process reads each flag.

Flag	Binary	Type	Default	Recommended (production)
`--query_timeout`	VTGate	int (milliseconds)	`0` (unbounded)	`30000`: bound worst-case scatter fan-out
`--max_memory_rows`	VTGate	int	`300000`	`100000`: cap rows buffered for scatter aggregation to protect heap
`--queryserver-config-transaction-timeout`	VTTablet	duration	`30s`	`20s`: release cross-shard transaction locks early
`--transaction_mode`	VTGate	enum	`MULTI`	`MULTI` (use `TWOPC` only where atomic cross-shard writes are mandatory)
`--schema_change_reload_timeout`	VTTablet	duration	`30s`	`30s`: timeout for reloading the MySQL schema after a DDL; raise only if reloads time out on very large schemas
`--enable-views`	VTGate	bool	`false`	enable only if the VSchema declares views

Common misconfigurations reduce to a few recurring mistakes:

- Declaring a table without a primary vindex forces scatter on every access.
- Leaving --query_timeout unbounded lets a single missing predicate become a fleet-wide scan that exhausts connection pools.
- Using a string vindex such as unicode_loose_md5 on an integer column wastes CPU on hash computation and can skew distribution.
- Sharding a lookup table by something other than its lookup column makes every mapping resolution a scatter before the base-table read even starts.

Each of these is invisible in a functional test and only manifests under production concurrency. That is why plan inspection with VEXPLAIN belongs in the deploy pipeline, not the incident retrospective.
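Some of these mistakes are mechanically detectable before deploy. The sketch below is a hedged lint step. It assumes the pipeline already holds VTGate settings as a dict of integers, and column types pulled from information_schema. Both input shapes are hypothetical, and the thresholds come from the table above.

```python
# Illustrative, partial sets used by the string-vindex check.
INTEGER_TYPES = {"tinyint", "smallint", "mediumint", "int", "bigint"}
STRING_VINDEX_TYPES = {"unicode_loose_md5", "unicode_loose_xxhash"}


def lint_routing(flags: dict[str, int], vschema: dict,
                 column_types: dict[tuple[str, str], str]) -> list[str]:
    """Report statically visible misconfigurations.

    flags: hypothetical pre-parsed settings, e.g. {"query_timeout": 0, "max_memory_rows": 300000}.
    column_types: (table, column) -> MySQL base type name.
    """
    findings = []
    if flags.get("query_timeout", 0) == 0:  # 0 means no default query timeout
        findings.append("query_timeout is unbounded")
    if flags.get("max_memory_rows", 300000) > 100000:
        findings.append("max_memory_rows exceeds the recommended 100000")
    vindexes = vschema.get("vindexes", {})
    for table, spec in vschema.get("tables", {}).items():
        for binding in spec.get("column_vindexes", []):
            vtype = vindexes.get(binding["name"], {}).get("type")
            ctype = column_types.get((table, binding["column"]), "")
            if vtype in STRING_VINDEX_TYPES and ctype in INTEGER_TYPES:
                findings.append(f"{table}.{binding['column']}: string vindex {vtype} on {ctype} column")
    return findings
```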

Failure Modes and Recovery Patterns

Routing failures are rarely loud; they present as latency and pool exhaustion long before they present as errors. The named scenarios below cover the failures that VSchema owners actually page on.

Scatter storm from a dropped predicate. Root cause: an application change or a rewritten query loses the sharding-key predicate, so VTGate fans every request across all shards. Symptoms: single-shard hit-rate metric collapses, p99 latency climbs with shard count, MySQL Threads_running spikes uniformly. Mitigation: bound --query_timeout and --max_memory_rows so the blast radius is capped; confirm the regression with VEXPLAIN PLAN; restore the predicate or add the missing secondary vindex.

VSchema propagation skew. Root cause: an ApplyVSchema reaches some VTGate instances before others, so the fleet plans the same query two different ways during the reload window. Symptoms: intermittent routing errors or duplicated writes that correlate with the deploy timestamp. Mitigation: poll every VTGate for the new VSchema version before shifting traffic; never sequence a routing-rule cutover to begin until all nodes report the target version.

Lookup vindex drift. Root cause: an unowned lookup table falls out of sync with its base table, so VTGate routes reads to a shard that no longer holds the row. Symptoms: queries by the lookup column return nothing for a key that demonstrably exists; lookup-table and base-table row counts disagree. Mitigation: prefer owned consistent_lookup vindexes, which VTGate maintains on every write to the owner table. Run a periodic reconciliation job, and rebackfill from the base table when drift is detected.
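The core of a reconciliation job is a three-way set comparison. Both sides are assumed to be extracted as {lookup value: keyspace ID}: the base side by scanning the base table and applying its primary vindex, the lookup side by scanning the lookup table. That extraction is left out, and reconcile is a hypothetical helper. With consistent_lookup vindexes, orphaned lookup rows may be an expected by-product of failed writes. The "dangling" bucket may therefore warrant lower severity than "missing".

```python
def reconcile(base_rows: dict[str, bytes], lookup_rows: dict[str, bytes]) -> dict[str, set[str]]:
    """Compare base-table truth against the lookup table, keyed by lookup value."""
    base_keys, lookup_keys = set(base_rows), set(lookup_rows)
    return {
        "missing": base_keys - lookup_keys,     # row exists but has no mapping: lookups miss it
        "dangling": lookup_keys - base_keys,    # mapping with no base row behind it
        "mismatched": {k for k in base_keys & lookup_keys if base_rows[k] != lookup_rows[k]},
    }
```

Rebackfill the "missing" and "mismatched" keys first, since those are the ones that misroute live reads.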

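Routing-rule precedence collision. Root cause: overlapping entries cover the same table, so different query shapes resolve to different targets. A typical case is a bare commerce.orders rule and a commerce.orders@replica rule left pointing at different keyspaces after a partial cutover or revert. Symptoms: traffic lands in the wrong keyspace during a migration; reads and writes disagree on target. Mitigation: keep the rule set minimal, with exactly one entry per from_table and tablet-type overrides only where intended. Validate with VEXPLAIN before apply, and keep the previous rule set as an immediate rollback artifact.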

Python Orchestration Integration

Platform and automation teams interact with this layer through the vtctldclient gRPC API and the vtadmin HTTP API rather than by editing files on disk. The durable pattern is an idempotent apply-and-verify loop: write the routing change, then poll until every serving VTGate reports the expected VSchema before continuing. The sketch below wraps ApplyVSchema with that discipline: exponential backoff, an explicit per-gate check, and a hard timeout. This is the contract that automating VSchema sync with Python scripts builds out in full.

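import json
import subprocess
import time
import urllib.request

def apply_vschema(keyspace: str, vschema_path: str) -> None:
    """Write the VSchema document into the topology server (raises on failure)."""
    subprocess.run(
        ["vtctldclient", "ApplyVSchema",
         "--vschema-file", vschema_path, keyspace],
        check=True,
    )

def gate_serves_tables(gate: str, keyspace: str, tables: set[str]) -> bool:
    """Ask one VTGate (host:port) which tables it has loaded for the keyspace.

    Assumes the gate's /debug/vschema endpoint returns JSON shaped like
    {"keyspaces": {keyspace: {"tables": {...}}}}; adjust to your version's output.
    """
    with urllib.request.urlopen(f"http://{gate}/debug/vschema", timeout=5) as resp:
        live = json.load(resp)
    loaded = live.get("keyspaces", {}).get(keyspace, {}).get("tables", {})
    return tables <= set(loaded)

def wait_for_convergence(keyspace: str, vschema_path: str,
                         gates: list[str], timeout_s: int = 60) -> None:
    """Block until every VTGate serves the tables declared in the applied file.

    Table presence is a coarse signal; a stricter check would also compare
    vindex definitions after normalising them to the gate's output format.
    """
    with open(vschema_path) as f:
        expected = set(json.load(f).get("tables", {}))
    deadline = time.monotonic() + timeout_s
    backoff = 1.0
    pending = set(gates)
    while pending:
        for gate in list(pending):  # query each gate directly, not the topology server
            try:
                if gate_serves_tables(gate, keyspace, expected):
                    pending.discard(gate)
            except OSError:
                pass  # gate unreachable or restarting: retry on the next pass
        if not pending:
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"VSchema not converged on {sorted(pending)}")
        time.sleep(backoff)
        backoff = min(backoff * 2, 8.0)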

The same loop generalises to ApplyRoutingRules. Automation should always drive routing changes through validation (parse, simulate plans, diff against production traffic), an idempotent apply, and topology-state polling — never a fire-and-forget write. Application traffic, meanwhile, reaches VTGate through a standard MySQL driver, so the routing layer stays invisible to the query path even as orchestration rewrites it underneath.
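One way to make that generalisation explicit is to separate the retry discipline from the check itself. The sketch below takes a probe callable, which could be a per-gate VSchema check or a routing-rules check, and injectable clock and sleep functions so the loop can be tested without waiting. wait_until_converged and its parameters are hypothetical names.

```python
import time
from typing import Callable, Iterable


def wait_until_converged(targets: Iterable[str], probe: Callable[[str], bool],
                         timeout_s: float = 60.0, max_backoff_s: float = 8.0,
                         sleep: Callable[[float], None] = time.sleep,
                         clock: Callable[[], float] = time.monotonic) -> None:
    """Retry probe(target) with exponential backoff until all pass or the deadline expires."""
    pending = set(targets)
    deadline = clock() + timeout_s
    backoff = 1.0
    while True:
        for target in list(pending):
            try:
                if probe(target):
                    pending.discard(target)
            except OSError:
                pass  # transient network failure: treat as "not yet converged"
        if not pending:
            return
        if clock() >= deadline:
            raise TimeoutError(f"not converged on {sorted(pending)}")
        sleep(backoff)
        backoff = min(backoff * 2, max_backoff_s)
```

Keeping the probe pluggable means VSchema and routing-rule rollouts share one audited retry policy and differ only in what "converged" means.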

Related

Mastering VSchema Syntax and Structure — the full VSchema grammar: vindex types, sequences, reference tables, and table bindings.
Configuring Lookup Vindexes for Cross-Shard Joins — ownership, backfill, and shard affinity for routing on secondary keys.
Dynamic Routing Rules and Query Rewriting — rule evaluation order, traffic shifting, and transparent table moves.
Async VSchema Validation Workflows — gate every routing change behind plan simulation and CI/CD checks.
VTGate Routing Architecture Deep Dive — how the proxy parses SQL, caches plans, and dispatches to shards.
