Designing Horizontal Shard Topologies

Horizontal sharding is how a single MySQL keyspace grows past the write ceiling of one primary: the data is split across many independent VTTablet/mysqld pairs, each owning a contiguous slice of the key space, while VTGate presents the fleet to the application as one logical database. This page addresses the design problem that must be solved before any of that automation can work correctly: deciding how many shards to provision, where each shard’s key range begins and ends, and which failure domain each tablet lands in, so that the topology scales linearly, splits cleanly, and survives the loss of a zone. It is written for database platform engineers and MySQL SREs who own the shard map and the on-call consequences of getting it wrong. The upstream partitioning decision (range, hash, or lookup distribution) is assumed settled; this page is about turning that decision into a physical layout.

Prerequisites

Before provisioning shards, confirm the following are in place:

Vitess 18.0 or later (vtctldclient command surface and native Reshard v2 workflows). Earlier releases use vtctlclient and a different reshard syntax.
A running topology server — an etcd (v3) or Consul cluster reachable from every cell — that stores the keyspace, shard, and tablet records that make up the serving graph.
A chosen partitioning model. You must already know whether the keyspace is range-, hash-, or lookup-distributed; work through Understanding Vitess Keyspace Partitioning Models first if that is unsettled, because it determines whether shard boundaries are semantic or purely arithmetic.
A shard-count estimate grounded in projected write QPS, dataset size, and per-shard IOPS headroom — the arithmetic is worked through in How to Calculate Optimal Shard Count for MySQL.
A sharding key candidate with high cardinality and uniform entropy; validating that choice against a real access pattern is covered in Shard Key Selection Best Practices for E-commerce.
Familiarity with the query path — how the stateless VTGate routing layer turns a WHERE predicate into a targeted or scatter query — since the topology exists to keep hot-path queries single-shard.

Core Mechanism: How a Row Resolves to a Shard

A Vitess shard is not a named bucket you assign rows to — it is a key-range interval over an 8-byte keyspace ID space, and a row belongs to whichever shard’s interval contains its keyspace ID. Understanding this arithmetic is what makes the rest of the design deterministic.

The keyspace ID. For every row, VTGate computes a keyspace ID by applying the table’s primary vindex to its sharding-key column. A hash vindex, for example, runs the value through a null-key DES hash to produce a uniformly distributed 8-byte value. That keyspace ID — not the raw column — is what the shard ranges are compared against.

Key ranges as binary prefixes. Shards are named by the lower and upper bound of their range in hexadecimal, and the bound is a prefix, not a full 8 bytes. -80 means “keyspace IDs from 0x00… up to but excluding 0x80…”; 80- means “0x80… to the maximum.” A two-shard keyspace splits the space at the top bit. A four-shard keyspace splits at the top two bits: -40, 40-80, 80-c0, c0-. Because every boundary is a binary prefix, a shard always bisects cleanly into two children that together cover exactly the parent’s range:

-80  splits into  -40  and  40-80
80-  splits into  80-c0 and c0-

This is why practitioners provision shard counts as powers of two. A 4-shard keyspace can grow to 8 by bisecting each shard once, with no keyspace ID ever changing owner except across a single clean boundary — which is exactly what makes VReplication-backed resharding a mechanical copy rather than a full redistribution. Provisioning 5 or 6 shards forces uneven ranges and turns every future split into a bespoke migration.
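
The prefix arithmetic above can be sketched in a few lines of Python. This is an illustration only, not part of Vitess: the helper names `shard_names` and `bisect` are hypothetical, and the sketch assumes one-byte bounds (at most 256 shards), which covers the layouts discussed on this page.

```python
def shard_names(count: int) -> list[str]:
    """Return equal-width shard names for a power-of-two shard count (2..256)."""
    if count < 2 or count > 256 or count & (count - 1):
        raise ValueError("count must be a power of two between 2 and 256")
    step = 256 // count
    # Interior boundaries as one-byte hex prefixes, e.g. 40, 80, c0 for four shards.
    bounds = [f"{i * step:02x}" for i in range(1, count)]
    edges = [""] + bounds + [""]  # empty string marks the open minimum/maximum
    return [f"{lo}-{hi}" for lo, hi in zip(edges, edges[1:])]


def bisect(shard: str) -> tuple[str, str]:
    """Split one shard name at its midpoint, e.g. '-80' -> ('-40', '40-80')."""
    lo_hex, hi_hex = shard.split("-")
    if len(lo_hex) > 2 or len(hi_hex) > 2:
        raise ValueError("this sketch only handles one-byte bounds")
    lo = int(lo_hex, 16) if lo_hex else 0
    hi = int(hi_hex, 16) if hi_hex else 256
    width = hi - lo
    if width < 2 or width % 2:
        raise ValueError("range cannot be bisected on a one-byte boundary")
    mid = f"{lo + width // 2:02x}"
    return f"{lo_hex}-{mid}", f"{mid}-{hi_hex}"


if __name__ == "__main__":
    print(shard_names(4))
    print(bisect("-80"))
```

Calling `bisect` on every name returned by `shard_names(4)` yields the eight-shard layout, which is the growth path the paragraph above describes.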

The serving graph binds it together. The topology server stores each shard’s range and the current healthy tablet for every (keyspace, shard, tablet_type) tuple. VTGate caches this serving graph and, for a query with an equality predicate on the sharding key, computes one keyspace ID, finds the one shard whose range contains it, and issues a single targeted query. With no such predicate it falls back to scatter-gather across every shard — so the whole point of the topology design is to align shard boundaries with the dominant access pattern, keeping hot-path reads targeted. Cross-shard writes that cannot be avoided escalate into distributed transactions, whose commit cost is the tax you are designing to minimise.
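
The lookup step, finding the one shard whose range contains a keyspace ID, can be sketched as below. This is a simplified model rather than VTGate's implementation: the function names are hypothetical and the keyspace IDs are illustrative byte strings, not real vindex outputs. It relies on the fact that Python compares `bytes` lexicographically, so unpadded prefix bounds behave like the half-open intervals described above.

```python
def contains(shard: str, keyspace_id: bytes) -> bool:
    """True if keyspace_id falls in the half-open range named by shard."""
    lo_hex, hi_hex = shard.split("-")
    lo, hi = bytes.fromhex(lo_hex), bytes.fromhex(hi_hex)
    # An empty lower bound matches everything; an empty upper bound means "to the maximum".
    return keyspace_id >= lo and (not hi or keyspace_id < hi)


def resolve(shards: list[str], keyspace_id: bytes) -> str:
    """Return the single shard that owns keyspace_id, or raise if the map is broken."""
    owners = [s for s in shards if contains(s, keyspace_id)]
    if len(owners) != 1:
        raise LookupError(f"expected exactly one owner, found {owners}")
    return owners[0]


if __name__ == "__main__":
    shards = ["-40", "40-80", "80-c0", "c0-"]
    print(resolve(shards, bytes.fromhex("7fffffffffffffff")))  # last ID below 0x80
    print(resolve(shards, bytes.fromhex("8000000000000000")))  # first ID at 0x80
```

A query without a sharding-key predicate has no keyspace ID to feed into `resolve`, which is why it has to visit every shard.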

Step-by-Step: Provisioning a Sharded Topology

The sequence below builds a four-shard commerce keyspace from an unsharded starting point. Each step is independently verifiable, so you can halt and inspect state before proceeding.

1. Declare the sharded VSchema. The primary vindex is what makes a table routable. Without it, VTGate scatters every query regardless of how well the physical shards are laid out. Define the vindex and bind the sharding-key column of each table to it:

{
  "sharded": true,
  "vindexes": {
    "hash": { "type": "hash" }
  },
  "tables": {
    "orders": {
      "column_vindexes": [
        { "column": "customer_id", "name": "hash" }
      ]
    },
    "customers": {
      "column_vindexes": [
        { "column": "id", "name": "hash" }
      ]
    }
  }
}

Sharding orders by customer_id (rather than order_id) co-locates a customer’s whole order history on one shard, so the common “fetch this customer’s orders” read stays single-shard. Apply it:

vtctldclient ApplyVSchema --vschema-file commerce.vschema.json commerce

2. Create the target shards. Declare the four key ranges in the topology server. This creates the shard records; no tablets serve them yet:

for range in -40 40-80 80-c0 c0-; do
  vtctldclient CreateShard commerce/$range
done
vtctldclient GetShards commerce   # expect the four ranges listed

3. Provision tablets across failure domains. Each shard needs at least a PRIMARY, a promotable REPLICA, and an RDONLY for batch and VReplication source traffic — and no two of a shard’s tablets may share a failure domain. Register each tablet in the cell that maps to its availability zone so the topology reflects physical placement:

# One tablet of shard 40-80, pinned to cell zone-b.
# /vitess/global is an assumed topology root; use the root your cluster was initialised with.
vttablet \
  --topo_implementation etcd2 \
  --topo_global_server_address etcd.internal:2379 \
  --topo_global_root /vitess/global \
  --tablet-path zone-b-0000000201 \
  --init_keyspace commerce \
  --init_shard 40-80 \
  --init_tablet_type replica \
  --db_app_user vt_app

Placement is the design decision that most often gets skipped. A shard whose PRIMARY and only REPLICA share a rack has no real availability — losing that rack loses the shard. Spread each shard’s tablets so that any single zone failure leaves a promotable replica and preserves topology-server quorum.
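
A placement audit along these lines can be run before go-live. The sketch below is a minimal example under stated assumptions: the input is a hypothetical list of dictionaries with `shard`, `cell`, and `type` keys (for instance, assembled from tablet records by your own tooling), and each cell is assumed to map to exactly one failure domain.

```python
from collections import Counter, defaultdict


def placement_problems(tablets: list[dict]) -> dict[str, list[str]]:
    """Map each badly placed shard to a list of human-readable issues."""
    by_shard: dict[str, list[dict]] = defaultdict(list)
    for tablet in tablets:
        by_shard[tablet["shard"]].append(tablet)

    problems: dict[str, list[str]] = {}
    for shard, members in sorted(by_shard.items()):
        issues = []
        # Rule 1: no two tablets of one shard share a cell (failure domain).
        per_cell = Counter(t["cell"] for t in members)
        crowded = sorted(cell for cell, n in per_cell.items() if n > 1)
        if crowded:
            issues.append(f"multiple tablets in cell(s): {', '.join(crowded)}")
        # Rule 2: losing any one cell must leave a promotable tablet elsewhere.
        promotable_cells = {
            t["cell"] for t in members if t["type"] in ("primary", "replica")
        }
        if len(promotable_cells) < 2:
            issues.append("primary and replicas are confined to one cell")
        if issues:
            problems[shard] = issues
    return problems


if __name__ == "__main__":
    fleet = [
        {"shard": "-80", "cell": "zone-a", "type": "primary"},
        {"shard": "-80", "cell": "zone-a", "type": "replica"},
        {"shard": "-80", "cell": "zone-b", "type": "rdonly"},
    ]
    print(placement_problems(fleet))
```

An empty result means every shard passed both rules; anything else is worth fixing before traffic arrives.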

4. Initialise a primary per shard. Pick the starting PRIMARY for each shard and let Vitess wire up replication:

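# Placeholder aliases (zone-a-0000000100 ... 0000000400): one distinct tablet per shard.
uid=100
for range in -40 40-80 80-c0 c0-; do
  vtctldclient PlannedReparentShard "commerce/$range" \
    --new-primary "zone-a-$(printf '%010d' "$uid")"
  uid=$((uid + 100))
done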

Durable failover depends on semi-synchronous replication and calibrated health thresholds at this layer — the full tablet configuration is detailed in Configuring VTTablet for High Availability.

5. Move data in from the source. To split an unsharded keyspace into a new sharded one, use MoveTables to copy from the original keyspace; to grow an already-sharded keyspace, use Reshard to bisect its shards. Both are VReplication workflows that copy online and keep the source authoritative until you switch. The example below shows the Reshard form, splitting an existing two-shard layout (-80, 80-) into the four target ranges:

vtctldclient Reshard --workflow split4 --target-keyspace commerce create \
  --source-shards '-80,80-' --target-shards '-40,40-80,80-c0,c0-'

vtctldclient Reshard --workflow split4 --target-keyspace commerce show

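6. Cut over traffic in stages. Switch reads first, verify, then switch writes. The example below switches RDONLY and REPLICA reads in one command; issue them as two separate switchtraffic calls if you want to verify between them. Staging the cutover keeps the write path revertible until the very last step: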

vtctldclient Reshard --workflow split4 --target-keyspace commerce switchtraffic \
  --tablet-types rdonly,replica
vtctldclient Reshard --workflow split4 --target-keyspace commerce switchtraffic \
  --tablet-types primary
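
For automation, the same staging can be expressed as a loop that stops as soon as a verification fails. This is a sketch, not a recommended controller: `staged_cutover` and the `verify` callback are hypothetical names, the command line reuses only the flags shown above, and the `run` parameter exists so the logic can be exercised without a live cluster.

```python
import subprocess
from typing import Callable

# Reads before writes: the primary switch is always the last stage.
STAGES = ("rdonly", "replica", "primary")


def staged_cutover(
    keyspace: str,
    workflow: str,
    verify: Callable[[str], bool],
    run: Callable[..., object] = subprocess.run,
) -> list[str]:
    """Switch one tablet type at a time; return the stages that were switched."""
    switched = []
    for tablet_type in STAGES:
        run(
            ["vtctldclient", "Reshard", "--workflow", workflow,
             "--target-keyspace", keyspace, "switchtraffic",
             "--tablet-types", tablet_type],
            check=True,
        )
        switched.append(tablet_type)
        # Halt before the next stage if the caller's check does not pass.
        if not verify(tablet_type):
            break
    return switched


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    staged_cutover("commerce", "split4", verify=lambda _: True,
                   run=lambda argv, check: print(" ".join(argv)))
```

Because a failed verification stops the loop before the `primary` stage, writes stay on the source shards until both read stages have been confirmed.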

7. Drive it from automation idempotently. Orchestration controllers should read persisted workflow state before acting so retries never double-switch. The control plane is spoken over vtctldclient / the vtadmin API, not the SQL port:

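import json
import subprocess

def workflow_state(keyspace: str, workflow: str) -> dict:
    """Return the parsed JSON printed by `vtctldclient Workflow ... show`."""
    out = subprocess.run(
        ["vtctldclient", "Workflow", "--keyspace", keyspace,
         "show", "--workflow", workflow],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout or "{}")

def ready_to_switch(keyspace: str, workflow: str) -> bool:
    """True only if the workflow exists and every stream reports Running."""
    # Guard against a missing or empty "workflows" list before indexing into it.
    workflows = workflow_state(keyspace, workflow).get("workflows") or [{}]
    shard_streams = workflows[0].get("shard_streams", {})
    # Each stream reports its lifecycle in the "state" field (Copying, Running, ...).
    states = [
        s.get("state")
        for shard in shard_streams.values()
        for s in shard.get("streams", [])
    ]
    # all() is True for an empty list, so require at least one stream explicitly.
    return bool(states) and all(state == "Running" for state in states)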

Configuration Reference

The flags below govern how the topology behaves once tablets are serving. Defaults suit a demo; the recommended column is a production starting point to tune against measured load.

Flag	Component	Type	Default	Recommended (production)
`--init_shard`	`VTTablet`	string	(none)	The exact key range this tablet serves, e.g. `40-80` — must match a shard record
`--init_tablet_type`	`VTTablet`	enum	`replica`	`replica` or `rdonly`; never bootstrap directly as `primary`
`--health_check_interval`	`VTTablet`	duration	`20s`	`5s`–`10s` so `VTOrc` detects a failed primary within RTO
`--degraded_threshold`	`VTTablet`	duration	`30s`	Tune to replication SLA; a replica lagging beyond it is reported as degraded, while `--unhealthy_threshold` is the separate limit at which it stops serving
`--transaction_mode`	`VTGate`	enum	`MULTI`	`SINGLE` unless atomic cross-shard writes are required — makes accidental cross-shard writes fail fast
`--max_memory_rows`	`VTGate`	int	`300000`	Lower toward `100000` so a runaway scatter fails before it OOMs the gateway
`--warn_sharded_only`	`VTGate`	bool	`false`	`true` in staging to surface queries that rely on unsharded-only features during design validation
`--stream_health_buffer_size`	`VTTablet`	int	`20`	Raise on fleets with many `VTGate` watchers to avoid dropped health updates
`--vreplication_copy_phase_max_innodb_history_list_length`	`VTTablet`	int	`1000000`	Throttle the copy phase before InnoDB history bloats during a reshard

Failure Modes Specific to Topology Design

Uneven ranges forcing a bespoke split. Root cause: a shard count that is not a power of two, or hand-assigned ranges of unequal width, so no clean bisection boundary exists. Symptoms: Reshard requires manually computed target ranges; one shard’s mysqld shows disproportionate size and IOPS in per-shard metrics. Mitigation: re-provision to a power-of-two layout; when splitting an oversized shard, bisect on its natural midpoint (-80 → -40, 40-80) rather than inventing an arbitrary boundary.
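
A shard list can be checked for this failure mode before any Reshard is planned. The following is a rough sketch with a hypothetical function name; it treats each bound as a hex prefix padded to 8 bytes and reports gaps, overlaps, unequal widths, and non-power-of-two counts.

```python
FULL = 1 << 64  # one past the largest 8-byte keyspace ID


def _bound(prefix: str, when_empty: int) -> int:
    """Convert a hex prefix such as '40' to its 64-bit boundary value."""
    return int(prefix.ljust(16, "0"), 16) if prefix else when_empty


def range_problems(shards: list[str]) -> list[str]:
    """Return a list of layout problems; an empty list means a clean layout."""
    if not shards:
        return ["no shards defined"]
    spans = sorted(
        (_bound(lo, 0), _bound(hi, FULL), f"{lo}-{hi}")
        for lo, hi in (name.split("-") for name in shards)
    )
    problems = []
    if spans[0][0] != 0:
        problems.append("first shard does not start at the minimum")
    if spans[-1][1] != FULL:
        problems.append("last shard does not end at the maximum")
    # Adjacent shards must meet exactly: one shard's end is the next one's start.
    for (_, end, left), (start, _, right) in zip(spans, spans[1:]):
        if end != start:
            problems.append(f"gap or overlap between {left} and {right}")
    if len({end - start for start, end, _ in spans}) > 1:
        problems.append("ranges are not equal width")
    if len(spans) & (len(spans) - 1):
        problems.append("shard count is not a power of two")
    return problems


if __name__ == "__main__":
    print(range_problems(["-40", "40-80", "80-c0", "c0-"]))
    print(range_problems(["-40", "40-c0", "c0-"]))
```

The first call describes the four-shard layout used throughout this page; the second describes a three-shard layout with one double-width range, the kind that forces a hand-computed split later.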

Co-located quorum. Root cause: a shard’s PRIMARY and REPLICA land in the same availability zone or rack despite appearing as distinct tablets. Symptoms: a single zone outage marks an entire shard unavailable; vtctldclient GetTablets shows no surviving promotable replica for that shard. Mitigation: enforce anti-affinity at provisioning time by mapping cells to zones and spreading each shard’s tablets across at least three zones; audit placement before go-live, not after the incident.

Hotspot shard from a low-entropy key. Root cause: a sharding key that concentrates writes — a monotonic timestamp under a range vindex, or a tenant ID where one tenant dominates. Symptoms: one shard’s Threads_running and write latency climb while peers idle; scatter latency is dominated by the slowest shard. Mitigation: revisit the key against the guidance in Shard Key Selection Best Practices for E-commerce; if the key cannot change, add a secondary lookup vindex so hot access paths still resolve to one shard.

Silent scatter from a missing primary vindex. Root cause: a table declared in the sharded VSchema without a column_vindexes entry, so VTGate cannot compute a keyspace ID and fans every query out. Symptoms: p99 latency scales with shard count; the single-shard hit-rate metric collapses; vtexplain reports a Scatter plan for the table even when the query filters on the intended sharding key. Mitigation: every routable table needs a primary vindex before the shards go live; catch omissions with vtexplain in the deploy pipeline.
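
A static check on the VSchema file can catch the omission even earlier in the pipeline. The sketch below is one possible lint, with a hypothetical function name; it assumes the single-keyspace VSchema shape shown in step 1 and skips tables declared with `"type": "reference"`, which do not need a vindex.

```python
import json


def vschema_problems(vschema: dict) -> list[str]:
    """List tables in a sharded VSchema that lack a usable primary vindex."""
    if not vschema.get("sharded"):
        return []  # unsharded keyspaces need no vindexes
    defined = set(vschema.get("vindexes", {}))
    problems = []
    for name, table in sorted(vschema.get("tables", {}).items()):
        if table.get("type") == "reference":
            continue
        column_vindexes = table.get("column_vindexes") or []
        if not column_vindexes:
            problems.append(f"{name}: no column_vindexes entry")
        # The first column vindex listed acts as the table's primary vindex.
        elif column_vindexes[0].get("name") not in defined:
            problems.append(f"{name}: primary vindex is not defined in 'vindexes'")
    return problems


if __name__ == "__main__":
    sample = json.loads("""
    {
      "sharded": true,
      "vindexes": {"hash": {"type": "hash"}},
      "tables": {
        "orders": {"column_vindexes": [{"column": "customer_id", "name": "hash"}]},
        "line_items": {}
      }
    }
    """)
    for problem in vschema_problems(sample):
        print(problem)
```

Failing the deploy when this list is non-empty complements the vtexplain check: the lint catches a missing declaration, while vtexplain confirms the resulting plan.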

Cutover with an unhealthy shard. Root cause: switchtraffic issued while one shard has a lagging replica or a stalled copy stream. Symptoms: post-cutover reads on the affected range return stale or missing rows; the workflow reports a stream not in Running state. Mitigation: gate the switch on the health check in step 7 above; keep the write cutover last and revertible until every stream is confirmed running.

Verification

Confirm the topology is serving as designed before declaring it production-ready.

Every shard has a healthy serving set. List tablets by type and confirm each of the four ranges reports one PRIMARY and at least one healthy REPLICA:

vtctldclient GetTablets --keyspace commerce --tablet-type primary
vtctldclient GetTablets --keyspace commerce --tablet-type replica

Hot-path queries route to one shard. Inspect the plan offline against the VSchema — a query with the sharding-key predicate must produce a single-shard Route, not a Scatter:

vtexplain --vschema-file commerce.vschema.json --schema-file schema.sql \
  --shards 4 --sql "SELECT * FROM orders WHERE customer_id = 42"

Distribution is even. Watch per-shard mysql.global.status.bytes and VTGate QueriesRouted by shard. A well-designed topology shows write volume and dataset size within a narrow band across shards; a persistent outlier is an early signal of a skewed key or an uneven range, well before it becomes a latency page. Graceful degradation when a shard does go down is handled by Implementing Fallback Routing for Shard Outages.
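
The "narrow band" test can be made concrete with a small outlier check. This is a sketch under assumptions: the per-shard numbers come from whatever metrics pipeline you already run, the function name is hypothetical, and the 25% tolerance is an arbitrary starting point rather than a recommended threshold.

```python
from statistics import median


def skewed_shards(metric_by_shard: dict[str, float],
                  tolerance: float = 0.25) -> dict[str, float]:
    """Return shards whose metric deviates from the median by more than tolerance.

    The value for each reported shard is its ratio to the median (1.0 = typical).
    """
    if not metric_by_shard:
        return {}
    typical = median(metric_by_shard.values())
    if typical == 0:
        return {}  # nothing to compare against yet
    return {
        shard: value / typical
        for shard, value in metric_by_shard.items()
        if abs(value / typical - 1) > tolerance
    }


if __name__ == "__main__":
    # Illustrative write-QPS figures, not measurements.
    writes = {"-40": 100.0, "40-80": 105.0, "80-c0": 98.0, "c0-": 190.0}
    print(skewed_shards(writes))
```

The median is used instead of the mean so that one hot shard does not drag the baseline toward itself and hide its own skew.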

Related

Understanding Vitess Keyspace Partitioning Models: range, hash, and lookup distribution and how each shapes shard boundaries.
Shard Key Selection Best Practices for E-commerce: choosing a key with the cardinality and entropy to avoid write hotspots.
Configuring VTTablet for High Availability: semi-sync replication, health thresholds, and reparent workflows per shard.
VTGate Routing Architecture Deep Dive: how the router turns a predicate into a targeted or scatter plan.
Securing Multi-Tenant Sharded Databases: tenant isolation and blast-radius control across shared shards.
