Configuring VTTablet for High Availability

Getting a VTTablet to survive the loss of its primary without dropping writes, promoting a stale replica, or splitting the brain comes down to three settings that must agree: semi-synchronous replication at the MySQL layer, the tablet’s health-probe cadence, and the reparent thresholds the control plane reads before it promotes.

Where This Fits

Each shard in a sharded keyspace is a small replication group: one PRIMARY tablet, its promotable REPLICA siblings, and any RDONLY tablets, which serve batch reads and are not promotion candidates. Each is a VTTablet wrapping a mysqld. When you lay out a topology in Designing Horizontal Shard Topologies, you decide which tablets exist and where they land; this page is about the per-tablet configuration that decides whether the loss of a PRIMARY is a brief, lossless reparent or a data-loss incident. The availability of the whole keyspace is the availability of its weakest shard, so apply these settings identically to every tablet in the fleet.

The tablet is the health reporter and the failover actor. VTOrc watches the replication topology and decides when to reparent; the tablet publishes the health and GTID state VTOrc reads, and executes the mysqld-level replication changes a reparent requires. Downstream, the stateless VTGate routing layer re-points write traffic once the topology record for the shard names a new PRIMARY. If VTGate’s view of the serving graph lags the actual state of the shard, writes keep flowing to a demoted primary: the classic split-brain window.

The Concept: RPO Lives in Semi-Sync, RTO Lives in the Health Loop

Two objectives are in tension. Recovery Point Objective (how much committed data you can lose) is set at the MySQL layer: with asynchronous replication a PRIMARY acknowledges a commit before any replica has the binlog event, so a crash can lose every transaction that was acknowledged but not yet shipped. Semi-synchronous replication closes that gap: the primary blocks the commit acknowledgement until at least one replica has written the event to its relay log, so promoting a replica that holds those events loses no acknowledged commit. Recovery Time Objective (how long you are down) is set by the health loop: how fast the tablet notices the primary is gone and how long the control plane waits for a healthy candidate before promoting.

VTTablet does not activate MySQL semi-sync through a single flag of its own; it observes and enforces the replication topology while you enable the MySQL-side variables through your mysqld configuration. On MySQL 8.0.26+ these are rpl_semi_sync_source_enabled on the primary and rpl_semi_sync_replica_enabled on replicas (pre-8.0.26 releases use the rpl_semi_sync_master_* / rpl_semi_sync_slave_* names). The critical companion is rpl_semi_sync_source_timeout: it bounds how long a primary blocks waiting for a replica ACK before it degrades to async so writes can continue. Set it long enough that a transient network blip does not silently drop you to async (and reopen the RPO gap), but no longer than you are willing to have commits stall while no replica can ACK. The trade is explicit: a short timeout favors write availability, because a partitioned primary soon falls back to async and keeps accepting writes no replica can see; a long timeout favors durability, because that primary stalls instead.

Configuration

1. Enable semi-sync at the MySQL layer. Put these in the mysqld config that every tablet’s mysqld inherits, so any tablet can serve as either role after a reparent:

[mysqld]
rpl_semi_sync_source_enabled  = 1
rpl_semi_sync_replica_enabled = 1
# Block commits up to 1s waiting for a replica ACK before degrading to async.
rpl_semi_sync_source_timeout  = 1000
# Require one replica ACK per commit (Vitess manages the effective count on reparent).
rpl_semi_sync_source_wait_for_replica_count = 1

Both source and replica flags are enabled on every node because roles swap during a reparent — the demoted primary must be ready to ACK for the promoted one.
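
Because the same file has to be correct on every node, it can help to lint it before rollout. The following is a minimal sketch, not a Vitess or MySQL tool: check_semisync_cnf is a hypothetical helper that reads a my.cnf-style file on stdin and checks only the three settings discussed in this step. It assumes simple `key = value` lines and does not follow `!include` directives.

```shell
#!/bin/sh
# Hypothetical helper (not part of Vitess or MySQL): lint a my.cnf on stdin
# for the semi-sync settings from step 1. Assumes plain "key = value" lines.
check_semisync_cnf() {
  awk -F'=' '
    # MySQL accepts both 1 and ON for boolean variables.
    function on(v) { return v == "1" || toupper(v) == "ON" }
    # Skip comments and section headers such as [mysqld].
    /^[ \t]*(#|;|\[)/ { next }
    NF >= 2 {
      key = $1; val = $2
      gsub(/[ \t\r]/, "", key)
      gsub(/[ \t\r]/, "", val)
      seen[key] = val
    }
    END {
      ok = 1
      if (!on(seen["rpl_semi_sync_source_enabled"]))  { print "MISSING rpl_semi_sync_source_enabled=1";  ok = 0 }
      if (!on(seen["rpl_semi_sync_replica_enabled"])) { print "MISSING rpl_semi_sync_replica_enabled=1"; ok = 0 }
      if (!("rpl_semi_sync_source_timeout" in seen))  { print "MISSING rpl_semi_sync_source_timeout";    ok = 0 }
      if (ok) print "OK"
      exit !ok
    }
  '
}
```

Run it as `check_semisync_cnf < /path/to/my.cnf` (the path is a placeholder). A non-zero exit status means at least one setting is absent, so the check can gate a deploy script.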

2. Start the tablet with an identity and a health loop. The --health_check_interval is the single most important RTO knob: it is how often the tablet re-evaluates its own mysqld and republishes to the topology server. The default 20s is far too slow for production — a failed primary stays “healthy” in the serving graph for up to 20 seconds:

vttablet \
  --topo_implementation etcd2 \
  --topo_global_server_address etcd.internal:2379 \
  --tablet-path zone-a-0000000200 \
  --init_keyspace commerce \
  --init_shard 40-80 \
  --init_tablet_type replica \
  --health_check_interval 5s \
  --shutdown_grace_period 30s \
  --degraded_threshold 30s \
  --unhealthy_threshold 2h \
  --db_app_user vt_app \
  --db_dba_user vt_dba

Note the tablet is bootstrapped as replica, never as primary — the first primary is chosen by an explicit reparent (step 4) so the topology record is authoritative from the start.

3. Bound replication lag so a laggard stops serving reads and never gets promoted. --degraded_threshold is the lag at which a REPLICA withdraws from serving reads; a replica past it is also a poor promotion candidate. Keep it aligned with the replication lag the application tolerates on reads. It works on a different scale from rpl_semi_sync_source_timeout: the timeout governs whether a commit has reached a replica’s relay log at all, while the lag thresholds govern how far behind a replica is in applying what it has received. Both need to hold so that a lagging replica is caught before it becomes the only survivor.
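
The two lag thresholds split a replica’s state into three bands. As a rough illustration of that banding (a sketch of the idea, not Vitess’s implementation), the hypothetical classify_lag function below takes a lag and the two thresholds in whole seconds. Treating a lag exactly equal to a threshold as still inside the lower band is an assumption made here.

```shell
#!/bin/sh
# Hypothetical helper: classify_lag LAG DEGRADED UNHEALTHY (whole seconds).
# Assumption: a lag exactly at a threshold still counts as the lower band.
classify_lag() {
  lag=$1 degraded=$2 unhealthy=$3
  if [ "$lag" -gt "$unhealthy" ]; then
    echo unhealthy
  elif [ "$lag" -gt "$degraded" ]; then
    echo degraded
  else
    echo healthy
  fi
}

classify_lag 4 30 7200      # prints: healthy
classify_lag 45 30 7200     # prints: degraded
classify_lag 9000 30 7200   # prints: unhealthy
```

With the defaults from step 2 (30s and 2h, i.e. 7200 seconds), a replica 45 seconds behind is already degraded, which is the signal to keep it out of the promotion shortlist.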

4. Elect the first primary and let Vitess wire replication. Promotion is never implicit. Use vtctldclient to name the starting primary; Vitess sets up semi-sync replication from it to the shard’s other tablets:

vtctldclient PlannedReparentShard commerce/40-80 \
  --new-primary zone-a-0000000200 \
  --wait-replicas-timeout 30s

5. Tune the two reparent paths. PlannedReparentShard is the graceful path — used for maintenance and rolling restarts; it demotes the current primary cleanly, waits for the chosen replica to catch up, then promotes, so it is lossless by construction. EmergencyReparentShard is the unplanned path — the primary is already gone, so it picks the candidate with the most-advanced GTID position and promotes it:

# Graceful switchover (maintenance): current primary still reachable.
vtctldclient PlannedReparentShard commerce/40-80 \
  --new-primary zone-b-0000000201 \
  --wait-replicas-timeout 30s

# Unplanned failover: primary is down, promote the furthest-ahead replica.
vtctldclient EmergencyReparentShard commerce/40-80 \
  --wait-replicas-timeout 30s

--wait-replicas-timeout is the RTO/RPO trade-off in one flag. On the emergency path it caps how long the operation waits for replicas to apply their outstanding relay logs before choosing a primary; too short and you promote a replica that has not finished applying what it already received, too long and you extend the outage. Thirty seconds is a sane production start; tune it against how far replicas realistically lag under your write load.
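
To see how the two RTO knobs add up, a back-of-the-envelope budget is enough. The sketch below assumes the worst case is roughly one full health interval (the failure happens just after a probe) plus the full --wait-replicas-timeout; it deliberately ignores VTOrc’s own detection time and the promotion itself, so treat the result as an approximation. Both function names are hypothetical, and to_seconds only handles single-unit durations such as 5s, 2m, or 2h.

```shell
#!/bin/sh
# Hypothetical helpers for a rough RTO budget; not part of Vitess.

# to_seconds DURATION: convert a single-unit duration (5s, 2m, 2h) to seconds.
to_seconds() {
  case $1 in
    *ms|*us|*ns) echo "unsupported duration: $1" >&2; return 1 ;;
    *h) echo $(( ${1%h} * 3600 )) ;;
    *m) echo $(( ${1%m} * 60 )) ;;
    *s) echo $(( ${1%s} )) ;;
    *)  echo "unsupported duration: $1" >&2; return 1 ;;
  esac
}

# worst_case_outage HEALTH_INTERVAL WAIT_REPLICAS_TIMEOUT
# Assumption: outage <= one health interval + the full replica wait.
worst_case_outage() {
  h=$(to_seconds "$1") || return 1
  w=$(to_seconds "$2") || return 1
  echo $(( h + w ))
}

worst_case_outage 5s 30s    # prints: 35
worst_case_outage 20s 30s   # prints: 50
```

Under these assumptions, moving the health interval from 20s to 5s removes 15 seconds from the bound, while the replica wait remains the larger term.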

Configuration Reference

Flag / variable	Layer	Type	Default	Recommended (production)
`--health_check_interval`	`VTTablet`	duration	`20s`	`5s`–`10s` so a dead `PRIMARY` leaves the serving graph within RTO
`--degraded_threshold`	`VTTablet`	duration	`30s`	Set to the replication-lag SLA; a replica past it stops serving reads
`--unhealthy_threshold`	`VTTablet`	duration	`2h`	Lag beyond which the tablet is fully unhealthy; keep well above `degraded_threshold`
`--shutdown_grace_period`	`VTTablet`	duration	`0s`	`30s`+ so in-flight queries drain and transactions are handed off on restart
`--wait-replicas-timeout`	`vtctldclient` reparent	duration	`15s`	Cap on waiting for replica catch-up before promoting; tune to real lag
`rpl_semi_sync_source_enabled`	`mysqld`	bool	`0`	`1` on every node — required for zero-RPO promotion
`rpl_semi_sync_source_timeout`	`mysqld`	int (ms)	`10000`	`1000` — bound the async-degrade window without flapping on jitter
`rpl_semi_sync_source_wait_for_replica_count`	`mysqld`	int	`1`	`1` for a 3-tablet shard; raise only with enough replicas to ACK

Edge Cases and Gotchas

Split-brain from a slow health loop. The longer --health_check_interval is, the longer a demoted or dead primary can still look healthy in the serving graph, and the longer VTGate may keep sending it writes after a reparent. Keep the health interval at 5s–10s and confirm during a failover rehearsal that VTGate stops routing to the old primary within that window.
Silent degrade to async. A too-short rpl_semi_sync_source_timeout under normal network jitter drops the primary to asynchronous replication on every blip, reopening the RPO gap without any failover occurring. Watch Rpl_semi_sync_source_no_tx / Rpl_semi_sync_source_status — a primary showing OFF is running async and a crash there loses data.
Promoting a lagging replica. On the emergency path, --wait-replicas-timeout set too aggressively can promote a replica that has received but not applied recent events. Gate promotion candidacy on --degraded_threshold so a lagging replica is already out of the serving set.
Single-zone quorum. Zero-RPO config is worthless if the only ACK-ing replica shares a failure domain with the primary — losing that zone loses both. Spread each shard’s tablets across zones at provisioning time, as covered in Designing Horizontal Shard Topologies.
No grace period on restart. With --shutdown_grace_period 0s, a rolling restart kills in-flight transactions instead of draining them. Set a non-zero grace period so VTGate re-routes cleanly during planned maintenance.
DDL colliding with a reparent. A schema change that is mid-copy when a shard reparents can stall or roll back; sequence long-running migrations against the shard’s health, as covered in tracking migration progress and state machines.
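
The silent-degrade case above is easy to check mechanically. As a minimal sketch, the hypothetical semisync_verdict function reads `name value` status rows on stdin, for example from `mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_source%'"`, and flags a primary that is running async or has unacknowledged commits on record. Rpl_semi_sync_source_no_tx is cumulative since server start, so a non-zero value means a degrade happened at some point, not necessarily right now.

```shell
#!/bin/sh
# Hypothetical helper: read "Variable_name<TAB>Value" rows on stdin and
# report whether the primary is currently semi-synchronous.
semisync_verdict() {
  awk '
    $1 == "Rpl_semi_sync_source_status" { status = $2 }
    $1 == "Rpl_semi_sync_source_no_tx"  { no_tx = $2 }
    END {
      if (status != "ON") {
        print "ASYNC: semi-sync is not active on this primary"
        exit 1
      }
      if (no_tx + 0 > 0) {
        print "WARN: " no_tx " commits were not acknowledged by a replica"
        exit 0
      }
      print "OK"
    }
  '
}
```

A non-zero exit status maps to the dangerous state (status OFF), so the function can back an alert; the WARN line is informational and worth graphing over time.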

Verification

Confirm semi-sync is actually acknowledging on the primary — this is the check that proves the config is lossless, not merely present:

vtctldclient ExecuteFetchAsDBA zone-a-0000000200 \
  "SHOW STATUS LIKE 'Rpl_semi_sync_source_status'"
# Expect: Rpl_semi_sync_source_status = ON

Then rehearse the failover itself. Trigger a graceful reparent and confirm the shard names a new PRIMARY and that VTGate follows it:

vtctldclient PlannedReparentShard commerce/40-80 --new-primary zone-b-0000000201
vtctldclient GetTablets --keyspace commerce --shard 40-80 --tablet-type primary
# Expect exactly one PRIMARY, now zone-b-0000000201, and writes continuing uninterrupted.
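
The “exactly one PRIMARY” expectation can be asserted in a rehearsal script rather than read by eye. The sketch below is a hypothetical helper, single_primary, that reads tablet rows on stdin and checks that one row, and only one, has type primary and that it is the expected alias. It assumes whitespace-separated rows in the order alias, keyspace, shard, type; verify that against the output of your vtctldclient version before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: single_primary EXPECTED_ALIAS
# Assumption: stdin rows look like "<alias> <keyspace> <shard> <type> ...".
single_primary() {
  awk -v want="$1" '
    $4 == "primary" { n++; alias = $1 }
    END {
      if (n == 1 && alias == want) {
        print "OK " alias
        exit 0
      }
      print "FAIL: found " (n + 0) " primaries, last=" alias
      exit 1
    }
  '
}
```

Piping the GetTablets output for the shard into `single_primary zone-b-0000000201` then yields an exit status a rehearsal job can act on.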

If Rpl_semi_sync_source_status reads ON and a planned reparent completes with writes flowing to the new primary, the shard is configured for high availability. Graceful degradation when a shard has no promotable primary left is handled separately by Implementing Fallback Routing for Shard Outages.

Related

Designing Horizontal Shard Topologies: laying out shards and tablets across failure domains so this HA config has somewhere safe to promote.
Implementing Fallback Routing for Shard Outages: what VTGate does when a shard has no promotable primary at all.
Handling Cross-Shard Transactions in Vitess: how commit semantics interact with per-shard failover during a distributed write.
