System Design - How Database Sharding Works

Open Table of contents

Context
Partitioning Strategies
Routing: How Queries Find Their Shard
- Vitess: A proxy-based example
- TiDB: Automatic range-based sharding
The Cross-Shard Problem
- Cross-shard queries
- Cross-shard transactions
Resharding: Adding and Removing Shards
Shard Key Selection Guide
- The celebrity problem
Sharding vs. Replication
Summary
References

Context

A single database server has hard limits: disk capacity, memory for indexes, CPU for queries, and network bandwidth. When your data or traffic outgrows one machine, you have two options:

Vertical scaling — buy a bigger machine (expensive, has a ceiling)
Horizontal scaling — spread data across multiple machines

Sharding is the technique of splitting a single logical database into multiple pieces (shards), each living on a different server. Each shard holds a subset of the data and handles a subset of the queries.

              Single Database (before sharding)

  +-------------------------------------------------------+
  |                    All 100M rows                       |
  |                    All queries                         |
  |                   (one machine)                        |
  +-------------------------------------------------------+

              Sharded Database (after)

  +------------------+  +------------------+  +------------------+
  |   Shard 0        |  |   Shard 1        |  |   Shard 2        |
  |   rows 0-33M    |  |   rows 34M-66M   |  |   rows 67M-100M  |
  |   (machine A)    |  |   (machine B)    |  |   (machine C)    |
  +------------------+  +------------------+  +------------------+

Sharding sounds simple in principle, but the devil is in the details: how do you decide which row goes where? How do you handle queries that span shards? What happens when you add or remove shards?

Partitioning Strategies

The shard key (or partition key) is the column used to determine which shard holds a given row. Choosing the right shard key is the single most important sharding decision.

Hash-Based Partitioning

Compute a hash of the shard key and map it to a shard number:

  shard_id = hash(shard_key) % num_shards

  Example with 4 shards:
  hash("user_123") = 7849201  -->  7849201 % 4 = 1  -->  Shard 1
  hash("user_456") = 2938471  -->  2938471 % 4 = 3  -->  Shard 3
  hash("user_789") = 5520000  -->  5520000 % 4 = 0  -->  Shard 0

Pros:

Even distribution (assuming a good hash function)
Simple to compute
No hotspot from sequential keys (e.g., auto-increment IDs won’t all land on one shard)

Cons:

Range queries become expensive (need to scatter to all shards)
Adding/removing shards requires rehashing most keys (unless using consistent hashing)

Who uses it: DynamoDB, Redis Cluster, Cassandra (with Murmur3 partitioner)

Range-Based Partitioning

Assign contiguous ranges of the shard key to each shard:

  Shard 0:  user_id [0, 1000000)
  Shard 1:  user_id [1000000, 2000000)
  Shard 2:  user_id [2000000, 3000000)

  Or for strings:
  Shard 0:  last_name [A-H]
  Shard 1:  last_name [I-P]
  Shard 2:  last_name [Q-Z]

Pros:

Range queries on the shard key are efficient (hit one shard or a small number)
Easy to understand and debug
Splitting a hot shard is straightforward (just split the range)

Cons:

Hotspots if access patterns cluster (e.g., recent timestamps always hit the last shard)
Data can become skewed if the key distribution is uneven

Who uses it: TiDB/TiKV (Regions are contiguous key ranges), CockroachDB (ranges), Google Spanner, HBase

Directory-Based Partitioning

Maintain a lookup table that maps each shard key (or group of keys) to a shard:

  +------------------+----------+
  | Shard Key Range  | Shard ID |
  +------------------+----------+
  | user_1 - user_99 |    0     |
  | user_100 - 500   |    1     |
  | user_501 - 999   |    2     |
  | user_1000 - 5000 |    0     |  <-- wraps around
  +------------------+----------+

  The directory itself is stored in a metadata service
  (ZooKeeper, etcd, a config database, etc.)

Pros:

Maximum flexibility — can move individual keys between shards
Can handle uneven data distribution by assigning more ranges to less-loaded shards

Cons:

Directory is a single point of failure (must be highly available)
Extra lookup on every query
Directory can become a bottleneck under high traffic

Who uses it: Vitess (VSchema), MongoDB (config servers), Azure Cosmos DB

Compound / Hierarchical Keys

Many systems use a two-level key: a partition key (determines the shard) plus a sort key (determines order within the shard):

  DynamoDB example:
  +---------------------+------------------+--------+
  | Partition Key       | Sort Key         | Value  |
  +---------------------+------------------+--------+
  | user_123           | order_2024_001   | {...}  |
  | user_123           | order_2024_002   | {...}  |
  | user_123           | order_2024_003   | {...}  |
  | user_456           | order_2024_001   | {...}  |
  +---------------------+------------------+--------+

  All items for user_123 live on the same shard
  --> efficient range queries within a user
  --> but user_123 can become a hotspot if very active

Routing: How Queries Find Their Shard

Once data is partitioned, the system needs to route each query to the correct shard. There are three common patterns:

  Pattern 1: Client-side routing

  +--------+                    +----------+
  | Client |---knows routing--->| Shard 2  |
  +--------+    table           +----------+
  (client library holds the mapping, e.g., Redis Cluster)


  Pattern 2: Proxy-based routing

  +--------+     +-------+     +----------+
  | Client |---->| Proxy  |---->| Shard 2  |
  +--------+     +-------+     +----------+
  (proxy knows the mapping, e.g., Vitess VTGate, ProxySQL)


  Pattern 3: Node-level routing (gossip)

  +--------+     +----------+     +----------+
  | Client |---->| Any Node |---->| Shard 2  |
  +--------+     +----------+     +----------+
  (any node can redirect to the correct shard, e.g., Cassandra)

Vitess: A proxy-based example

Vitess (used by YouTube, Slack, GitHub) routes MySQL queries through VTGate:

  Application
      |
      v
  +----------+    VSchema (mapping rules)
  | VTGate   |----+
  +----+-----+    |  "users table sharded by user_id,
       |          |   hash-based, 256 shards"
       v          |
  +----+-----+    |
  | VTTablet |<---+  (one per shard, wraps a MySQL instance)
  +----------+

  Query: SELECT * FROM users WHERE user_id = 123
  1. VTGate looks up VSchema: users is sharded by user_id
  2. Computes shard: hash(123) % 256 = shard_42
  3. Forwards query to VTTablet for shard_42
  4. Returns result to application

For queries that span multiple shards:

  Query: SELECT * FROM users WHERE age > 30

  VTGate must:
  1. Send query to ALL 256 shards (scatter)
  2. Collect partial results from each
  3. Merge and return to client (gather)

  This is expensive -- O(num_shards) fanout
  --> this is why shard key choice matters so much

TiDB: Automatic range-based sharding

TiDB shards data transparently. The SQL layer (TiDB server) is stateless; the storage layer (TiKV) holds data in Regions (contiguous key ranges, typically 96-256MB each):

  TiDB Server (stateless SQL)
      |
      v
  +----------+
  |    PD    |  <-- metadata: which Region holds which key range
  +----------+
      |
      v
  +--------+  +--------+  +--------+  +--------+
  |Region 1|  |Region 2|  |Region 3|  |Region 4|  ...
  |[a, f)  |  |[f, m)  |  |[m, t)  |  |[t, z)  |
  +--------+  +--------+  +--------+  +--------+

  When Region 2 grows > 256MB, PD splits it:
  Region 2a: [f, j)    Region 2b: [j, m)

  When Regions are small and adjacent, PD merges them.
  This happens automatically -- applications never think about shards.

The key encoding for a table row in TiKV:

  Key format: t{tableID}_r{rowID}

  Table 42, row 1000: t42_r1000
  Table 42, row 1001: t42_r1001

  These are lexicographically ordered, so rows with adjacent
  IDs land in the same Region (range-based partitioning).

  Index entries:
  t{tableID}_i{indexID}_{indexValue}_{rowID}

The Cross-Shard Problem

Sharding creates challenges when operations span multiple shards.

Cross-shard queries

  Query: SELECT u.name, o.total
         FROM users u JOIN orders o ON u.id = o.user_id
         WHERE o.created_at > '2026-01-01'

  If users and orders are sharded by user_id:
  --> Join is local (same shard has both user and their orders)
  --> The WHERE on created_at still needs scatter-gather

  If users is sharded by user_id and orders by order_id:
  --> Join requires cross-shard lookup for every row
  --> Much more expensive

Co-location (putting related data on the same shard) is the primary strategy to avoid cross-shard joins. This is why shard key selection is so critical — it determines which queries are efficient.

Cross-shard transactions

A single transaction that touches multiple shards requires a distributed commit protocol (usually two-phase commit):

  Transaction: transfer $100 from user A (shard 1) to user B (shard 3)

  Coordinator (e.g., TiDB server)
       |
       +---> Shard 1: PREPARE (lock A's row, deduct $100)
       |         |
       +---> Shard 3: PREPARE (lock B's row, add $100)
       |         |
       |    both respond "PREPARED"
       |         |
       +---> Shard 1: COMMIT
       +---> Shard 3: COMMIT

  If either shard fails to prepare --> ABORT both

Cross-shard transactions are significantly slower than single-shard ones (2-5x latency overhead due to extra network round-trips and lock holding time). Systems optimize by:

Encouraging shard-local transactions via co-location
Using optimistic concurrency control to reduce lock duration
Batching commits across shards

Resharding: Adding and Removing Shards

The hardest operational challenge in sharding is changing the number of shards. With naive hash partitioning (hash % N), changing N remaps almost every key.

Consistent Hashing

Maps keys to a ring rather than a fixed modulus:

  Hash ring (0 to 2^32)

         Shard A
           |
     0 ----+---- 2^32
    /                \
   /                  \
  Shard D          Shard B
   \                  /
    \                /
     \              /
      +----+----+
           |
         Shard C

  Adding Shard E between B and C:
  - Only keys between B and E move to E
  - All other keys stay put
  - Minimal data movement (~1/N of total keys)

See the consistent hashing post for the full algorithm.

TiDB’s approach: automatic splitting

TiDB avoids the resharding problem entirely by using very small ranges (Regions) that split and merge automatically:

  Step 1: Data grows
  Region [a, z): 300MB  -->  too large

  Step 2: PD splits at midpoint
  Region [a, m): 150MB
  Region [m, z): 150MB

  Step 3: PD schedules the new Region to a less-loaded TiKV node
  (Raft group membership change moves data)

  No "resharding event" -- it's continuous and automatic

Vitess resharding

Vitess supports online resharding without downtime using a workflow called VReplication:

  Before: 2 shards [-80, 80-]
  After:  4 shards [-40, 40-80, 80-c0, c0-]

  1. Create target shards (empty)
  2. Start VReplication streams: copy existing data + replicate changes
  3. Once caught up, do a brief write pause (< 1 second)
  4. Switch reads to new shards
  5. Switch writes to new shards
  6. Delete old shards

  +----------+     +---------+  +---------+
  | Shard -80|---->|Shard -40|  |Shard    |
  |          |---->|         |  |40-80    |
  +----------+     +---------+  +---------+

  +----------+     +---------+  +---------+
  |Shard 80- |---->|Shard    |  |Shard    |
  |          |---->|80-c0    |  |c0-      |
  +----------+     +---------+  +---------+

Shard Key Selection Guide

Choosing the wrong shard key leads to hotspots, expensive cross-shard queries, or data skew. Here’s a decision framework:

  +--------------------------------------------------+
  |  What are your most frequent queries?            |
  +--------------------------------------------------+
           |
           v
  +--------------------------------------------------+
  |  Which column appears in WHERE/JOIN most often?  |
  |  That's your shard key candidate.                |
  +--------------------------------------------------+
           |
           v
  +--------------------------------------------------+
  |  Does it have high cardinality?                  |
  |  (many distinct values)                          |
  |  Low cardinality = skewed shards                 |
  +--------------------------------------------------+
           |
           v
  +--------------------------------------------------+
  |  Is the access pattern uniform?                  |
  |  (celebrity problem: one key = 80% of traffic)   |
  |  If not, consider composite key or key salting   |
  +--------------------------------------------------+

Workload	Good shard key	Bad shard key	Why
Social media (user-centric)	user_id	post_id	Most queries are per-user
E-commerce	customer_id	order_date	Customer’s orders stay together
Time-series / IoT	device_id + time bucket	timestamp alone	Timestamp = all writes hit latest shard
Multi-tenant SaaS	tenant_id	auto-increment id	Tenant isolation, no cross-tenant joins

The celebrity problem

If one shard key value has disproportionate traffic (e.g., a viral post, a mega-tenant), that shard becomes a hotspot:

  Normal distribution:
  Shard 0: |||||||  (1000 QPS)
  Shard 1: |||||||  (1000 QPS)
  Shard 2: |||||||  (1000 QPS)

  With celebrity user on Shard 1:
  Shard 0: |||||||  (1000 QPS)
  Shard 1: |||||||||||||||||||||||||  (50,000 QPS)  <-- overwhelmed
  Shard 2: |||||||  (1000 QPS)

Solutions:

Key salting: Append a random suffix (user_123_0, user_123_1, …) to spread one key across shards. Reads must scatter-gather.
Dedicated shard: Move the hot key to its own, beefier shard.
Caching: Put a cache layer in front (Redis, Memcached) to absorb read traffic.
Application-level splitting: Split the hot entity’s data into sub-entities.

Sharding vs. Replication

These are complementary, not alternatives:

  Sharding = splits DATA across nodes (each shard has different data)
  Replication = copies DATA across nodes (each replica has same data)

  Typical production setup: BOTH

  +------------------+  +------------------+  +------------------+
  |   Shard 0        |  |   Shard 1        |  |   Shard 2        |
  | +-----------+    |  | +-----------+    |  | +-----------+    |
  | | Primary   |    |  | | Primary   |    |  | | Primary   |    |
  | +-----------+    |  | +-----------+    |  | +-----------+    |
  | | Replica 1 |    |  | | Replica 1 |    |  | | Replica 1 |    |
  | +-----------+    |  | +-----------+    |  | +-----------+    |
  | | Replica 2 |    |  | | Replica 2 |    |  | | Replica 2 |    |
  | +-----------+    |  | +-----------+    |  | +-----------+    |
  +------------------+  +------------------+  +------------------+

  Sharding gives: write scalability, storage scalability
  Replication gives: read scalability, fault tolerance

Summary

  When to shard:
  - Single node can't hold all data
  - Single node can't handle write throughput
  - You need geographic data locality

  When NOT to shard:
  - You can still vertically scale (cheaper, simpler)
  - Read replicas solve your problem (read-heavy workload)
  - Your data fits in one node with room to grow
  - You need lots of cross-shard queries (sharding won't help)

Sharding adds significant complexity: cross-shard queries, distributed transactions, resharding operations, and operational overhead. The systems that do it well (TiDB, CockroachDB, Vitess, DynamoDB) invest heavily in automating these challenges so application developers can focus on their domain logic rather than data placement.

References

Designing Data-Intensive Applications, Ch. 6 (Kleppmann) book
Vitess architecture and VReplication docs
DynamoDB partition key design docs
TiDB Region splitting and scheduling docs
CockroachDB range partitioning docs
Consistent hashing (Karger et al.) paper
Cassandra data partitioning docs
MongoDB sharding architecture docs