Table of contents
Open Table of contents
- Context
- How the Hash Tree is Built
- Merkle Proofs: Verifying Without the Full Tree
- Application 1: Git Object Storage
- Application 2: Bitcoin Block Verification
- Application 3: Anti-Entropy in Distributed Databases
- Application 4: Certificate Transparency
- Complexity Summary
- A Minimal Implementation
- Why Not Just Hash Everything Together?
- References
Context
Imagine you download a 4 GB file from the internet. How do you know it arrived intact? The simplest approach is to hash the entire file and compare against a known checksum. But what if only one byte in the middle is corrupted — do you re-download all 4 GB?
This is the problem Ralph Merkle solved in 1979 with a data structure now called a Merkle tree (or hash tree). The idea is elegant: organize your data into a binary tree where every leaf is a hash of a data block, and every internal node is the hash of its children. The single hash at the root — the Merkle root — acts as a fingerprint of the entire dataset.
Merkle Tree (4 data blocks)
+----------+
| Root Hash|
| H(AB|CD)|
+----+-----+
|
+-------------+-------------+
| |
+----+----+ +-----+----+
| H(AB) | | H(CD) |
| H(A|B) | | H(C|D) |
+----+----+ +-----+----+
| |
+-----+-----+ +------+-----+
| | | |
+--+--+ +--+--+ +--+--+ +--+--+
| H(A)| | H(B)| | H(C)| | H(D)|
+--+--+ +--+--+ +--+--+ +--+--+
| | | |
+--+--+ +--+--+ +--+--+ +--+--+
|Blk A| |Blk B| |Blk C| |Blk D|
+-----+ +-----+ +-----+ +-----+
If block C is corrupted, you don’t need to re-verify or re-download blocks A, B, and D. You only need to check the path from block C up to the root — that’s hashes instead of .
How the Hash Tree is Built
Building a Merkle tree is a bottom-up process:
Step 1: Partition data into blocks. Split your data into fixed-size chunks (e.g., 256 KB each). Each chunk is a leaf.
Step 2: Hash each leaf. Apply a cryptographic hash function (SHA-256, for instance) to each block:
H(A) = SHA256(block_A)
H(B) = SHA256(block_B)
H(C) = SHA256(block_C)
H(D) = SHA256(block_D)
Step 3: Pair and hash upward. Concatenate adjacent hashes and hash again:
H(AB) = SHA256( H(A) || H(B) )
H(CD) = SHA256( H(C) || H(D) )
Step 4: Repeat until one root remains:
Root = SHA256( H(AB) || H(CD) )
If the number of leaves is odd, the last leaf is duplicated (or carried up alone, depending on the implementation).
Merkle Proofs: Verifying Without the Full Tree
The power of a Merkle tree is that you can prove a specific block belongs to the dataset by providing only hashes — called a Merkle proof (or inclusion proof).
Suppose a client knows the Merkle root and wants to verify block C is authentic. The server provides:
Proof for block C:
1. H(D) — sibling at leaf level
2. H(AB) — sibling at the next level up
Verification:
1. Compute H(C) = SHA256(block_C)
2. Compute H(CD) = SHA256( H(C) || H(D) )
3. Compute Root' = SHA256( H(AB) || H(CD) )
4. Compare Root' == known Root
Root ← compare this
/ \
H(AB) H(CD) ← compute this
[given] / \
H(C) H(D)
↑ [given]
compute
from data
With 1 million blocks, verification requires only ~20 hashes (since ). This makes Merkle proofs extremely efficient for large datasets.
Application 1: Git Object Storage
Git stores everything (blobs, trees, commits) as content-addressed objects. A tree object in git is essentially a Merkle tree node:
$ git cat-file -p HEAD^{tree}
100644 blob a1b2c3d4... README.md
100644 blob e5f6a7b8... main.go
040000 tree 9c8d7e6f... pkg/
Each blob hash is a leaf (SHA-1 of the file content). Each tree hash is computed from its entries. The commit hash includes the root tree hash.
commit abc123
|
tree def456 (root of file tree)
/ \
tree 9c8d.. blob a1b2.. blob e5f6..
(pkg/) (README.md) (main.go)
/ \
blob.. blob..
When you git push, the remote can quickly determine which objects it already has by comparing tree hashes top-down. If the root tree hash matches, nothing changed. If it differs, git walks down only the branches that changed — transferring minimal data.
Application 2: Bitcoin Block Verification
Every Bitcoin block header contains a Merkle root of all transactions in that block. This enables Simplified Payment Verification (SPV) — lightweight clients can verify a transaction was included in a block without downloading every transaction:
Block Header
+------------------+
| prev_block_hash |
| timestamp |
| nonce |
| merkle_root -----+---> Root of transaction tree
+------------------+
Transactions in block (as Merkle leaves):
tx0, tx1, tx2, tx3, tx4, tx5, tx6, tx7
An SPV client stores only block headers (~80 bytes each). To verify tx5 was included, it requests a Merkle proof from a full node: about 3 sibling hashes for a block of 8 transactions. The client computes up to the root and compares against the block header’s merkle_root.
Bitcoin uses double-SHA256: SHA256(SHA256(data)) at each level, and duplicates the last element when the count is odd.
Application 3: Anti-Entropy in Distributed Databases
Distributed databases like Apache Cassandra and Amazon DynamoDB use Merkle trees for anti-entropy repair — detecting and fixing data inconsistencies between replicas.
Replica A Replica B
+--------+ +--------+
| Root_A | -------- compare ------> | Root_B |
+---+----+ +---+----+
| Root_A != Root_B |
+---+----+ +---+----+
|H(AB)_A | H(AB) match |H(AB)_B | ← skip this subtree
+--------+ +--------+
|H(CD)_A | H(CD) mismatch |H(CD)_B | ← drill down
+---+----+ +---+----+
| |
+---+---+ +---+---+
|H(C)_A | mismatch |H(C)_B | ← sync block C
+-------+ +-------+
|H(D)_A | match |H(D)_B | ← skip
+-------+ +-------+
Each replica builds a Merkle tree over its key ranges. They exchange only root hashes first. If roots match, the replicas are consistent — zero data transfer needed. If not, they walk down the tree, exchanging hashes at each level until they find exactly which key ranges differ.
From Cassandra’s source (MerkleTree.java):
public class MerkleTree implements Serializable {
private Hashable root;
private final int hashdepth; // max depth of tree
private final long maxsize; // max number of leaves
// Each inner node stores hash of its children
static class Inner extends Hashable {
private Hashable left;
private Hashable right;
@Override
public void hash(byte[] hashes) {
MessageDigest md = MessageDigest.getInstance("SHA-256");
md.update(left.hash());
md.update(right.hash());
this.digest = md.digest();
}
}
}
Application 4: Certificate Transparency
Google’s Certificate Transparency (CT) project uses Merkle trees to create a public, append-only log of TLS certificates. Anyone can verify that a certificate was logged (inclusion proof) or that the log hasn’t been tampered with (consistency proof).
A consistency proof shows that a new tree is an extension of an older tree — no entries were modified or removed:
Old tree (4 leaves): New tree (6 leaves):
R_old R_new
/ \ / \
H(AB) H(CD) H(ABCD) H(EF)
/ \ / \ / \ / \
A B C D H(AB) H(CD) E F
/ \ / \
A B C D
The consistency proof from old root to new root requires only a few hashes — proving the first 4 leaves are unchanged in the new tree.
Complexity Summary
| Operation | Time | Space for proof |
|---|---|---|
| Build tree | — | |
| Compute root | — | |
| Generate inclusion proof | hashes | |
| Verify inclusion proof | — | |
| Generate consistency proof | hashes | |
| Find differing blocks (anti-entropy) | — |
Where = number of leaves, = number of differing blocks.
A Minimal Implementation
Here’s a simplified Merkle tree in Python to make the algorithm concrete:
import hashlib
def sha256(data: bytes) -> bytes:
return hashlib.sha256(data).digest()
def build_merkle_tree(blocks: list[bytes]) -> list[list[bytes]]:
"""Build tree bottom-up. Returns list of levels (leaves first)."""
if not blocks:
return []
# Level 0: hash each block
current_level = [sha256(b) for b in blocks]
tree = [current_level]
# Build upward until one root
while len(current_level) > 1:
next_level = []
for i in range(0, len(current_level), 2):
left = current_level[i]
# Duplicate last if odd number
right = current_level[i + 1] if i + 1 < len(current_level) else left
next_level.append(sha256(left + right))
current_level = next_level
tree.append(current_level)
return tree
def get_merkle_proof(tree: list[list[bytes]], index: int) -> list[tuple[bytes, str]]:
"""Return sibling hashes needed to verify leaf at index."""
proof = []
for level in tree[:-1]: # skip root level
if index % 2 == 0:
sibling_idx = index + 1
direction = "right"
else:
sibling_idx = index - 1
direction = "left"
if sibling_idx < len(level):
proof.append((level[sibling_idx], direction))
else:
proof.append((level[index], "right")) # duplicate
index //= 2
return proof
def verify_proof(block: bytes, proof: list[tuple[bytes, str]], root: bytes) -> bool:
"""Verify a block belongs to the tree with the given root."""
current = sha256(block)
for sibling, direction in proof:
if direction == "right":
current = sha256(current + sibling)
else:
current = sha256(sibling + current)
return current == root
Usage:
blocks = [b"tx0", b"tx1", b"tx2", b"tx3"]
tree = build_merkle_tree(blocks)
root = tree[-1][0]
# Prove tx2 is in the tree
proof = get_merkle_proof(tree, 2)
assert verify_proof(b"tx2", proof, root) # True
# Tampered data fails
assert not verify_proof(b"tx2_fake", proof, root) # False
Why Not Just Hash Everything Together?
You might wonder: why not concatenate all blocks and hash once? That gives a single checksum too. The difference is granularity:
Flat hash: H(A || B || C || D)
- Verify: must have ALL data, O(n) work
- Pinpoint corruption: impossible without all data
- Sync: must transfer everything
Merkle tree: H( H(A|B) || H(C|D) )
- Verify one block: O(log n) proof
- Pinpoint corruption: walk the tree O(log n)
- Sync: transfer only differing blocks
The tree structure gives you logarithmic verification and surgical data repair — critical when your dataset is terabytes across thousands of machines.
References
- Merkle, R. (1979). “A Certified Digital Signature”. Advances in Cryptology — CRYPTO ‘89.
- Nakamoto, S. (2008). Bitcoin: A Peer-to-Peer Electronic Cash System. Section 8: Simplified Payment Verification.
- DeCandia, G. et al. (2007). “Dynamo: Amazon’s Highly Available Key-value Store”. SOSP ‘07.
- Laurie, B. et al. (2013). RFC 6962 — Certificate Transparency. IETF.
- Apache Cassandra source:
MerkleTree.java. - Git internals: Pro Git — Git Objects.