System Design - How Prometheus TSDB (Time-Series Database) Works

Table of contents

Context
The Two-Layer Storage Model
The Head Block: Where Writes Land
The Write-Ahead Log (WAL)
- WAL Checkpointing
Persistent Blocks: On-Disk Format
- The Index File
- Chunk Files
Compaction: Merging Blocks
Query Path: How Reads Work
Memory-Mapping: Balancing Memory and Disk
Handling High Cardinality
Retention and Deletion
Performance Characteristics
How It All Fits Together
References

Context

Monitoring systems need to answer questions like “what was the CPU usage of service X over the last hour?” or “how many requests per second hit endpoint Y yesterday?” These questions require storing time-series data — sequences of (timestamp, value) pairs associated with a metric name and labels.

Prometheus is one of the most widely adopted open-source monitoring systems. It scrapes metrics from targets (your services) at regular intervals (15 seconds is a common setting; the built-in default is 1 minute), stores the data locally, and serves queries via its PromQL language. The storage engine that makes this possible is called the TSDB (Time-Series Database).

The TSDB design was introduced in Prometheus 2.0 (2017), replacing the earlier file-per-series approach. It draws on ideas from LSM (log-structured merge) trees and columnar databases, adapted to the access patterns of monitoring data: high-throughput writes across many concurrent series, append-only ingestion, and queries that scan contiguous time ranges.

              Prometheus Architecture (simplified)

  +-----------+   +-----------+   +-----------+
  |  Service  |   |  Service  |   |  Service  |   your applications
  |  /metrics |   |  /metrics |   |  /metrics |   expose metrics
  +-----+-----+   +-----+-----+   +-----+-----+
        |               |               |
        +-------+-------+-------+-------+
                |               |
                v               v
         +-----------------------------+
         |        Prometheus           |
         |                             |
         |  +-------+   +-----------+  |
         |  | Scrape|   |  PromQL   |  |
         |  | Loop  |   |  Engine   |  |
         |  +---+---+   +-----+-----+  |
         |      |              |        |
         |      v              v        |
         |  +----------------------+    |
         |  |        TSDB          |    |
         |  |  (the storage layer) |    |
         |  +----------------------+    |
         +-----------------------------+

  Writes: up to millions of samples/sec
  Reads: range queries over hours/days

Let’s dig into how the TSDB actually stores and retrieves this data.

The Two-Layer Storage Model

Prometheus TSDB divides time into blocks, each covering a fixed time range (default 2 hours). At any moment, there are two kinds of storage:

Head block — the current, mutable in-memory block receiving live writes.
Persistent blocks — older, immutable, on-disk blocks that have been “cut” from the head.

  Time -->
  |<-- 2h -->|<-- 2h -->|<-- 2h -->|<-- ongoing -->|
  +-----------+-----------+-----------+--------------+
  |  Block 1  |  Block 2  |  Block 3  |    Head     |
  |  (disk)   |  (disk)   |  (disk)   |  (memory)   |
  | immutable | immutable | immutable |   mutable    |
  +-----------+-----------+-----------+--------------+

The head is not cut at exactly 2 hours. Once it spans about 1.5 times the configured range (roughly 3 hours with the default), Prometheus “cuts” it: the oldest 2 hours are serialized to disk as a new persistent block and truncated from the head, while the more recent data stays in memory. This is similar to how an LSM tree flushes its memtable to an SSTable.

The Head Block: Where Writes Land

Every scraped sample first enters the head block. The head is an in-memory data structure optimized for concurrent appends from many series simultaneously.

Series and Chunks

Each unique combination of metric name + labels is a series. For example, http_requests_total{method="GET", handler="/api"} is one series. The head maintains a hash map from label sets to series objects:

  Head Block
  +---------------------------------------------------+
  |                                                   |
  |  seriesHashMap                                    |
  |  +---------------------------------------------+ |
  |  | hash(labels) --> *memSeries                  | |
  |  |                                              | |
  |  |  "http_requests{method=GET}" --> series_1    | |
  |  |  "http_requests{method=POST}" --> series_2   | |
  |  |  "node_cpu_seconds{cpu=0}" --> series_3      | |
  |  |  ...                                         | |
  |  +---------------------------------------------+ |
  |                                                   |
  |  Each memSeries:                                  |
  |  +-------------------+                            |
  |  | labels            |                            |
  |  | ref (series ID)   |                            |
  |  | headChunk ------->+--> active chunk (XOR enc)  |
  |  | prevChunks []     |    being appended to       |
  |  +-------------------+                            |
  +---------------------------------------------------+

Each memSeries holds a chain of chunks. A chunk is a compressed buffer of (timestamp, value) pairs for a contiguous time range. The active chunk (called headChunk) accepts new appends. When it’s full or has covered enough time (~120 samples or 2 hours range within the head), a new chunk starts.
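
The chunk-cutting rule above can be sketched roughly as follows. This is a minimal illustration, not Prometheus's code: the memSeries, chunk, and sample types here are hypothetical, samples are stored uncompressed, timestamps are assumed to be non-negative milliseconds, and the 120-sample and 2-hour limits are taken from the description above.

```go
package main

import "fmt"

const (
	samplesPerChunk = 120                // target samples per chunk (from the text)
	chunkRangeMs    = 2 * 60 * 60 * 1000 // 2 hours in milliseconds
)

type sample struct {
	t int64
	v float64
}

type chunk struct {
	minT, maxT int64
	samples    []sample
}

// memSeries is a hypothetical stand-in for the real structure.
type memSeries struct {
	headChunk  *chunk
	prevChunks []*chunk
}

// nextBoundary returns the end of the 2-hour window that contains t.
func nextBoundary(t int64) int64 {
	return (t/chunkRangeMs + 1) * chunkRangeMs
}

// append adds a sample, cutting a new head chunk when the current one
// is full or the sample crosses the chunk's 2-hour boundary.
func (s *memSeries) append(t int64, v float64) {
	c := s.headChunk
	if c == nil || len(c.samples) >= samplesPerChunk || t >= nextBoundary(c.minT) {
		if c != nil {
			s.prevChunks = append(s.prevChunks, c)
		}
		c = &chunk{minT: t}
		s.headChunk = c
	}
	c.samples = append(c.samples, sample{t, v})
	c.maxT = t
}

func main() {
	s := &memSeries{}
	for i := 0; i < 250; i++ {
		s.append(int64(i)*15000, float64(i)) // one sample every 15 s
	}
	fmt.Println("completed chunks:", len(s.prevChunks))
	fmt.Println("samples in head chunk:", len(s.headChunk.samples))
}
```

With 250 samples at a 15-second interval, the series ends up with two completed chunks of 120 samples and a head chunk holding the remaining 10.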

The Appender Interface

Writes go through the Appender interface. The full interface is defined in storage/interface.go; a simplified version looks like this:

type Appender interface {
    Append(ref storage.SeriesRef, l labels.Labels, t int64, v float64) (storage.SeriesRef, error)
    Commit() error
    Rollback() error
}

A typical write cycle:

The scrape loop collects all samples from one target.
It opens an Appender from the head.
It calls Append() for each sample — the sample is buffered.
It calls Commit() — all buffered samples are written to the WAL and then applied to the in-memory chunks atomically.

The ref (series reference) is a uint64 that acts as a fast lookup shortcut. On the first append for a series, you pass 0 and get back a reference. Subsequent appends reuse this reference to skip label hashing.
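
The write cycle above can be sketched with a toy appender. Everything here is hypothetical (the head and appender types, string-encoded label sets, and a slice standing in for the WAL); it only illustrates the buffer-then-commit pattern and the reuse of the series reference.

```go
package main

import (
	"errors"
	"fmt"
)

type pendingSample struct {
	ref uint64
	t   int64
	v   float64
}

// head is a toy stand-in for the in-memory head block.
type head struct {
	nextRef uint64
	refs    map[string]uint64          // label set (as a string) -> series ref
	samples map[uint64][]pendingSample // committed samples per series
	wal     []pendingSample            // stand-in for the on-disk WAL
}

func newHead() *head {
	return &head{refs: map[string]uint64{}, samples: map[uint64][]pendingSample{}}
}

type appender struct {
	h       *head
	pending []pendingSample
}

// Append buffers a sample. Passing ref == 0 triggers a lookup by labels;
// a non-zero ref skips that lookup entirely.
func (a *appender) Append(ref uint64, labels string, t int64, v float64) (uint64, error) {
	if ref == 0 {
		if labels == "" {
			return 0, errors.New("empty label set")
		}
		r, ok := a.h.refs[labels]
		if !ok {
			a.h.nextRef++
			r = a.h.nextRef
			a.h.refs[labels] = r
		}
		ref = r
	}
	a.pending = append(a.pending, pendingSample{ref, t, v})
	return ref, nil
}

// Commit logs the batch first, then makes it visible to readers.
func (a *appender) Commit() error {
	a.h.wal = append(a.h.wal, a.pending...)
	for _, s := range a.pending {
		a.h.samples[s.ref] = append(a.h.samples[s.ref], s)
	}
	a.pending = nil
	return nil
}

// Rollback discards the buffered samples.
func (a *appender) Rollback() error {
	a.pending = nil
	return nil
}

func main() {
	h := newHead()
	app := &appender{h: h}
	ref, err := app.Append(0, `http_requests_total{method="GET"}`, 1000, 1)
	if err != nil {
		panic(err)
	}
	if _, err := app.Append(ref, "", 16000, 2); err != nil {
		panic(err)
	}
	fmt.Println("visible before commit:", len(h.samples[ref]))
	if err := app.Commit(); err != nil {
		panic(err)
	}
	fmt.Println("visible after commit:", len(h.samples[ref]))
}
```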

Chunk Encoding: XOR Compression

Raw time-series data is highly compressible because adjacent samples tend to have similar timestamps (regular scrape intervals) and similar values (metrics don’t jump wildly). Prometheus uses delta-of-delta encoding for timestamps and XOR encoding for values, both inspired by Facebook’s Gorilla paper (2015).

The idea for values:

  Sample values:   100.5   100.7   100.6   100.8   100.5

  XOR with previous:
    100.5 XOR 100.7 = small number (few bits differ)
    100.7 XOR 100.6 = small number
    ...

  Instead of storing 64-bit floats, store:
    - First value: full 64 bits
    - Subsequent: XOR with previous, then store only the meaningful bits

For timestamps:

  Timestamps (ms):  1000  1015  1030  1045  1060

  Delta:                   15    15    15    15
  Delta-of-delta:           0     0     0     0

  If delta-of-delta is 0: store a single "0" bit
  Regular scrape intervals --> almost all deltas are identical
  --> almost all delta-of-deltas are 0
  --> 1 bit per sample for timestamps!

This encoding achieves roughly 1.37 bytes per sample on typical monitoring data (the figure reported in the Gorilla paper), down from 16 bytes (8 for timestamp + 8 for value). A simplified view of the appender in tsdb/chunkenc/xor.go:

func (a *xorAppender) Append(t int64, v float64) {
    if a.numSamples == 0 {
        // First sample: store full timestamp and value
        a.b.WriteBits(uint64(t), 64)
        a.b.WriteBits(math.Float64bits(v), 64)
    } else {
        a.writeTimestamp(t)
        a.writeValue(v)
    }
    a.numSamples++
    a.t = t
    a.v = v
}

func (a *xorAppender) writeValue(v float64) {
    vDelta := math.Float64bits(v) ^ math.Float64bits(a.v)

    if vDelta == 0 {
        // Same value as previous: store single 0 bit
        a.b.WriteBit(zero)
        return
    }
    // ... store leading zeros, meaningful bits, trailing zeros
}

The Write-Ahead Log (WAL)

The head block is in memory. If Prometheus crashes, all recent data would be lost. The WAL (Write-Ahead Log) solves this — every sample is written to a sequential log file on disk before being applied to memory.

  Write Path

  Scrape loop
       |
       v
  +----------+     +------------------+
  | Appender |---->|  WAL (on disk)   |  sequential writes
  | .Commit()|     |  segments/       |  (fast, append-only)
  +----+-----+     |    00000001      |
       |           |    00000002      |
       v           |    00000003      |
  +----------+     +------------------+
  |   Head   |
  | (memory) |
  +----------+

The WAL is a directory of numbered segment files (default 128MB each). Each segment contains a sequence of records:

Series records: map a new series reference to its label set.
Sample records: a batch of (ref, timestamp, value) tuples.
Tombstone records: mark deleted time ranges.

On startup, Prometheus replays the WAL, starting from the last checkpoint, to reconstruct the head block. The logging side lives in tsdb/wlog/wlog.go; a simplified version:

func (w *WL) Log(recs ...[]byte) error {
    w.mtx.Lock()
    defer w.mtx.Unlock()

    for _, rec := range recs {
        // If current segment is full, cut a new one
        if w.curN > w.segmentSize {
            if err := w.cut(); err != nil {
                return err
            }
        }
        // Write record with length prefix and CRC
        if err := w.log(rec); err != nil {
            return err
        }
    }
    return nil
}
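
The "length prefix and CRC" idea, together with replay after a crash, can be sketched as below. This is a deliberately simplified framing (4-byte length, 4-byte CRC32, payload) chosen for the example; Prometheus's actual on-disk record layout is different, and the function names here are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
	"io"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

var errCorrupt = errors.New("corrupt or torn record")

// writeRecord frames rec as: length (4 bytes) | CRC32 (4 bytes) | payload.
func writeRecord(w io.Writer, rec []byte) error {
	var hdr [8]byte
	binary.BigEndian.PutUint32(hdr[0:4], uint32(len(rec)))
	binary.BigEndian.PutUint32(hdr[4:8], crc32.Checksum(rec, castagnoli))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(rec)
	return err
}

// encodeAll writes all records into one in-memory "segment".
func encodeAll(recs ...[]byte) ([]byte, error) {
	var buf bytes.Buffer
	for _, r := range recs {
		if err := writeRecord(&buf, r); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

// replay returns every intact record up to the first damaged one.
func replay(data []byte) ([][]byte, error) {
	r := bytes.NewReader(data)
	var recs [][]byte
	for {
		var hdr [8]byte
		if _, err := io.ReadFull(r, hdr[:]); err != nil {
			if err == io.EOF {
				return recs, nil // clean end of log
			}
			return recs, errCorrupt // header cut off mid-write
		}
		rec := make([]byte, binary.BigEndian.Uint32(hdr[0:4]))
		if _, err := io.ReadFull(r, rec); err != nil {
			return recs, errCorrupt
		}
		if crc32.Checksum(rec, castagnoli) != binary.BigEndian.Uint32(hdr[4:8]) {
			return recs, errCorrupt
		}
		recs = append(recs, rec)
	}
}

func main() {
	seg, err := encodeAll([]byte("series record"), []byte("sample record"))
	if err != nil {
		panic(err)
	}
	recs, err := replay(seg)
	fmt.Println(len(recs), err)

	seg[len(seg)-1] ^= 0xFF // simulate a damaged tail
	recs, err = replay(seg)
	fmt.Println(len(recs), err)
}
```

The checksum lets replay stop at the first damaged record and keep everything before it, which is the behaviour a crash-recovery log needs.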

WAL Checkpointing

The WAL grows continuously. To prevent unbounded growth, Prometheus periodically creates checkpoints — compressed snapshots of the still-relevant series data. After a checkpoint, older WAL segments are deleted:

  Before checkpoint:
  wal/
    00000001  (contains series + samples for old data)
    00000002
    00000003
    00000004  (most recent)

  After checkpoint at segment 2:
  wal/
    checkpoint.00000002/  (filtered copy of still-relevant records from 00000001-00000002)
    00000003
    00000004

  Segments 00000001 and 00000002 are deleted.
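
A rough sketch of what a checkpoint does: walk the old segments, keep only the records that still matter, and drop the segments themselves. The record type and the two keep criteria used here (the series is still alive in the head, the sample is not older than a minimum timestamp) are assumptions for illustration.

```go
package main

import "fmt"

// record is a hypothetical, already-decoded WAL record.
type record struct {
	kind string // "series" or "sample"
	ref  uint64
	t    int64 // only meaningful for samples
}

// checkpoint condenses segments[0..upTo] into a single list of records
// and returns it together with the segments that remain on disk.
func checkpoint(segments [][]record, upTo int, alive func(uint64) bool, minT int64) ([]record, [][]record) {
	var ckpt []record
	for _, seg := range segments[:upTo+1] {
		for _, r := range seg {
			switch r.kind {
			case "series":
				if alive(r.ref) { // drop series that no longer exist in the head
					ckpt = append(ckpt, r)
				}
			case "sample":
				if r.t >= minT { // drop samples already persisted in a block
					ckpt = append(ckpt, r)
				}
			}
		}
	}
	return ckpt, segments[upTo+1:]
}

func main() {
	segments := [][]record{
		{{kind: "series", ref: 1}, {kind: "series", ref: 2}, {kind: "sample", ref: 1, t: 100}},
		{{kind: "sample", ref: 2, t: 200}, {kind: "sample", ref: 1, t: 900}},
		{{kind: "sample", ref: 1, t: 1000}},
	}
	alive := func(ref uint64) bool { return ref == 1 } // series 2 has gone away
	ckpt, rest := checkpoint(segments, 1, alive, 500)
	fmt.Println("records in checkpoint:", len(ckpt))
	fmt.Println("segments left:", len(rest))
}
```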

Persistent Blocks: On-Disk Format

When the head block is cut, its data is serialized into an immutable block on disk. Each block is a directory with this structure:

  data/
  +-- 01BKGV7JC0RY8A6MACW02A2PJD/     <-- block ULID
  |   +-- meta.json                     <-- time range, stats
  |   +-- index                         <-- label index + postings
  |   +-- chunks/
  |   |   +-- 000001                    <-- chunk data files
  |   +-- tombstones                    <-- deleted ranges
  +-- 01BKGTZQ1SYQJTR4PB43C8PD98/
  |   +-- ...
  +-- wal/
      +-- ...

The Index File

The index file is the critical piece for query performance. It maps label names/values to series, and series to their chunk locations. Its structure:

  Index File Layout
  +------------------+
  |  Symbol Table    |  all unique strings (metric names, label values)
  +------------------+
  |  Series          |  sorted list of (labels, chunk_refs[])
  +------------------+
  |  Label Indices   |  for each label name: sorted list of values
  +------------------+
  |  Postings        |  for each label pair: sorted list of series IDs
  +------------------+
  |  Postings        |
  |  Offset Table    |  lookup: label pair --> offset in postings section
  +------------------+
  |  TOC (trailer)   |  offsets to each section above
  +------------------+

Postings are the key concept. A posting list for job="api" contains the sorted IDs of every series with that label. To resolve a query like {job="api", method="GET"}, Prometheus:

Loads the posting list for job="api" → [1, 3, 5, 7, 9, ...]
Loads the posting list for method="GET" → [2, 3, 6, 7, 10, ...]
Intersects them → [3, 7, ...]
For each resulting series ID, reads the chunk references to find the actual data.

This is the same inverted-index approach that search engines use. The implementation is in tsdb/index/index.go.
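
The intersection step is a classic merge over two sorted lists. A minimal sketch, using a plain map as a hypothetical in-memory postings index and the example lists from above:

```go
package main

import "fmt"

// intersect returns the IDs present in both sorted posting lists.
// It advances whichever cursor points at the smaller ID, so the cost
// is linear in the combined length of the lists.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

func main() {
	// Hypothetical postings index: label pair -> sorted series IDs.
	postings := map[string][]uint64{
		`job="api"`:    {1, 3, 5, 7, 9},
		`method="GET"`: {2, 3, 6, 7, 10},
	}
	fmt.Println(intersect(postings[`job="api"`], postings[`method="GET"`])) // [3 7]
}
```

Keeping the lists sorted is what makes this cheap: no hashing or extra allocation is needed beyond the result slice.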

Chunk Files

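Chunk files store the actual compressed sample data. After a short file header (magic number and version), each chunk is written as a length-prefixed record with a checksum:

  Chunk file format:
  +-----------+----------+-----------+-----------+
  | len       | encoding | data      | CRC32     |
  | (uvarint) | (1 byte) | (XOR)     | (4 bytes) |
  +-----------+----------+-----------+-----------+
  | next chunk...                                |
  +----------------------------------------------+

The owning series and each chunk’s min/max timestamps are not repeated in the chunk file; they are recorded in the index next to the chunk reference.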

The index stores references that point directly into these files (file number + byte offset), so reading a chunk requires a single seek.
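
One way to encode "file number + byte offset" in a single reference is to pack both into a 64-bit integer. The sketch below assumes the file index sits in the upper 32 bits and the offset in the lower 32 bits; treat the exact bit layout and the function names as assumptions of this example.

```go
package main

import "fmt"

// packChunkRef combines a chunk file index and a byte offset into one value.
func packChunkRef(fileIndex, offset uint32) uint64 {
	return uint64(fileIndex)<<32 | uint64(offset)
}

// unpackChunkRef splits a reference back into file index and offset.
func unpackChunkRef(ref uint64) (fileIndex, offset uint32) {
	return uint32(ref >> 32), uint32(ref)
}

func main() {
	ref := packChunkRef(3, 4096)
	file, off := unpackChunkRef(ref)
	fmt.Printf("ref=%#x file=%d offset=%d\n", ref, file, off)
}
```

A reader can then open the right file and seek straight to the offset without consulting any other structure.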

Compaction: Merging Blocks

Over time, Prometheus accumulates many small 2-hour blocks. Compaction merges adjacent blocks into larger ones, reducing the number of blocks to scan during queries and enabling better compression.

  Before compaction:
  |<-2h->|<-2h->|<-2h->|<-2h->|<-2h->|<-2h->|
  +------+------+------+------+------+------+
  | Blk1 | Blk2 | Blk3 | Blk4 | Blk5 | Blk6 |
  +------+------+------+------+------+------+

  After compaction (exponential growth):
  |<-------6h-------->|<-------6h-------->|
  +-------------------+-------------------+
  | Compacted Block A | Compacted Block B |
  +-------------------+-------------------+

  Further (once three 6h blocks exist):
  |<-----------18h----------->|
  +---------------------------+
  | Compacted Block C         |
  +---------------------------+

The compaction strategy uses an exponential scheme with a factor of 3: the default progression is 2h → 6h → 18h → 54h and so on, with the largest block capped at 10% of the retention window (and at most 31 days). The ranges are generated in tsdb/compact.go:

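// ExponentialBlockRanges returns `steps` block ranges, starting at minSize
// and multiplying by stepSize at each step. Appending (rather than indexing
// ranges[0]) keeps the function safe when steps is 0.
func ExponentialBlockRanges(minSize int64, steps, stepSize int) []int64 {
    ranges := make([]int64, 0, steps)
    curRange := minSize
    for i := 0; i < steps; i++ {
        ranges = append(ranges, curRange)
        curRange *= int64(stepSize)
    }
    return ranges
}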
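
As a usage sketch, the following self-contained program reproduces the 2h → 6h → 18h → 54h progression. The function is re-declared locally under a lowercase name so the example runs on its own; the inputs (a 2-hour minimum in milliseconds, 4 steps, step size 3) are chosen to match the progression quoted above.

```go
package main

import (
	"fmt"
	"time"
)

// exponentialBlockRanges mirrors the function discussed above:
// each range is stepSize times the previous one.
func exponentialBlockRanges(minSize int64, steps, stepSize int) []int64 {
	ranges := make([]int64, 0, steps)
	cur := minSize
	for i := 0; i < steps; i++ {
		ranges = append(ranges, cur)
		cur *= int64(stepSize)
	}
	return ranges
}

func main() {
	twoHoursMs := int64(2 * time.Hour / time.Millisecond)
	for _, r := range exponentialBlockRanges(twoHoursMs, 4, 3) {
		// Convert milliseconds back to a Duration for readable output.
		fmt.Println(time.Duration(r) * time.Millisecond)
	}
}
```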

During compaction, the engine:

Merges the index files (union of postings, combined series).
Re-encodes chunks for the merged time range.
Applies tombstones (removes deleted data permanently).
Writes a new block directory.
Atomically swaps the old blocks for the new one (by updating meta.json and removing old directories).

Query Path: How Reads Work

A PromQL query like rate(http_requests_total{job="api"}[5m]) triggers this sequence:

  PromQL Engine
       |
       | 1. Determine time range [now-5m, now]
       v
  +------------------+
  |   DB.Querier()   |  returns a Querier spanning all relevant blocks
  +--------+---------+
           |
           | 2. Find which blocks overlap the time range
           v
  +--------+---------+----------+
  | Block A Querier  | Head     |
  | (disk)           | Querier  |
  +--------+---------+----+-----+
           |              |
           | 3. Each querier resolves label matchers via postings
           v              v
  +----------------+  +----------------+
  | Posting lists  |  | Posting lists  |
  | intersect      |  | intersect      |
  | --> series IDs |  | --> series IDs |
  +-------+--------+  +-------+--------+
          |                    |
          | 4. Load chunks for matching series in time range
          v                    v
  +----------------+  +----------------+
  | Chunk iterator |  | Chunk iterator |
  | (from disk)    |  | (from memory)  |
  +-------+--------+  +-------+--------+
          |                    |
          +--------+-----------+
                   |
                   | 5. Merge iterators (time-ordered)
                   v
          +------------------+
          | MergedSeriesSet  |
          +--------+---------+
                   |
                   | 6. Apply PromQL function (rate, sum, etc.)
                   v
              Query Result

The key insight: each block is self-contained with its own index. The query engine creates a querier per block, each independently resolves label matchers to series, then results are merged. This means queries scale with the number of blocks that overlap the time range, not the total data size.

A simplified version of DB.Querier, which lives in tsdb/db.go (the per-block querier is in tsdb/querier.go):

func (db *DB) Querier(mint, maxt int64) (storage.Querier, error) {
    var blocks []BlockReader
    for _, b := range db.blocks {
        if b.OverlapsClosedInterval(mint, maxt) {
            blocks = append(blocks, b)
        }
    }
    // Include the head only if the query range reaches into it
    if maxt >= db.head.MinTime() {
        blocks = append(blocks, db.head)
    }

    var queriers []storage.Querier
    for _, b := range blocks {
        q, err := NewBlockQuerier(b, mint, maxt)
        if err != nil {
            return nil, err
        }
        queriers = append(queriers, q)
    }
    return storage.NewMergeQuerier(queriers...), nil
}
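
Two pieces of this path are easy to show in isolation: deciding whether a block overlaps the query range, and merging per-block results in time order. The sketch below uses hypothetical blockMeta and sample types, assumes a block covers the half-open range [minT, maxT), and resolves duplicate timestamps by keeping the first input's sample.

```go
package main

import "fmt"

// blockMeta describes the time range a block covers: [minT, maxT).
type blockMeta struct {
	minT, maxT int64
}

// overlaps reports whether the block intersects the closed query
// interval [mint, maxt].
func overlaps(b blockMeta, mint, maxt int64) bool {
	return b.minT <= maxt && mint < b.maxT
}

type sample struct {
	t int64
	v float64
}

// mergeSamples merges two time-ordered slices into one, keeping the
// sample from a when both contain the same timestamp.
func mergeSamples(a, b []sample) []sample {
	out := make([]sample, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i].t < b[j].t:
			out = append(out, a[i])
			i++
		case a[i].t > b[j].t:
			out = append(out, b[j])
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	out = append(out, a[i:]...)
	out = append(out, b[j:]...)
	return out
}

func main() {
	blocks := []blockMeta{{0, 100}, {100, 200}, {200, 300}}
	for _, b := range blocks {
		fmt.Println(b, overlaps(b, 150, 220))
	}
	fromBlock := []sample{{190, 1}, {195, 2}}
	fromHead := []sample{{195, 2}, {200, 3}}
	fmt.Println(mergeSamples(fromBlock, fromHead))
}
```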

Memory-Mapping: Balancing Memory and Disk

Not all chunk data stays in memory. To reduce RAM usage, Prometheus writes completed head chunks to disk and memory-maps them. For each series:

The active chunk (the one being appended to) stays in memory.
Older chunks are flushed to a memory-mapped file and accessed on demand via mmap.

  memSeries lifecycle:

  Time -->
  +----------+----------+----------+----------+
  | chunk 1  | chunk 2  | chunk 3  | chunk 4  |  (active)
  | mmapped  | mmapped  | mmapped  | in-memory |
  +----------+----------+----------+----------+
       |          |          |
       v          v          v
  +----------------------------------+
  |  chunks_head/                    |
  |    000001 (mmap'd file)          |
  +----------------------------------+

This means a series with hours of data in the head only consumes heap memory for its most recent chunk (~120 samples). The mmapped chunks are served through the kernel’s page cache: they are loaded when queried and evicted under memory pressure. The implementation is in tsdb/chunks/head_chunks.go.

Handling High Cardinality

Cardinality — the number of unique time series — is the primary scaling challenge. Each series needs an entry in the head’s hash map, a posting list entry, and index space. Prometheus tracks this via the prometheus_tsdb_head_series metric.

The stripeSeries structure in tsdb/head.go shards the series map across a fixed number of stripes (16,384 by default) to reduce lock contention. In simplified form:

// DefaultStripeSize is the default number of stripes.
const DefaultStripeSize = 1 << 14

type stripeSeries struct {
    size   int
    series []map[chunks.HeadSeriesRef]*memSeries // sharded by series ref
    hashes []seriesHashmap                       // sharded by label hash
    locks  []stripeLock                          // one lock per stripe
}
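
The striping idea can be sketched with a small generic-purpose map. The stripedMap type below is hypothetical and much simpler than the real structure: it shards only by series reference, stores label sets as strings, and requires a power-of-two stripe count so that a bit mask can pick the stripe.

```go
package main

import (
	"fmt"
	"sync"
)

// stripedMap spreads entries over several independently locked maps so
// that writers touching different stripes do not block each other.
type stripedMap struct {
	size   uint64 // number of stripes; must be a power of two
	series []map[uint64]string
	locks  []sync.RWMutex
}

func newStripedMap(size int) *stripedMap {
	m := &stripedMap{
		size:   uint64(size),
		series: make([]map[uint64]string, size),
		locks:  make([]sync.RWMutex, size),
	}
	for i := range m.series {
		m.series[i] = make(map[uint64]string)
	}
	return m
}

func (m *stripedMap) set(ref uint64, labels string) {
	i := ref & (m.size - 1) // pick the stripe from the low bits of ref
	m.locks[i].Lock()
	m.series[i][ref] = labels
	m.locks[i].Unlock()
}

func (m *stripedMap) get(ref uint64) (string, bool) {
	i := ref & (m.size - 1)
	m.locks[i].RLock()
	defer m.locks[i].RUnlock()
	s, ok := m.series[i][ref]
	return s, ok
}

func main() {
	m := newStripedMap(16)
	var wg sync.WaitGroup
	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				ref := uint64(w*1000 + i + 1)
				m.set(ref, fmt.Sprintf("series_%d", ref))
			}
		}(w)
	}
	wg.Wait()

	total := 0
	for i := range m.series {
		total += len(m.series[i])
	}
	fmt.Println("series stored:", total)
}
```

With a single global lock the eight writers would serialize; with stripes they only contend when two references happen to land in the same stripe.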

When a scrape target exposes 100,000 series and you have 50 targets, that’s 5 million series, each needing hash map entries, chunk buffers, and WAL records. As a rough rule of thumb, a single server on commodity hardware tops out at around 10 million active series before memory and CPU become bottlenecks; the real limit depends heavily on label sizes, series churn, and available RAM.

Retention and Deletion

Prometheus supports two retention modes:

Time-based (default 15 days): blocks whose maxTime is older than the retention period are deleted.
Size-based: when total block size exceeds the configured limit, oldest blocks are removed first.

Deletion happens at block granularity — entire block directories are removed. For deleting specific series within a block’s time range, Prometheus writes tombstones rather than rewriting the block. The tombstones are applied during queries (skipping matching ranges) and permanently removed during the next compaction.
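
Both retention modes reduce to picking a cut point in a newest-first list of blocks. The sketch below is an illustration with a hypothetical block type; it measures age relative to the newest block's maxT (an assumption of this example) and treats a limit of 0 as "disabled".

```go
package main

import "fmt"

type block struct {
	minT, maxT int64 // time range covered by the block
	size       int64 // bytes on disk
}

// deletable expects blocks sorted newest first and returns the blocks
// that fall outside either retention limit. Once one block is past a
// limit, every older block is deleted as well.
func deletable(blocks []block, retention, maxBytes int64) []block {
	cut := len(blocks)
	var total int64
	for i, b := range blocks {
		total += b.size
		tooOld := retention > 0 && blocks[0].maxT-b.maxT >= retention
		tooBig := maxBytes > 0 && total > maxBytes
		if tooOld || tooBig {
			cut = i
			break
		}
	}
	return blocks[cut:]
}

func main() {
	blocks := []block{
		{minT: 80, maxT: 100, size: 40},
		{minT: 60, maxT: 80, size: 40},
		{minT: 30, maxT: 50, size: 40},
		{minT: 0, maxT: 20, size: 40},
	}
	fmt.Println("time-based:", len(deletable(blocks, 60, 0)))
	fmt.Println("size-based:", len(deletable(blocks, 0, 100)))
}
```

In this example, time-based retention of 60 removes only the oldest block, while a 100-byte size limit removes the two oldest.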

Performance Characteristics

  Operation            Performance
  -------------------  ---------------------------------------------
  Write (append)       ~1-2 million samples/sec on SSD
  WAL write            Sequential, ~500MB/s on NVMe
  Query (recent data)  Microseconds (in-memory head)
  Query (historical)   Depends on block count + disk IOPS
  Compaction           Background, ~100MB/s throughput
  Storage per sample   ~1.3-1.7 bytes (XOR compressed)
  Series lookup        O(1) hash map (head), O(log n) index (blocks)

The write path is extremely fast because:

Appends are in-memory (just incrementing a chunk buffer).
WAL writes are sequential (SSDs excel at this).
No per-sample disk sync (WAL segments are fsynced periodically, not per write).

The trade-off: a crash can lose the last few seconds of data (between WAL fsyncs). For monitoring, this is acceptable.

How It All Fits Together

Here is the complete lifecycle of a sample:

  1. Scrape target --> sample (timestamp, value, labels)
                          |
  2. Appender.Append()    |
                          v
  3. WAL.Log()       +----------+
     (disk, seq)     |   WAL    |  crash safety
                     +----------+
                          |
  4. Head.append()        v
     (memory)        +----------+
                     |   Head   |  serves recent queries
                     | memSeries|
                     |  chunks  |
                     +----+-----+
                          |
  5. After 2h:           v          head "cut"
                     +----------+
                     | Block    |  immutable on disk
                     | (index + |
                     |  chunks) |
                     +----+-----+
                          |
  6. Compaction:         v          merge small blocks
                     +----------+
                     | Larger   |
                     | Block    |
                     +----------+
                          |
  7. Retention:          v          delete old blocks
                        [gone]

References

Prometheus TSDB design doc, Fabian Reinartz (2017), blog post
Gorilla: A Fast, Scalable, In-Memory Time Series Database (Facebook, 2015), paper
prometheus/prometheus TSDB implementation: tsdb/
Head block implementation: tsdb/head.go
XOR chunk encoding: tsdb/chunkenc/xor.go
Write-Ahead Log: tsdb/wlog/wlog.go
Block compaction: tsdb/compact.go
Index file format: tsdb/docs/format/index.md
Prometheus storage documentation