Table of contents
Open Table of contents
- Context
- The Commit Log: Kafka’s Core Abstraction
- How Messages Land on Disk
- How the Offset Index Works
- Replication: How Kafka Survives Failures
- Producer: How Messages Get Into Kafka
- Consumer Groups: How Messages Get Out of Kafka
- Zero-Copy: Why Kafka Is Fast
- Putting It All Together: A Record’s Journey
- Key Design Decisions That Make Kafka Work
- References
Context
Imagine you are building a food delivery app. Orders come in from customers. The kitchen needs to know about new orders. The delivery team needs to know when food is ready. The billing system needs to charge the customer. The analytics dashboard wants to track order volume.
You could wire each system directly to every other system, but that quickly becomes a mess:
Customer App ----> Kitchen System
Customer App ----> Billing System
Customer App ----> Analytics
Kitchen System --> Delivery System
Kitchen System --> Analytics
Billing System --> Analytics
...
Every time you add a new system, you have to update all the others. This is the coupling problem. What you really want is a central place where events land, and each downstream system picks up what it needs at its own pace.
That central place is Apache Kafka — a distributed, fault-tolerant commit log that decouples producers (systems that generate events) from consumers (systems that process events).
Apache Kafka Cluster
+-----------+ +-----------------------------------+ +-----------+
| Customer |---->| |---->| Kitchen |
| App | | Kafka Brokers | | System |
+-----------+ | | +-----------+
| +-------+ +-------+ +-------+ |
+-----------+ | |Broker | |Broker | |Broker | | +-----------+
| Kitchen |---->| | 0 | | 1 | | 2 | |---->| Delivery |
| System | | +-------+ +-------+ +-------+ | | System |
+-----------+ | | +-----------+
| Topics split into partitions, |
| replicated across brokers | +-----------+
| |---->| Analytics |
+-----------------------------------+ +-----------+
Kafka was originally built at LinkedIn in 2010 to handle the firehose of activity events (page views, searches, connections) flowing between hundreds of services. It was open-sourced in 2011 and became an Apache top-level project in 2012. Today it handles trillions of messages per day at companies like LinkedIn, Netflix, Uber, and Pinterest.
Let’s open the hood and see how it works.
The Commit Log: Kafka’s Core Abstraction
At its heart, Kafka is a distributed, append-only commit log. If you’ve worked with databases, you know about the Write-Ahead Log (WAL) — a sequential file where every change is recorded before it’s applied. Kafka takes this idea and makes it the entire product.
A topic is a named stream of events (like orders, payments, page-views). Each topic is split into one or more partitions. Each partition is an independent, ordered, append-only log:
Topic: "orders"
Partition 0: [0] [1] [2] [3] [4] [5] [6] [7] ...
Partition 1: [0] [1] [2] [3] [4] [5] ...
Partition 2: [0] [1] [2] [3] [4] [5] [6] [7] [8] ...
^ ^
| |
oldest newest
message message
Each box is a record with an offset (sequential ID).
Order is guaranteed WITHIN a partition, not across partitions.
Why partitions? Parallelism. If you have one giant log, only one machine can write to it and one machine can read from it. By splitting a topic into partitions, you spread the load across multiple brokers (Kafka servers). Each partition lives on a different broker, so multiple producers and consumers can work in parallel.
How Messages Land on Disk
When a producer sends a record to a partition, the broker appends it to the active log segment on disk. A partition’s data is stored as a sequence of segment files:
Partition directory: /data/kafka/orders-0/
00000000000000000000.log <-- segment file (actual messages)
00000000000000000000.index <-- offset index
00000000000000000000.timeindex <-- timestamp index
00000000000000345678.log <-- next segment (starts at offset 345678)
00000000000000345678.index
00000000000000345678.timeindex
Each segment is named after its base offset — the first offset it contains. The three files for each segment serve different purposes:
.log— The actual message bytes, written sequentially. Each record contains a key, value, timestamp, headers, and offset..index— A sparse index mapping offsets to byte positions in the.logfile. “Sparse” means it stores an entry every N records (configurable), not every record..timeindex— Maps timestamps to offsets, enabling time-based lookups like “give me all messages after 2pm.”
The key class that manages a single segment is LogSegment:
public class LogSegment implements Closeable {
private final FileRecords log; // the .log file
private final OffsetIndex offsetIndex; // the .index file
private final TimeIndex timeIndex; // the .timeindex file
private final long baseOffset; // first offset in this segment
// Append a batch of records to this segment
public void append(long largestOffset, long largestTimestamp,
long shiftingOffset, MemoryRecords records) {
int appendedBytes = log.append(records);
// Update the offset index (sparse, every indexIntervalBytes)
if (bytesSinceLastIndexEntry > indexIntervalBytes) {
offsetIndex.append(largestOffset, physicalPosition);
timeIndex.maybeAppend(largestTimestamp, largestOffset);
bytesSinceLastIndexEntry = 0;
}
}
}
When a segment grows beyond the configured size (default 1 GB) or age, the broker “rolls” it — closes the current segment and creates a new one. Old segments are eligible for deletion or compaction based on the topic’s retention policy.
The full partition log is managed by UnifiedLog, which coordinates multiple segments and tracks key offsets:
UnifiedLog for partition "orders-0"
+------------------------------------------------------------+
| |
| logStartOffset highWatermark logEndOffset |
| | | | |
| v v v |
| [0] [1] ... [345677] [345678] ... [345700] [345701] |
| |--- segment 0 ---| |------ active segment --------| |
| |
| logStartOffset: earliest available (older data deleted) |
| highWatermark: last offset replicated to all ISR |
| logEndOffset: next offset to be written (LEO) |
+------------------------------------------------------------+
The high watermark is crucial: consumers can only read up to this point. It represents the last offset that has been safely replicated to all in-sync replicas. We’ll see why this matters in the replication section.
How the Offset Index Works
When a consumer asks to read from offset 345,690, Kafka needs to find the byte position in the .log file quickly. It does a two-step lookup:
Consumer requests: "read from offset 345690"
Step 1: Binary search the .index file
+-------------------------------------+
| offset | position (bytes in .log) |
|---------|---------------------------|
| 345678 | 0 |
| 345682 | 4096 | <-- largest entry <= 345690
| 345695 | 12288 |
| ... | ... |
+-------------------------------------+
Result: start scanning from byte 4096
Step 2: Sequential scan from byte 4096 in .log
until we find offset 345690
The index files are memory-mapped (mmap), so the OS kernel manages caching them in RAM. This is why Kafka doesn’t need a large JVM heap — the OS page cache does the heavy lifting.
The OffsetIndex class handles this:
public class OffsetIndex extends AbstractIndex {
// Each entry is 8 bytes: 4 bytes relative offset + 4 bytes position
private static final int ENTRY_SIZE = 8;
// Binary search to find the largest offset <= targetOffset
public OffsetPosition lookup(long targetOffset) {
// ... binary search over memory-mapped entries ...
}
}
Each index entry is tiny: 4 bytes for a relative offset (relative to the segment’s base offset, saving space) and 4 bytes for the file position. This keeps the index small enough to stay in memory even for very large partitions.
Replication: How Kafka Survives Failures
A single copy of your data on one machine is a disaster waiting to happen. Kafka replicates every partition across multiple brokers. When you create a topic, you set a replication factor (commonly 3):
Topic: "orders", 3 partitions, replication factor = 3
Broker 0 Broker 1 Broker 2
+-------------+ +-------------+ +-------------+
| P0 (leader) | | P0 (follower)| | P0 (follower)|
| P1 (follower)| | P1 (leader) | | P1 (follower)|
| P2 (follower)| | P2 (follower)| | P2 (leader) |
+-------------+ +-------------+ +-------------+
Each partition has exactly one leader and N-1 followers.
All reads and writes go through the leader.
The ISR (In-Sync Replicas) Set
Not all followers are necessarily up-to-date. Kafka tracks which replicas are “in sync” with the leader in a set called the ISR (In-Sync Replicas):
Partition 0 replication state:
Leader (Broker 0): offset 0 1 2 3 4 5 6 7 8 9
^
LEO = 10
Follower (Broker 1): offset 0 1 2 3 4 5 6 7
^
LEO = 8 (in ISR)
Follower (Broker 2): offset 0 1 2 3
^
LEO = 4 (REMOVED from ISR)
High Watermark = min(LEO of all ISR members) = 8
Consumers can read up to offset 7.
A follower stays in the ISR as long as it has fetched data from the leader within replica.lag.time.max.ms (default 30 seconds). If a follower falls behind (due to slow disk, network issues, or GC pauses), the leader removes it from the ISR.
The ReplicaManager tracks ISR membership:
// Periodically check if any replica should be removed from ISR
def maybeShrinkIsr(): Unit = {
// For each partition this broker leads:
// Check each replica's last fetch time
// If (now - lastFetchTime) > replicaLagTimeMaxMs:
// Remove replica from ISR
// Update high watermark
}
Leader Election
When a leader broker goes down, the cluster controller picks a new leader from the ISR. This is fast because ISR members are guaranteed to have all committed data:
Before failure:
ISR = {Broker 0 (leader), Broker 1, Broker 2}
Broker 0 crashes:
Controller detects via heartbeat timeout
New leader = first available broker in ISR
ISR = {Broker 1 (new leader), Broker 2}
Broker 0 comes back:
Truncates its log to the high watermark
Fetches missing data from new leader
Rejoins ISR once caught up
In older versions of Kafka, the controller ran on a single broker elected via ZooKeeper. Since Kafka 3.3 (2022), the controller uses KRaft (Kafka Raft) — a built-in Raft consensus protocol — eliminating the ZooKeeper dependency entirely.
Producer: How Messages Get Into Kafka
When your application calls producer.send(record), the record doesn’t immediately fly over the network. The producer client batches records for efficiency.
The RecordAccumulator
The RecordAccumulator sits inside the producer client and groups records by destination partition:
Producer Application
+--------------------------------------------------+
| |
| producer.send(order1) --+ |
| producer.send(order2) --+--> RecordAccumulator |
| producer.send(order3) --+ | |
| | |
| +----------+----------+ |
| | | |
| Partition 0 Partition 1
| batch: batch:
| [order1, order3] [order2]
| | | |
| +----------+----------+ |
| | |
| Sender thread |
| (background, async) |
| | |
+------------------------------------+-------------+
|
v
Broker (network)
The accumulator uses a BufferPool bounded by buffer.memory (default 32 MB) to manage memory. Records accumulate until one of these triggers fires:
- Batch full — the batch reaches
batch.size(default 16 KB). - Linger expired —
linger.mshas elapsed since the first record was added (default 0ms, meaning send immediately). - Memory pressure — the buffer pool is nearly full.
- Explicit flush — the application calls
producer.flush().
public class RecordAccumulator {
private final BufferPool free; // bounded memory pool
private final ConcurrentMap<TopicPartition,
Deque<ProducerBatch>> batches; // per-partition batch queue
public RecordAppendResult append(String topic, int partition,
byte[] key, byte[] value, ...) {
// Find or create a batch for this partition
Deque<ProducerBatch> dq = getOrCreateDeque(tp);
// Try to append to the last batch
ProducerBatch last = dq.peekLast();
if (last != null) {
FutureRecordMetadata future = last.tryAppend(key, value, ...);
if (future != null) return new RecordAppendResult(future, ...);
}
// Batch is full or doesn't exist — allocate a new one
ByteBuffer buffer = free.allocate(batchSize);
ProducerBatch batch = new ProducerBatch(tp, buffer);
dq.addLast(batch);
// ...
}
}
Acknowledgment Levels (acks)
The acks configuration controls how many brokers must confirm a write before the producer considers it successful:
acks=0: Fire and forget
+----------+ +--------+
| Producer |---send-------->| Leader | No response waited.
+----------+ +--------+ Fastest, but data can be lost.
acks=1: Leader confirms
+----------+ +--------+
| Producer |---send-------->| Leader |---> writes to local log
+----------+<------ack------+--------+ Leader crash before replication
means data loss.
acks=all (-1): Full ISR confirms
+----------+ +--------+ +----------+ +----------+
| Producer |---send-------->| Leader |---->| Follower | | Follower |
+----------+ +--------+ +----------+ +----------+
| wait for all ISR to replicate...
+----------+<------ack---------+
| Producer | Slowest, but no data loss as
+----------+ long as at least one ISR survives.
With acks=all, the leader waits until every broker in the ISR has written the record before responding. Combined with min.insync.replicas=2, this guarantees that at least 2 copies exist before the producer gets a success response.
Consumer Groups: How Messages Get Out of Kafka
A consumer group is a set of consumer instances that cooperate to read from a topic. Kafka assigns each partition to exactly one consumer in the group, so records are processed in parallel without duplication:
Topic "orders" with 4 partitions
Consumer Group: "kitchen-service"
+------+ +------+ +------+ +------+
| P0 | | P1 | | P2 | | P3 |
+--+---+ +--+---+ +--+---+ +--+---+
| | | |
v v v v
+-------+ +-------+ +-------+ +-------+
|Consumer| |Consumer| |Consumer| |Consumer|
| A | | B | | B | | C |
+-------+ +-------+ +-------+ +-------+
Consumer A reads P0, Consumer B reads P1+P2, Consumer C reads P3.
If Consumer B crashes, its partitions are reassigned to A and C.
Rebalancing
When a consumer joins or leaves a group, Kafka triggers a rebalance — partitions are redistributed among the remaining consumers. The process is coordinated by a GroupCoordinator, one of the brokers designated for this group.
Rebalance triggered (Consumer B crashed)
1. Remaining consumers send JoinGroup to coordinator
2. Coordinator picks a "leader" consumer
3. Leader consumer runs the partition assignment strategy
(e.g., RangeAssignor, RoundRobinAssignor)
4. Leader sends assignments to coordinator
5. Coordinator sends SyncGroup to each consumer with its partitions
Before: A=[P0] B=[P1,P2] C=[P3]
After: A=[P0,P1] (dead) C=[P2,P3]
The GroupCoordinator manages the lifecycle:
public interface GroupCoordinator {
// Consumer joins the group
CompletableFuture<JoinGroupResponseData> joinGroup(JoinGroupRequestData request);
// Consumer syncs its assignment
CompletableFuture<SyncGroupResponseData> syncGroup(SyncGroupRequestData request);
// Consumer commits its progress
CompletableFuture<OffsetCommitResponseData> commitOffsets(OffsetCommitRequestData request);
// Consumer sends heartbeat to stay alive
CompletableFuture<HeartbeatResponseData> heartbeat(HeartbeatRequestData request);
}
Offset Tracking
Each consumer tracks which records it has processed by committing offsets to a special internal topic called __consumer_offsets:
Consumer A reads from partition 0:
Partition 0: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
^ ^
| |
committed offset current position
(saved in __consumer_offsets)
If Consumer A crashes and restarts (or another consumer takes over),
it reads the committed offset and resumes from offset 5.
The __consumer_offsets topic has 50 partitions by default. A consumer group’s coordinator is the broker that leads the partition determined by hash(group_id) % 50. This spreads coordinator load across the cluster.
Offset commits are just regular Kafka messages written to __consumer_offsets, which means they are replicated and durable just like any other topic.
Zero-Copy: Why Kafka Is Fast
When a consumer fetches data, a naive implementation would:
Disk --> Kernel Buffer --> User Space (JVM) --> Kernel Buffer --> Network
4 copies, 2 context switches. Slow.
Kafka avoids this using zero-copy transfer via Java NIO’s FileChannel.transferTo(), which maps to the Linux sendfile() system call:
Disk --> Kernel Buffer -----> Network
(DMA) (DMA, direct)
0 copies through user space. 2 DMA transfers only.
The data goes straight from the OS page cache to the network interface card, never touching the JVM heap. This is why Kafka can serve consumers at disk-sequential-read speed (hundreds of MB/s per partition) with minimal CPU overhead.
This works because Kafka stores messages in the same binary format on disk, in memory, and over the network. There is no deserialization or transformation on the broker — it simply moves bytes from file to socket.
Putting It All Together: A Record’s Journey
Let’s trace a single order event from producer to consumer:
Step 1: Producer sends record
+----------+
| Producer | record: {key: "order-42", value: "{...}"}
+----+-----+
|
| 1a. Serializer converts key and value to bytes
| 1b. Partitioner picks partition (hash(key) % numPartitions)
| hash("order-42") % 3 = partition 1
| 1c. RecordAccumulator batches it with other partition-1 records
| 1d. Sender thread ships batch to Broker 1 (leader of partition 1)
|
v
Step 2: Leader broker appends to log
+----------+
| Broker 1 | (leader of orders-1)
+----+-----+
|
| 2a. Append batch to active LogSegment (.log file)
| 2b. Update offset index (.index) and time index (.timeindex)
| 2c. Assign offset 5678 to this record
| 2d. Follower brokers fetch and replicate the data
| 2e. Once ISR confirms, advance high watermark
| 2f. If acks=all, send acknowledgment to producer
|
v
Step 3: Consumer reads the record
+----------+
| Consumer | (consumer group "kitchen-service", assigned partition 1)
+----+-----+
|
| 3a. Fetch request: "give me records from partition 1, offset 5678"
| 3b. Broker finds the segment containing offset 5678
| 3c. Binary search the .index to find byte position
| 3d. Zero-copy transfer from .log file to network socket
| 3e. Consumer deserializes and processes the record
| 3f. Consumer commits offset 5679 to __consumer_offsets
|
Done! The kitchen system knows about order 42.
Key Design Decisions That Make Kafka Work
Five design choices set Kafka apart:
Decision Why it matters
------------------------- ------------------------------------------
1. Append-only log Sequential disk I/O is ~100x faster than
random I/O. Kafka writes never seek.
2. Partitioning Horizontal scaling: add brokers and
partitions to increase throughput linearly.
3. Pull-based consumers Consumers control their own pace. A slow
analytics job doesn't block a fast kitchen
service. Consumers can even rewind.
4. Retention, not deletion Records stay for a configurable time
(default 7 days) regardless of whether
they've been consumed. Multiple consumer
groups can read independently.
5. Zero-copy I/O The broker is a "dumb pipe" that moves
bytes efficiently. Serialization lives in
clients, not the broker.
References
- Apache Kafka documentation — Design doc
- Kafka: a Distributed Messaging System for Log Processing (original paper) paper
- LogSegment implementation
storage/.../log/LogSegment.java - UnifiedLog implementation
storage/.../log/UnifiedLog.java - OffsetIndex implementation
storage/.../log/OffsetIndex.java - RecordAccumulator (producer batching)
clients/.../producer/internals/RecordAccumulator.java - GroupCoordinator (consumer groups)
group-coordinator/.../GroupCoordinator.java - ReplicaManager (replication)
core/.../server/ReplicaManager.scala - KRaft: Apache Kafka Without ZooKeeper KIP-500