Context
In October 1986, the internet experienced its first congestion collapse. The throughput between Lawrence Berkeley Laboratory and UC Berkeley — nodes connected by a 32 Kbps link — dropped from 32 Kbps to 40 bps, a factor-of-800 degradation. Packets were being sent, lost, retransmitted, lost again, retransmitted again, flooding the network with copies of data that would never arrive.
Van Jacobson diagnosed the problem and, in his 1988 paper Congestion Avoidance and Control, introduced the algorithms that still form the backbone of TCP congestion control today.
The core insight: a shared network link is a commons. If every sender transmits as fast as possible, the shared buffers in routers overflow, packets are dropped, and everyone suffers. TCP needs a mechanism to:
- Discover how much bandwidth is available.
- Back off when the network is congested.
- Share fairly among competing flows.
All of this must happen without explicit feedback from routers (in classic TCP). The sender must infer congestion from indirect signals: packet loss and round-trip time changes.
The Congestion Collapse Problem
Sender A ───┐
│
Sender B ───┼───[ Router ]───── Bottleneck Link ─────[ Router ]───── Receivers
│ │
Sender C ───┘ │
▼
Buffer overflow!
Packets dropped.
Senders retransmit.
More packets flood in.
Throughput collapses.
The Congestion Window (cwnd)
TCP uses a variable called cwnd (congestion window) to limit how many bytes can be “in flight” — sent but not yet acknowledged. This is distinct from the receive window (rwnd) advertised by the receiver.
The effective sending window is:
effective_window = min(cwnd, rwnd)
The receiver says “I can buffer X bytes” via rwnd. The sender independently tracks “I think the network can handle Y bytes” via cwnd. The sender transmits whichever is smaller.
Sender's View of the Pipe
+-------+-------+-------+-------+-------+-------+-------+
| Seg | Seg | Seg | Seg | Seg | Seg | Seg |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+-------+-------+-------+-------+-------+-------+-------+
|<------- acknowledged -------->|<----- in flight ------>|<-- not yet sent
| |
|<-------- cwnd -------->|
When an ACK arrives, it “slides” the window forward, allowing the sender to transmit more data. The art of congestion control is deciding how fast to grow cwnd and how much to shrink it when loss is detected.
Slow Start
Despite the name, slow start is actually exponential growth. When a TCP connection is new (or recovering from a timeout), the sender does not know the network’s capacity. Starting at full speed could immediately cause congestion. So TCP begins conservatively:
- Set cwnd = IW (initial window, typically 10 segments since RFC 6928).
- For every ACK received, increase cwnd by 1 MSS (maximum segment size).
Each ACK acknowledges one segment and grows cwnd by 1 MSS, so a full window of ACKs arrives every RTT and cwnd effectively doubles each RTT:
Slow Start: Exponential Growth of cwnd
cwnd
(segments)
|
64 | xxxxxxxx
| xxxx
32 | xxxx
| xxxx
16 | xxxxx
| xxxx
8 | xxxxx
| xxxx
4 | xxxxx
| xxxx
2 | xxx
| xx
1 |x
+-----|-----|-----|-----|-----|-----|-----|-----> RTTs
0 1 2 3 4 5 6 7
RTT 0: cwnd = 1, send 1 segment   (trace uses IW = 1 for clarity)
RTT 1: 1 ACK received, cwnd = 2, send 2 segments
RTT 2: 2 ACKs received, cwnd = 4, send 4 segments
RTT 3: 4 ACKs received, cwnd = 8, send 8 segments
...exponential growth until ssthresh
This continues until cwnd reaches ssthresh (slow-start threshold). At that point, TCP transitions to congestion avoidance. The initial ssthresh is typically set to a very large value (effectively infinite), so the first slow start runs until loss is detected.
In the Linux kernel, slow start is implemented in net/ipv4/tcp_cong.c:
u32 tcp_slow_start(struct tcp_sock *tp, u32 acked)
{
	u32 cwnd = min(tp->snd_cwnd + acked, tp->snd_ssthresh);

	acked -= cwnd - tp->snd_cwnd;
	tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);

	return acked;	/* leftover ACKs roll into congestion avoidance */
}
Congestion Avoidance
Once cwnd reaches ssthresh, TCP enters congestion avoidance. Here the growth becomes linear — additive increase:
- For every RTT (i.e., a full window of ACKs), increase cwnd by 1 MSS.
In practice, this is done per-ACK: increase cwnd by MSS * MSS / cwnd for each ACK (which adds up to 1 MSS per RTT).
AIMD: Additive Increase, Multiplicative Decrease
cwnd
| /\ /\ /\
| / \ / \ / \
| / \ / \ / \
| / \ / \ / \
| / \/ \/ \
| / loss! loss! loss!
| /
| / <- slow start
| /
| /
|/
+---------------------------------------------------> time
On loss: cwnd = cwnd / 2 (multiplicative decrease)
After: cwnd += 1/cwnd per ACK (additive increase)
This AIMD (Additive Increase, Multiplicative Decrease) pattern gives TCP its characteristic “sawtooth” shape. It has a beautiful mathematical property: when multiple flows share a bottleneck link, AIMD converges to a fair share for each flow. The multiplicative decrease penalizes large windows proportionally more, while the additive increase gives everyone the same absolute growth.
Detecting Loss
TCP detects congestion through two signals, and they mean different things:
Triple Duplicate ACKs
When a packet is lost but subsequent packets arrive, the receiver sends duplicate ACKs for the last in-order segment. Three duplicate ACKs (4 total identical ACKs) signal that a single segment was lost but the network is still delivering packets. This is considered mild congestion.
Sender Receiver
| |
|------- Seg 1 --------------------->| ACK 2
|------- Seg 2 --------------------->| ACK 3
|------- Seg 3 -----X (lost) |
|------- Seg 4 --------------------->| ACK 3 (dup 1)
|------- Seg 5 --------------------->| ACK 3 (dup 2)
|------- Seg 6 --------------------->| ACK 3 (dup 3) <-- trigger!
| |
| 3 dup ACKs = single loss, network still working
Retransmission Timeout (RTO)
If no ACKs arrive at all and the retransmission timer expires, this signals severe congestion — possibly all packets in flight were lost, or the path is broken.
Signal Severity Response
─────────────────────────────────────────────────────────
3 dup ACKs Mild Fast retransmit + fast recovery
RTO timeout Severe Reset to slow start (cwnd = 1)
Fast Retransmit and Fast Recovery (Reno)
TCP Reno (RFC 5681) introduced two optimizations over the original Tahoe algorithm:
Fast Retransmit
Instead of waiting for the RTO timer to expire, retransmit the lost segment immediately upon receiving 3 duplicate ACKs. This saves an entire RTO worth of idle time.
Fast Recovery
After fast retransmit, instead of dropping cwnd to 1 and entering slow start (which Tahoe does), Reno does:
- Set ssthresh = cwnd / 2.
- Set cwnd = ssthresh + 3 (the 3 dup ACKs represent 3 segments that left the network).
- For each additional duplicate ACK, increment cwnd by 1 (each dup ACK means another segment left the network).
- When the retransmitted segment is ACKed, set cwnd = ssthresh and enter congestion avoidance.
TCP Reno: Fast Recovery vs Tahoe
cwnd
|
| Reno (fast recovery)
W |.........*
| / \
| / \_____ cwnd = W/2, then linear growth
| / \_____
| / \______
| /
| /
| / Tahoe (drop to 1)
1 |./.......\
| \___slow start again___/ linear
+---------------------------------------------------> time
^
loss detected (3 dup ACKs)
The key insight: duplicate ACKs are clock signals. Each one tells the sender that a segment left the network (was received out of order by the receiver). This “ACK clock” keeps the pipe full during recovery.
In the kernel, this logic lives in net/ipv4/tcp_input.c in the function tcp_fastretrans_alert().
TCP CUBIC
Since Linux 2.6.19 (2006), the default congestion control algorithm is CUBIC (RFC 9438). It replaces Reno’s linear increase with a cubic function, making it much more aggressive in probing for bandwidth on high-BDP (bandwidth-delay product) networks.
The Cubic Function
After a loss event, CUBIC records the window size at which loss occurred: W_max. It then grows the window according to:
W(t) = C * (t - K)^3 + W_max
where:
C = scaling factor (0.4 in Linux)
t = time since last loss event
K = cubic_root(W_max * (1 - beta) / C)
beta = multiplicative decrease factor (0.7 for CUBIC, i.e. the window shrinks to 0.7 * W_max on loss)
This produces a distinctive S-shaped curve:
CUBIC Window Growth
cwnd
|
| ___...---*** (probing above W_max)
| __--*
| __--*
W_max|..............*****...........|................
     |            **                | near W_max:
     |          **                  | cautious probing
     |        **
     |      **     concave region
     |    **       (below W_max):
     |   **        fast, then slowing
     |  *
     |*
     +--|------------|--------------|---> time since loss
        0            K              2K
Phase 1 (t < K): Concave growth — fast ramp-up far below W_max, flattening as cwnd approaches W_max
Phase 2 (t ~ K): Plateau — slow, careful probing near W_max
Phase 3 (t > K): Convex growth — accelerating probe above W_max for new capacity
Why CUBIC is Better for High-BDP
Reno’s additive increase adds 1 MSS per RTT. On a 10 Gbps link with 100ms RTT, the BDP is ~83,000 segments. After a single loss halves the window, Reno needs ~41,500 RTTs (69 minutes!) to climb back. CUBIC recovers in seconds because its growth rate depends on absolute time since the loss, not on the RTT.
Linux Kernel Implementation
The CUBIC algorithm lives in net/ipv4/tcp_cubic.c. Key structures:
/* From net/ipv4/tcp_cubic.c */
struct bictcp {
u32 cnt; /* increase cwnd by 1 after ACKs */
u32 last_max_cwnd; /* last maximum cwnd (W_max) */
u32 last_cwnd; /* last cwnd */
u32 last_time; /* time when updated last_cwnd */
u32 bic_origin_point; /* origin point of cubic function */
u32 bic_K; /* time to reach origin point from epoch */
u32 delay_min; /* min delay (for Hystart) */
u32 epoch_start; /* beginning of an epoch */
u32 ack_cnt; /* ACK count */
u32 tcp_cwnd; /* estimated tcp cwnd (for friendliness) */
...
};
The core window calculation in bictcp_update():
static void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
{
...
/* Calculate the cubic function value */
/* W(t) = C * (t - K)^3 + W_max */
offs = ca->bic_K - t; /* |t - K| */
delta = (cube_rtt_scale * offs * offs * offs) >> (10 + 3*BICTCP_HZ);
...
}
CUBIC also maintains a “TCP-friendly” mode: it tracks what standard Reno would do and uses whichever window is larger, ensuring it is never less aggressive than Reno.
BBR (Bottleneck Bandwidth and RTT)
In 2016, Google introduced BBR — a fundamentally different approach to congestion control. While Reno and CUBIC are loss-based (they grow until they cause loss, then back off), BBR is model-based: it continuously estimates the bottleneck bandwidth and minimum RTT, then paces packets to match.
The Problem with Loss-Based Algorithms
Loss-based algorithms fill router buffers before detecting congestion. This causes bufferbloat — excess latency from packets sitting in queues:
Loss-based (CUBIC) Model-based (BBR)
operating point operating point
Throughput Throughput
| ___________ | ___________
| / \ | / \
| / \ | / \
| / \ | * <-- here! \
| / \ | / \
|/ \ |/ \
+----------*----------+-- +---------------------+--
^ inflight inflight
|
here!
(buffer full, loss)
CUBIC operates at the right edge (buffer overflow point).
BBR operates at the left knee (maximum throughput, minimum delay).
BBR’s Model
BBR maintains two estimates:
- BtlBw (bottleneck bandwidth): The maximum delivery rate observed over a sliding window of ~10 RTTs.
- RTprop (round-trip propagation delay): The minimum RTT observed over a sliding window of ~10 seconds.
The optimal in-flight data is:
BDP = BtlBw * RTprop
BBR paces packets at BtlBw rate and caps in-flight data at approximately 2 * BDP (with a gain factor for probing).
BBR State Machine
BBR cycles through four states:
BBR State Machine
+──────────────────────────────────────────────────────────────+
| |
| ┌──────────┐ ┌───────────────┐ ┌──────────────┐ |
| │ STARTUP │────>│ DRAIN │────>│ PROBE_BW │ |
| │ │ │ │ │ │ |
| │ 2/ln2 │ │ Drain queue │ │ Steady state │ |
| │ gain │ │ down to BDP │ │ 8-phase cycle│ |
| └──────────┘ └───────────────┘ └──────┬───────┘ |
| │ |
| │ every |
| │ ~10 sec |
| ▼ |
| ┌──────────────┐ |
| │ PROBE_RTT │ |
| │ │ |
| │ cwnd = 4 │ |
| │ for 200ms │ |
| └──────────────┘ |
+──────────────────────────────────────────────────────────────+
STARTUP: Exponential probing (like slow start) to find BtlBw.
Exit when BtlBw plateaus (3 rounds without 25% increase).
DRAIN: Reduce inflight to match the BDP (empty queues filled
during startup).
PROBE_BW: Steady-state cycling. Alternates pacing gain between
1.25, 0.75, and 1.0 to probe for more bandwidth.
PROBE_RTT: Periodically drains the pipe (cwnd = 4 packets for
200ms) to get a fresh RTprop measurement.
BBR in the Linux Kernel
BBR v1 is at net/ipv4/tcp_bbr.c. BBR v2 is in development and addresses fairness concerns with loss-based flows.
Key structures:
struct bbr {
u32 min_rtt_us; /* min RTT in min_rtt_win_sec window */
u32 min_rtt_stamp; /* timestamp of min_rtt_us */
u32 probe_rtt_done_stamp; /* end time for PROBE_RTT mode */
u32 prior_cwnd; /* prior cwnd upon entering loss recovery */
u32 full_bw; /* recent bw, to estimate if pipe is full */
...
enum bbr_mode mode; /* current mode/state */
...
};
To enable BBR on Linux:
# Check current algorithm
sysctl net.ipv4.tcp_congestion_control
# Enable BBR
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.default_qdisc=fq
How the Linux Kernel Implements It
The Linux TCP stack is a layered design that separates the congestion control algorithm from the core TCP state machine:
Linux TCP Congestion Control Architecture
+─────────────────────────────────────────────────────────+
| Application (send/recv) |
+─────────────────────────────────────────────────────────+
| TCP Core (tcp_output.c, tcp_input.c) |
| - Segmentation, ACK processing, retransmit logic |
| - Calls into congestion ops via function pointers |
+─────────────────────────────────────────────────────────+
| struct tcp_congestion_ops (pluggable interface) |
| .init() - Initialize algorithm state |
| .cong_avoid() - Called on each ACK (grow window) |
| .ssthresh() - Calculate new ssthresh on loss |
| .undo_cwnd() - Undo cwnd reduction (spurious loss) |
| .pkts_acked() - Called with RTT sample |
| .set_state() - Notify of state transitions |
+─────────────────────────────────────────────────────────+
| Registered algorithms: cubic, bbr, reno, vegas, ... |
+─────────────────────────────────────────────────────────+
The congestion ops interface is defined in include/net/tcp.h:
struct tcp_congestion_ops {
/* Required: */
u32 (*ssthresh)(struct sock *sk);
void (*cong_avoid)(struct sock *sk, u32 ack, u32 acked);
/* Optional: */
void (*init)(struct sock *sk);
void (*release)(struct sock *sk);
void (*set_state)(struct sock *sk, u8 new_state);
void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev);
void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
u32 (*undo_cwnd)(struct sock *sk);
...
};
When an ACK arrives, the path through the kernel is:
tcp_v4_rcv() [net/ipv4/tcp_ipv4.c]
-> tcp_rcv_established() [net/ipv4/tcp_input.c]
-> tcp_ack() [net/ipv4/tcp_input.c]
-> tcp_cong_avoid() [net/ipv4/tcp_input.c]
-> icsk->icsk_ca_ops->cong_avoid()
(dispatches to cubic, bbr, etc.)
Key Kernel State Variables
In struct tcp_sock (from include/linux/tcp.h):
struct tcp_sock {
...
u32 snd_cwnd; /* Sending congestion window */
u32 snd_ssthresh; /* Slow start size threshold */
u32 srtt_us; /* Smoothed RTT in microseconds */
u32 mdev_us; /* Medium deviation of RTT */
u32 snd_cwnd_cnt; /* Linear increase counter */
u32 prior_cwnd; /* cwnd before loss */
...
};
Practical Implications
Observing cwnd in Real Time
The ss (socket statistics) tool shows TCP internal state, including cwnd:
$ ss -i dst 10.0.0.1
tcp ESTAB 0 0 10.0.0.2:22 10.0.0.1:54321
cubic wscale:7,7 rto:204 rtt:1.5/0.5 ato:40 mss:1448
cwnd:42 ssthresh:32 bytes_sent:128000 bytes_acked:127500
send 323.5Mbps pacing_rate 387.2Mbps delivery_rate 312.1Mbps
retrans:0/2 rcv_space:29200 minrtt:0.8
Key fields:
- cwnd:42 — current congestion window (segments)
- ssthresh:32 — slow-start threshold
- rtt:1.5/0.5 — smoothed RTT / RTT variance (ms)
- retrans:0/2 — current unresolved retransmits / total retransmits
Tuning for High-BDP Networks
On a 10 Gbps link with 100ms RTT:
BDP = 10 Gbps * 100ms = 1 Gbit = 125 MB
Required socket buffer >= 125 MB to fill the pipe!
Linux tuning parameters:
# Increase maximum socket buffer sizes
sysctl -w net.core.rmem_max=134217728 # 128 MB
sysctl -w net.core.wmem_max=134217728 # 128 MB
# TCP auto-tuning range: min, default, max (bytes)
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"
When BBR Helps
BBR excels in scenarios where loss-based algorithms suffer:
| Scenario | CUBIC | BBR |
|---|---|---|
| High-BDP links (intercontinental) | Slow ramp-up | Fast convergence |
| Bufferbloated paths | Fills buffers, high latency | Avoids buffer filling |
| Random (non-congestion) loss | Treats as congestion, backs off | Ignores non-congest loss |
| Shallow buffers | Under-utilizes link | Matches link rate |
However, BBR v1 has known issues:
- Unfairness to loss-based flows: BBR can consume more than its fair share when competing with CUBIC/Reno.
- High retransmission rates: The probing behavior can cause 1-3% retransmission in some conditions.
- PROBE_RTT stalls: The 200ms drain every 10 seconds can cause throughput dips for latency-sensitive applications.
BBR v2 and v3 (in development) address these by incorporating loss signals and improving fairness.
Choosing an Algorithm
# List available algorithms
sysctl net.ipv4.tcp_available_congestion_control
# Per-socket override (in application code)
setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr", 3);
For most server workloads at scale:
- CUBIC (default): Good general-purpose choice. Well-tested, fair.
- BBR: Better for long-fat pipes, content delivery, and paths with random loss.
- DCTCP: For data centers with ECN-capable switches (marks rather than drops).
Summary
Evolution of TCP Congestion Control
1986 Congestion collapse discovered
1988 Jacobson: Slow Start + Congestion Avoidance (Tahoe)
1990  Fast Retransmit + Fast Recovery (Reno; later codified in RFC 2581)
1994 TCP Vegas (delay-based, not widely deployed)
2004 BIC (aggressive binary search)
2006 CUBIC replaces BIC as Linux default [kernel 2.6.19]
2010 Compound TCP (Windows default)
2016 BBR v1 (Google, model-based) [kernel 4.9]
2019 BBR v2 alpha (adds loss sensitivity)
2022 BBR v3 (improved fairness)
Key Idea:
─────────────────────────────────────────────────────────────
Loss-based: Grow until you break something, then back off.
Model-based: Measure the pipe, send at measured rate.
─────────────────────────────────────────────────────────────
References
- V. Jacobson, “Congestion Avoidance and Control,” ACM SIGCOMM 1988.
- RFC 5681 — TCP Congestion Control (Reno).
- RFC 9438 — CUBIC for Fast and Long-Distance Networks.
- N. Cardwell et al., “BBR: Congestion-Based Congestion Control,” ACM Queue, 2016.
- Linux kernel TCP CUBIC source (net/ipv4/tcp_cubic.c).
- Linux kernel TCP BBR source (net/ipv4/tcp_bbr.c).
- Linux kernel TCP input processing (net/ipv4/tcp_input.c).
- RFC 6928 — Increasing TCP’s Initial Window.
- Bufferbloat project and diagnosis.
- Google BBR v2 progress.