Table of contents
Open Table of contents
- Context
- Pages: The Unit of Caching
- The address_space: Connecting Files to Pages
- The Read Path: What Happens When You Call read()
- The Write Path: Buffered Writes and Dirty Pages
- Writeback: Flushing Dirty Pages to Disk
- Readahead: Predicting Future Reads
- Eviction: What Happens Under Memory Pressure
- O_DIRECT: Bypassing the Page Cache
- Observing the Page Cache
- How the Pieces Fit Together
- Why This Matters for Databases
- References
Context
Every time you read a file, the data does not necessarily come from disk. The Linux kernel keeps recently-accessed file data in RAM — a mechanism called the page cache. If you read the same file twice, the second read is served entirely from memory, costing microseconds instead of milliseconds.
This is not optional or application-controlled. The page cache operates transparently inside the kernel: applications call read() and write() as usual, and the kernel decides whether to serve from cache or go to disk. This single mechanism is why:
- A
grepover a large file is fast the second time you run it. - Databases like PostgreSQL, MySQL, and RocksDB interact heavily with the page cache (or deliberately bypass it with
O_DIRECT). - A machine with 64 GB of RAM might show only 4 GB “used” but 50 GB “cached” — that 50 GB is the page cache.
Application Layer
+------------------+ +------------------+ +------------------+
| PostgreSQL | | grep | | RocksDB |
| (buffered I/O) | | (read()) | | (O_DIRECT or |
| | | | | buffered) |
+--------+---------+ +--------+---------+ +--------+---------+
| | |
| read()/write() | |
v v v
+----------------------------------------------------------------+
| Virtual File System (VFS) |
+----------------------------------------------------------------+
|
v
+----------------------------------------------------------------+
| Page Cache |
| |
| +--------+ +--------+ +--------+ +--------+ +--------+ |
| | page 0 | | page 1 | | page 2 | | page 3 | | page 4 | |
| | (4 KB) | | (4 KB) | | (4 KB) | | (4 KB) | | (4 KB) | |
| +--------+ +--------+ +--------+ +--------+ +--------+ |
| |
+----------------------------------------------------------------+
|
| (cache miss: fetch from disk)
v
+----------------------------------------------------------------+
| Block Layer / Filesystem (ext4, xfs) |
+----------------------------------------------------------------+
|
v
+----------------------------------------------------------------+
| Disk (SSD/HDD) |
+----------------------------------------------------------------+
The page cache makes the entire I/O stack faster without any application changes. Let’s look at how it is organized internally.
Pages: The Unit of Caching
The page cache operates in pages — fixed-size chunks of memory, typically 4 KB on x86-64 (matching the CPU’s memory management unit page size). Every cached file region maps to one or more struct page objects in the kernel:
// include/linux/mm_types.h (simplified)
struct page {
unsigned long flags; // PG_dirty, PG_locked, PG_uptodate, etc.
struct address_space *mapping; // which file this page belongs to
pgoff_t index; // page offset within the file
atomic_t _refcount; // reference count
struct list_head lru; // for LRU eviction lists
// ... many more fields via unions
};
Key flags that control page cache behavior:
| Flag | Meaning |
|---|---|
PG_uptodate | Page contents match what’s on disk (or are newer) |
PG_dirty | Page was modified; must be written back to disk |
PG_locked | Page is being read from disk or written back |
PG_writeback | Page is currently being flushed to disk |
PG_referenced | Page was accessed recently (used by eviction) |
The address_space: Connecting Files to Pages
Each open file (technically, each inode) has an address_space structure that acts as the index for its cached pages:
// include/linux/fs.h (simplified)
struct address_space {
struct inode *host; // the inode this cache belongs to
struct xarray i_pages; // the page index (radix tree / xarray)
atomic_t i_mmap_writable; // count of writable mmap users
gfp_t gfp_mask; // allocation flags for new pages
unsigned long nrpages; // number of cached pages
const struct address_space_operations *a_ops; // filesystem callbacks
// ...
};
The i_pages field is the heart of the lookup. It is an xarray (extended array) — a radix-tree-based data structure that maps a page index (file offset / page size) to the corresponding struct page:
address_space for /var/log/syslog
+--------------------------------------------------+
| i_pages (xarray) |
| |
| index 0 --> struct page (bytes 0-4095) |
| index 1 --> struct page (bytes 4096-8191) |
| index 2 --> (empty -- not cached yet) |
| index 3 --> struct page (bytes 12288-16383) |
| index 4 --> struct page (bytes 16384-20479) |
| ... |
+--------------------------------------------------+
The xarray provides lookup by file offset, where the branching factor is large (64 entries per node), making it effectively constant-time for practical file sizes.
The Read Path: What Happens When You Call read()
When an application calls read(fd, buf, 4096), here is the path through the kernel:
read(fd, buf, 4096)
|
v
vfs_read()
|
v
filemap_read() <-- the page cache read path
|
+-- filemap_get_pages()
| |
| +-- filemap_get_read_batch()
| | |
| | +-- find pages in xarray
| | |
| | +-- CACHE HIT?
| | |
| | yes: return page directly
| | no: allocate new page
| | |
| | v
| | filemap_create_folio()
| | |
| | v
| | filemap_read_folio()
| | |
| | v
| | a_ops->readahead() or a_ops->read_folio()
| | |
| | v
| | submit_bio() --> disk I/O
| | |
| | wait for I/O completion
| | |
| | set PG_uptodate
| |
| +-- return pages[]
|
+-- copy_page_to_iter() <-- copy from kernel page to user buffer
|
v
return bytes_read
The core implementation lives in mm/filemap.c. Here is the simplified lookup logic:
// mm/filemap.c (simplified from filemap_get_pages)
static int filemap_get_pages(struct kiocb *iocb, size_t count,
struct folio_batch *fbatch) {
struct address_space *mapping = iocb->ki_filp->f_mapping;
pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
// Try to find the page in the cache
filemap_get_read_batch(mapping, index, last_index, fbatch);
if (folio_batch_count(fbatch) == 0) {
// Cache miss: allocate a new folio and read from disk
struct folio *folio = filemap_create_folio(iocb, mapping, index);
filemap_read_folio(iocb, mapping->a_ops->read_folio, folio);
}
// Wait for any in-progress I/O to complete
filemap_update_page(iocb, mapping, fbatch);
return 0;
}
A folio (introduced in Linux 5.16) is a wrapper around one or more contiguous pages. It replaces the older struct page interface for most page cache operations, but the underlying concept is the same.
The Write Path: Buffered Writes and Dirty Pages
When an application calls write(), data does not go directly to disk. Instead:
write(fd, data, 4096)
|
v
vfs_write()
|
v
generic_perform_write()
|
+-- 1. Find or allocate page in cache
| grab_cache_page_write_begin()
|
+-- 2. Copy user data into the page
| copy_from_user()
|
+-- 3. Mark the page DIRTY
| set_page_dirty()
|
+-- 4. Return to userspace immediately
|
v
(data is now in RAM only -- not on disk yet)
The write returns to the application before the data hits disk. The kernel tracks dirty pages and flushes them later. This is why a power failure can lose recent writes (and why databases use fsync()).
// mm/page-writeback.c (simplified)
void folio_mark_dirty(struct folio *folio) {
struct address_space *mapping = folio->mapping;
if (!folio_test_dirty(folio)) {
folio_set_dirty(folio);
// Add to the dirty page accounting
__xa_set_mark(&mapping->i_pages, folio->index, PAGECACHE_TAG_DIRTY);
// Update per-zone dirty page counter
account_page_dirtied(folio);
}
}
The xarray uses tags (PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_WRITEBACK) to quickly find dirty pages without scanning the entire tree.
Writeback: Flushing Dirty Pages to Disk
Dirty pages must eventually reach disk. The kernel uses background threads called flusher threads (one per block device) to do this. Writeback is triggered by three conditions:
Writeback Triggers
+---------------------------------------------------+
| |
| 1. Timer expires (dirty_writeback_interval) |
| Default: every 5 seconds |
| /proc/sys/vm/dirty_writeback_centisecs = 500 |
| |
| 2. Dirty ratio exceeded |
| dirty pages > dirty_background_ratio (10%) |
| --> background writeback starts |
| |
| dirty pages > dirty_ratio (20%) |
| --> process BLOCKS until pages are written |
| |
| 3. Explicit sync |
| fsync(), fdatasync(), sync() |
| --> writes specific file's dirty pages NOW |
| |
+---------------------------------------------------+
The writeback worker in mm/page-writeback.c and fs/fs-writeback.c iterates through dirty inodes and their dirty pages:
// fs/fs-writeback.c (simplified)
static long writeback_sb_inodes(struct super_block *sb,
struct bdi_writeback *wb,
struct wb_writeback_work *work) {
while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
// Write dirty pages for this inode
__writeback_single_inode(inode, work);
// Move inode to clean list if no more dirty pages
if (mapping_empty(inode->i_mapping))
list_move(&inode->i_io_list, &wb->b_clean);
}
}
For each dirty page, the filesystem’s writepage() or writepages() callback is invoked, which submits I/O to the block layer. Once the I/O completes, the page is marked clean and the PG_writeback flag is cleared.
Readahead: Predicting Future Reads
Sequential reads are extremely common (reading a log file, scanning a table, streaming video). The kernel’s readahead mechanism detects sequential access and pre-fetches upcoming pages before the application asks for them:
Application reads pages: 0, 1, 2, 3, ...
Without readahead:
read page 0 --> disk I/O --> wait --> deliver
read page 1 --> disk I/O --> wait --> deliver
read page 2 --> disk I/O --> wait --> deliver
With readahead:
read page 0 --> disk I/O for pages 0-15 (prefetch)
read page 1 --> already in cache, instant
read page 2 --> already in cache, instant
...
read page 14 --> trigger ASYNC readahead for pages 16-47
read page 15 --> already in cache
read page 16 --> already in cache (prefetched)
The readahead logic lives in mm/readahead.c. Each file descriptor tracks readahead state:
// include/linux/fs.h
struct file_ra_state {
pgoff_t start; // current readahead window start
unsigned int size; // current readahead window size (pages)
unsigned int async_size; // pages before end to trigger next readahead
unsigned int ra_pages; // max readahead size (default: 256 KB)
// ...
};
The algorithm grows the readahead window exponentially (like TCP slow start) as it gains confidence that access is sequential:
First read: readahead 4 pages
Pattern holds: readahead 8 pages
Still going: readahead 16 pages
Maximum: readahead 64 pages (256 KB default)
controlled by /sys/block/<dev>/queue/read_ahead_kb
If the access pattern becomes random (seeking to a distant offset), readahead resets. This adaptive behavior ensures the prefetch doesn’t waste memory on random workloads.
Eviction: What Happens Under Memory Pressure
The page cache has no fixed size — it grows to fill all available RAM. When the system needs memory (for application allocations or new cache pages), the kernel must evict some cached pages. This is handled by the kswapd daemon and the direct reclaim path.
Linux uses a variant of the LRU (Least Recently Used) algorithm with two lists:
+------------------------------------------------------------+
| Page Cache LRU Lists |
| |
| Active List (hot pages) |
| +------+ +------+ +------+ +------+ +------+ |
| |page A|->|page B|->|page C|->|page D|->|page E| |
| +------+ +------+ +------+ +------+ +------+ |
| |
| Inactive List (eviction candidates) |
| +------+ +------+ +------+ +------+ +------+ |
| |page X|->|page Y|->|page Z|->|page W|->|page V| |
| +------+ +------+ +------+ +------+ +------+ |
| ^ | |
| | evict from tail v |
| | [freed] |
| |
+------------------------------------------------------------+
Promotion: inactive --> active (when accessed again)
Demotion: active --> inactive (when active list is too large)
Eviction: inactive tail pages freed (if clean) or written back (if dirty)
The two-list approach solves a classic problem: a single large sequential scan (like cat hugefile.bin > /dev/null) would evict all hot, frequently-accessed pages from cache. With two lists, pages from the scan enter the inactive list and get evicted quickly without disturbing the active list.
The eviction decision is made in mm/vmscan.c:
// mm/vmscan.c (simplified from shrink_folio_list)
static unsigned int shrink_folio_list(struct list_head *folio_list,
struct scan_control *sc) {
while (!list_empty(folio_list)) {
struct folio *folio = lru_to_folio(folio_list);
if (folio_test_dirty(folio)) {
// Dirty page: must write back before freeing
folio_start_writeback(folio);
mapping->a_ops->writepage(folio, &wbc);
continue; // will be freed after writeback completes
}
if (folio_test_referenced(folio)) {
// Recently accessed: promote back to active list
folio_clear_referenced(folio);
list_move(&folio->lru, &active_list);
continue;
}
// Clean, unreferenced: safe to free
__remove_mapping(mapping, folio);
free_unref_page(folio);
nr_reclaimed++;
}
return nr_reclaimed;
}
Key rules for eviction:
- Clean, unreferenced pages are freed immediately (zero cost).
- Dirty pages must be written back first (writeback I/O cost).
- Referenced pages get a second chance — they are promoted back to the active list.
O_DIRECT: Bypassing the Page Cache
Some applications (notably databases) want to manage their own caching. Opening a file with O_DIRECT bypasses the page cache entirely:
int fd = open("/data/tablespace", O_RDWR | O_DIRECT);
// reads and writes go directly to/from disk
// application must align buffers to 512-byte or 4096-byte boundaries
Why would a database bypass the page cache?
With page cache (double buffering):
+----------+ +------------+ +------+
| InnoDB | -> | Page Cache | -> | Disk |
| Buffer | | (kernel) | | |
| Pool | +------------+ +------+
+----------+
^-- data is copied TWICE in memory (wasteful)
With O_DIRECT:
+----------+ +------+
| InnoDB | -> | Disk |
| Buffer | | |
| Pool | +------+
+----------+
^-- single copy, database controls eviction policy
PostgreSQL uses buffered I/O (relies on the page cache). MySQL/InnoDB uses O_DIRECT by default. RocksDB can use either. Each choice has trade-offs:
| Approach | Pro | Con |
|---|---|---|
| Buffered (page cache) | Simple, good readahead, shared across processes | Double buffering, kernel controls eviction |
| O_DIRECT | No double copy, app controls caching | Must implement own readahead, complex alignment |
Observing the Page Cache
You can see page cache usage in several ways:
# Overall memory breakdown (Cached = page cache)
$ free -h
total used free shared buff/cache available
Mem: 62Gi 4.2Gi 1.1Gi 256Mi 57Gi 57Gi
# Per-file cache status (using fincore or vmtouch)
$ vmtouch /var/log/syslog
Files: 1
Directories: 0
Resident Pages: 1024/1024 4M/4M 100%
# Drop the page cache (useful for benchmarking)
$ echo 3 > /proc/sys/vm/drop_caches
# Tuning dirty page behavior
$ cat /proc/sys/vm/dirty_background_ratio # 10 (start background writeback)
$ cat /proc/sys/vm/dirty_ratio # 20 (block writers)
$ cat /proc/sys/vm/dirty_writeback_centisecs # 500 (flush interval: 5s)
How the Pieces Fit Together
Here is the complete lifecycle of a page in the cache:
1. ALLOCATE 2. FILL 3. SERVE
+------------------+ +------------------+ +------------------+
| alloc_page() | | submit_bio() | | copy_to_user() |
| add to xarray | ---> | wait for I/O | ---> | mark referenced |
| set PG_locked | | set PG_uptodate | | move to active |
+------------------+ | clear PG_locked | +------------------+
+------------------+
|
| (time passes)
v
6. FREE 5. WRITEBACK 4. DEMOTE
+------------------+ +------------------+ +------------------+
| remove from | | submit_bio() | | move to inactive |
| xarray | <--- | set PG_writeback | <--- | (active list too |
| free_page() | | clear PG_dirty | | large, or |
+------------------+ | clear PG_writeback| | memory pressure)|
+------------------+ +------------------+
Why This Matters for Databases
Understanding the page cache explains many database behaviors:
-
Why
fsync()is critical: After a write, data sits in the page cache (RAM). Onlyfsync()guarantees it reaches disk. A database that doesn’tfsync()after commits can lose data on power failure. -
Why cold starts are slow: After a restart, the page cache is empty. The first queries must fetch everything from disk. Some databases (like PostgreSQL’s
pg_prewarm) explicitly load hot data into the cache at startup. -
Why memory matters more than you think: A machine with 256 GB RAM might use only 16 GB for the database process, but the other 240 GB serves as a page cache — effectively a free, kernel-managed read cache for your data files.
-
Why
DROP CACHEShurts: Runningecho 3 > /proc/sys/vm/drop_cacheson a production database server evicts all cached data. The next few minutes of queries will hit disk for every read. -
Why sequential scans can trash performance: A full table scan reads every page sequentially. Without the two-list LRU, this would evict all hot index pages from cache. The inactive list absorbs the scan without disturbing active pages (though very large scans can still cause pressure).
References
- Linux kernel source, page cache read path
mm/filemap.c - Linux kernel source, page writeback
mm/page-writeback.c - Linux kernel source, readahead
mm/readahead.c - Linux kernel source, page reclaim
mm/vmscan.c - Linux kernel source, filesystem writeback
fs/fs-writeback.c - Linux kernel documentation, page cache doc
- LWN article, The multi-generational LRU article
- Brendan Gregg, Linux Page Cache Hit Ratio blog