
System Design - RAM vs. cgroup Limit/Cap vs. RSS


Context

When a machine or container runs out of memory, people often look at three different numbers: host RAM usage, the cgroup memory limit (and the usage charged against it), and per-process RSS.

These numbers are related, but they do not answer the same question.

That mismatch is the source of most memory confusion: a number measured at one scope gets used to answer a question about a different scope.

This post separates the terms clearly and shows how they fit together.

The Short Version

Think of the memory stack like this:

 +--------------------------------------------------------------+
 | Host RAM                                                      |
 | Physical memory on the machine                                |
 | Used by processes, kernel, page cache, tmpfs, buffers, etc.   |
 +-----------------------------+--------------------------------+
                               |
                               v
 +--------------------------------------------------------------+
 | cgroup memory limit / cap                                     |
 | Maximum memory a GROUP of processes is allowed to consume     |
 | Commonly used for containers                                  |
 +-----------------------------+--------------------------------+
                               |
                               v
 +--------------------------------------------------------------+
 | RSS                                                           |
 | Memory pages currently resident in RAM for ONE process        |
 | Reported per process by ps/top/proc                           |
 +--------------------------------------------------------------+

So the rough mental model is: RAM is the whole machine's pool, the cgroup limit is a boundary drawn around a group of processes, and RSS is what one process currently occupies in physical memory.

1. RAM: The Machine’s Physical Memory

RAM is the actual physical memory installed in a machine.

If a host has 64 GiB of RAM, that is the total pool the Linux kernel manages. Every workload on the machine competes for space in that pool: application processes, the kernel itself, page cache, tmpfs, and assorted buffers.

That means “used RAM” is not the same thing as “memory owned by my application.”

On Linux, free -h often looks confusing because it includes page cache and buffers:

free -h

You may see something like:

               total        used        free      shared  buff/cache   available
Mem:            64Gi        50Gi         2Gi       1.2Gi        12Gi        11Gi

This does not mean applications have permanently consumed 50 GiB and only 2 GiB is left. A large part of buff/cache is reclaimable. The kernel uses free RAM aggressively for caching because unused RAM is wasted RAM.

So for host-level health, available is usually more meaningful than free.
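
If you want the raw numbers behind free, you can read /proc/meminfo directly; a minimal check on a reasonably recent kernel might look like:

# MemAvailable is the kernel's estimate of memory that can still be
# handed out without swapping or heavy reclaim; MemTotal is installed RAM.
grep -E '^(MemTotal|MemAvailable):' /proc/meminfo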

2. cgroup Limit / Cap: A Boundary Around a Group

A cgroup (control group) is a Linux kernel mechanism for tracking and limiting resource usage for a group of processes.

For memory, the important idea is simple: memory used by any process in the group is charged to the group, and the group as a whole must stay under its configured limit.

Containers rely on this. A Docker container or Kubernetes pod is usually backed by one or more cgroups.

So if a pod has:

resources:
  limits:
    memory: "8Gi"

that is not “the RSS limit of the main process.” It is closer to:

“All memory charged to this pod’s cgroup should stay under 8 GiB.”

On cgroup v2, the common files are:

/sys/fs/cgroup/memory.current
/sys/fs/cgroup/memory.max
/sys/fs/cgroup/memory.stat

On older cgroup v1 systems, the names are different, for example:

/sys/fs/cgroup/memory/memory.usage_in_bytes
/sys/fs/cgroup/memory/memory.limit_in_bytes
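
As a quick sketch, assuming cgroup v2 and that you are running inside the container, you can compare current usage to the cap directly:

# How close is this cgroup to its cap? memory.max prints "max" when unlimited.
current=$(cat /sys/fs/cgroup/memory.current)
limit=$(cat /sys/fs/cgroup/memory.max)
echo "usage: $current bytes, limit: $limit"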

The important thing is the accounting scope: these counters cover everything charged to the group (anonymous memory from every process in it, page cache, tmpfs and shared memory, and some kernel memory), not just one process's RSS.

That single difference explains many production incidents.
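
To see what the group is actually being charged for, memory.stat breaks the total down; a minimal look on cgroup v2 might be:

# "anon" is roughly heap/stack pages from processes in the group,
# "file" is page cache charged to the group, "shmem" covers tmpfs and shm.
grep -E '^(anon|file|shmem) ' /sys/fs/cgroup/memory.stat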

3. RSS: Resident Set Size

RSS stands for Resident Set Size.

It means the portion of a process’s memory that is currently resident in physical RAM.

You can inspect it with commands like:

ps -o pid,rss,comm -p <pid>
grep VmRSS /proc/<pid>/status

If a process shows:

VmRSS:   3145728 kB

then about 3 GiB of that process’s pages are currently in RAM.

But RSS is often misunderstood.

RSS is not the process's total allocated (virtual) memory, not the container's total memory usage, and not memory owned exclusively by that process.

RSS includes pages that are resident now, including some pages shared with other processes, such as shared libraries or shared mappings. Because of that, summing RSS across processes can overcount memory.

That is why tools like smem also expose PSS (Proportional Set Size), which divides shared pages across processes more fairly.
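
For a single process, /proc/<pid>/smaps_rollup (available on reasonably recent kernels) shows both views side by side; a minimal check:

# Rss counts every resident page; Pss divides shared pages across the
# processes that map them, so Pss <= Rss for the same process.
grep -E '^(Rss|Pss):' /proc/<pid>/smaps_rollup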

The Three Terms Side by Side

Here is the clean comparison:

Term                           Scope                What it means                                       Common command
RAM                            Whole machine        Physical memory on the host                         free -h, vmstat
cgroup memory usage / limit    Group of processes   Memory charged to a container or cgroup,            cat /sys/fs/cgroup/memory.current,
                                                    plus its configured boundary                        cat /sys/fs/cgroup/memory.max
RSS                            One process          Resident pages currently in RAM for that process    ps, top, /proc/<pid>/status

If you remember only one sentence, use this:

RAM is the machine’s pool, cgroup limit is the group’s boundary, and RSS is one process’s in-RAM footprint.

Why RSS and cgroup Memory Do Not Match

Suppose a container has two processes: a main process with about 3.2 GiB of RSS and a sidecar with about 0.3 GiB.

You might expect container memory usage to be about 3.5 GiB. But the cgroup may show 6.0 GiB instead.

Why?

Because cgroup accounting can include much more than the main process RSS: page cache for files the container reads and writes, tmpfs and /dev/shm, memory from every other process in the group, and some kernel memory charged to the cgroup.

Example:

Container cgroup limit: 8.0 GiB

  main process RSS                  3.2 GiB
  sidecar RSS                       0.3 GiB
  page cache charged to cgroup      1.8 GiB
  /dev/shm and tmpfs                0.5 GiB
  other charged memory              0.2 GiB
                                   --------
  cgroup memory.current             6.0 GiB

Nothing is inconsistent here. The numbers are measuring different things.
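
One way to see this gap directly, assuming cgroup v2 and running inside the container, is to compare the sum of per-process RSS with what the cgroup is charged:

# Sum of RSS across all processes visible in the container (ps reports KiB).
ps -e -o rss= | awk '{kb += $1} END {printf "sum of RSS:     %.1f MiB\n", kb/1024}'
# What the cgroup is actually charged (bytes).
awk '{printf "memory.current: %.1f MiB\n", $1/1048576}' /sys/fs/cgroup/memory.current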

Why the Sum of RSS Can Also Be Misleading

Now take a different example.

Three worker processes each map the same 500 MiB shared library and shared memory segment:

worker A RSS = 1.2 GiB
worker B RSS = 1.2 GiB
worker C RSS = 1.2 GiB
sum of RSS   = 3.6 GiB

But a large chunk of those resident pages is shared, so the actual total memory impact may be much lower than 3.6 GiB.

That is why, when you need fair attribution across many processes, PSS is often a better metric than RSS.
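
As a rough sketch, you can compute PSS per process yourself by summing the Pss lines in smaps (smem does essentially this for you):

# PSS per process, largest first; reading other processes' smaps may
# require root inside the container.
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
  awk -v p="$pid" '/^Pss:/ {kb += $2} END {if (kb) printf "%10d kB  pid %s\n", kb, p}' "/proc/$pid/smaps" 2>/dev/null
done | sort -rn | head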

A Practical Container Example

Imagine this machine:

Host RAM: 64 GiB

One Kubernetes pod on it has:

Memory limit: 8 GiB

Inside that pod, a Java main process and a small sidecar are running, and the cgroup is also charged for page cache and tmpfs.

The picture looks like this:

Host RAM = 64 GiB

  +----------------------------------------------------------+
  | Host memory                                              |
  |                                                          |
  |  other workloads                          30.0 GiB       |
  |  this pod's cgroup usage                   7.3 GiB       |
  |  other page cache / kernel                10.7 GiB       |
  |  still available                          16.0 GiB       |
  +----------------------------------------------------------+

  Pod cgroup limit = 8.0 GiB

  +----------------------------------------------------------+
  | Pod memory charged to cgroup                             |
  |                                                          |
  |  Java RSS                                4.5 GiB         |
  |  sidecar RSS                             0.2 GiB         |
  |  page cache                              2.0 GiB         |
  |  tmpfs / shm                             0.6 GiB         |
  |                                          -------         |
  |  total charged                           7.3 GiB         |
  +----------------------------------------------------------+

This pod is close to its own cap even though the machine still has plenty of RAM available.

That is a common production pattern: a pod is OOM-killed at its own cgroup limit while the node still has plenty of available RAM.

The opposite can also happen: every pod stays comfortably under its own limit, yet the node as a whole runs out of memory because the workloads together outgrow the host.

What Usually Triggers OOM in Containers

In containerized systems, the most common failure is not “RSS crossed some magic line.” The more typical story is:

  1. The cgroup’s charged memory keeps growing.
  2. The kernel tries reclaim.
  3. Reclaim is insufficient.
  4. The cgroup exceeds its hard boundary.
  5. The kernel OOM logic kills one or more processes in that cgroup.

So when debugging a container OOM, check the cgroup numbers first, not only the main process RSS.
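
Two quick checks for that, assuming cgroup v2: the cgroup's own OOM counters and the kernel log.

# "oom_kill" counts processes killed because this cgroup hit memory.max.
cat /sys/fs/cgroup/memory.events
# Kernel log entries for OOM kills (host-wide view).
dmesg | grep -iE 'oom|killed process' | tail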

What to Check in Real Incidents

If the question is “Is the machine under memory pressure?”, check host-level RAM:

free -h
vmstat 1

If the question is “Is this container close to its allowed cap?”, check cgroup usage:

cat /sys/fs/cgroup/memory.current
cat /sys/fs/cgroup/memory.max
cat /sys/fs/cgroup/memory.stat

If the question is “Which process inside the container is large?”, check RSS:

ps -e -o pid,rss,comm --sort=-rss | head
top

If the question is “Why doesn’t the sum of processes line up?”, look at shared memory and fair-share tools:

smem -r
cat /proc/<pid>/smaps

A Better Debugging Sequence

A practical order is:

  1. Start with the cgroup limit and current usage.
  2. Check whether page cache, tmpfs, or shared memory is large.
  3. Then inspect per-process RSS.
  4. If multi-process accounting still looks strange, inspect PSS or smaps.
  5. Finally, compare with host-level RAM to see whether this is only a container problem or a node-wide problem.

This order prevents a common mistake: staring at one big process RSS number and assuming it fully explains a container OOM.
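
Put together, a first-pass triage covering the first three steps might look like the sketch below (cgroup v2 paths assumed, run inside the container):

#!/usr/bin/env bash
# Step 1: limit vs current usage for this cgroup.
echo "== cgroup cap and usage =="
cat /sys/fs/cgroup/memory.max /sys/fs/cgroup/memory.current

# Step 2: large non-anonymous charges (page cache, tmpfs/shm).
echo "== page cache / shmem charged to the cgroup =="
grep -E '^(file|shmem) ' /sys/fs/cgroup/memory.stat

# Step 3: biggest processes by RSS.
echo "== largest processes by RSS =="
ps -e -o pid,rss,comm --sort=-rss | head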

Final Mental Model

Use this compact model:

RAM            = capacity of the whole house
cgroup limit   = the maximum space one apartment may occupy
RSS            = the floor space currently occupied by one person

That analogy is imperfect, but it is good enough to keep the scopes straight: the house is the whole host, the apartment boundary is the cgroup, and the person is a single process.

Once you separate scope, most memory dashboards become much easier to read.

References

  1. Linux proc_pid_status man page
  2. Linux proc man page
  3. Linux kernel cgroup v2 admin guide
  4. Linux free man page
  5. Linux ps man page