
System Design - RAM vs. cgroup Limit/Cap vs. RSS


Context

When a machine or container runs out of memory, people often look at three different numbers: host RAM usage, the cgroup memory limit (and the usage charged against it), and per-process RSS.

These numbers are related, but they do not answer the same question.

That mismatch is the source of most memory confusion: a number measured at one scope gets used to answer a question about a different scope.

This post separates the terms clearly and shows how they fit together.

The Short Version

Think of the memory stack like this:

 +--------------------------------------------------------------+
 | Host RAM                                                      |
 | Physical memory on the machine                                |
 | Used by processes, kernel, page cache, tmpfs, buffers, etc.   |
 +-----------------------------+--------------------------------+
                               |
                               v
 +--------------------------------------------------------------+
 | cgroup memory limit / cap                                     |
 | Maximum memory a GROUP of processes is allowed to consume     |
 | Commonly used for containers                                  |
 +-----------------------------+--------------------------------+
                               |
                               v
 +--------------------------------------------------------------+
 | RSS                                                           |
 | Memory pages currently resident in RAM for ONE process        |
 | Reported per process by ps/top/proc                           |
 +--------------------------------------------------------------+

So the rough mental model is: RAM is the whole machine's pool, the cgroup limit is a boundary drawn around a group of processes, and RSS is what one process currently occupies in physical memory.

1. RAM: The Machine’s Physical Memory

RAM is the actual physical memory installed in a machine.

If a host has 64 GiB of RAM, that is the total pool the Linux kernel manages. Every workload on the machine competes for space in that pool: application processes, the kernel itself, page cache, tmpfs, and assorted buffers.

That means “used RAM” is not the same thing as “memory owned by my application.”

On Linux, free -h often looks confusing because it includes page cache and buffers:

free -h

You may see something like:

               total        used        free      shared  buff/cache   available
Mem:            64Gi        50Gi         2Gi       1.2Gi        12Gi        11Gi

This does not mean applications have permanently consumed 50 GiB and only 2 GiB is left. A large part of buff/cache is reclaimable. The kernel uses free RAM aggressively for caching because unused RAM is wasted RAM.

So for host-level health, available is usually more meaningful than free.
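
If you want the raw numbers behind free, you can read /proc/meminfo directly; a minimal check on a reasonably recent kernel might look like:

# MemAvailable is the kernel's estimate of memory that can still be
# handed out without swapping or heavy reclaim; MemTotal is installed RAM.
grep -E '^(MemTotal|MemAvailable):' /proc/meminfo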

2. cgroup Limit / Cap: A Boundary Around a Group

A cgroup (control group) is a Linux kernel mechanism for tracking and limiting resource usage for a group of processes.

For memory, the important idea is simple: memory used by any process in the group is charged to the group, and the group as a whole must stay under its configured limit.

Containers rely on this. A Docker container or Kubernetes pod is usually backed by one or more cgroups.

So if a pod has:

resources:
  limits:
    memory: "8Gi"

that is not “the RSS limit of the main process.” It is closer to:

“All memory charged to this pod’s cgroup should stay under 8 GiB.”

On cgroup v2, the common files are:

/sys/fs/cgroup/memory.current
/sys/fs/cgroup/memory.max
/sys/fs/cgroup/memory.stat

On older cgroup v1 systems, the names are different, for example:

/sys/fs/cgroup/memory/memory.usage_in_bytes
/sys/fs/cgroup/memory/memory.limit_in_bytes
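
As a quick sketch, assuming cgroup v2 and that you are running inside the container, you can compare current usage to the cap directly:

# How close is this cgroup to its cap? memory.max prints "max" when unlimited.
current=$(cat /sys/fs/cgroup/memory.current)
limit=$(cat /sys/fs/cgroup/memory.max)
echo "usage: $current bytes, limit: $limit"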

The important thing is the accounting scope: these counters cover everything charged to the group (anonymous memory from every process in it, page cache, tmpfs and shared memory, and some kernel memory), not just one process's RSS.

That single difference explains many production incidents.
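
To see what the group is actually being charged for, memory.stat breaks the total down; a minimal look on cgroup v2 might be:

# "anon" is roughly heap/stack pages from processes in the group,
# "file" is page cache charged to the group, "shmem" covers tmpfs and shm.
grep -E '^(anon|file|shmem) ' /sys/fs/cgroup/memory.stat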

3. RSS: Resident Set Size

RSS stands for Resident Set Size.

It means the portion of a process’s memory that is currently resident in physical RAM.

You can inspect it with commands like:

ps -o pid,rss,comm -p <pid>
grep VmRSS /proc/<pid>/status

If a process shows:

VmRSS:   3145728 kB

then about 3 GiB of that process’s pages are currently in RAM.

But RSS is often misunderstood.

RSS is not the process's total allocated (virtual) memory, not the container's total memory usage, and not memory owned exclusively by that process.

RSS includes pages that are resident now, including some pages shared with other processes, such as shared libraries or shared mappings. Because of that, summing RSS across processes can overcount memory.

That is why tools like smem also expose PSS (Proportional Set Size), which divides shared pages across processes more fairly.
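
For a single process, /proc/<pid>/smaps_rollup (available on reasonably recent kernels) shows both views side by side; a minimal check:

# Rss counts every resident page; Pss divides shared pages across the
# processes that map them, so Pss <= Rss for the same process.
grep -E '^(Rss|Pss):' /proc/<pid>/smaps_rollup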

The Three Terms Side by Side

Here is the clean comparison:

Term                           Scope                What it means                                       Common command
RAM                            Whole machine        Physical memory on the host                         free -h, vmstat
cgroup memory usage / limit    Group of processes   Memory charged to a container or cgroup,            cat /sys/fs/cgroup/memory.current,
                                                    plus its configured boundary                        cat /sys/fs/cgroup/memory.max
RSS                            One process          Resident pages currently in RAM for that process    ps, top, /proc/<pid>/status

If you remember only one sentence, use this:

RAM is the machine’s pool, cgroup limit is the group’s boundary, and RSS is one process’s in-RAM footprint.

Why RSS and cgroup Memory Do Not Match

Suppose a container has two processes: a main process with about 3.2 GiB of RSS and a sidecar with about 0.3 GiB.

You might expect container memory usage to be about 3.5 GiB. But the cgroup may show 6.0 GiB instead.

Why?

Because cgroup accounting can include much more than the main process RSS: page cache for files the container reads and writes, tmpfs and /dev/shm, memory from every other process in the group, and some kernel memory charged to the cgroup.

Example:

Container cgroup limit: 8.0 GiB

  main process RSS                  3.2 GiB
  sidecar RSS                       0.3 GiB
  page cache charged to cgroup      1.8 GiB
  /dev/shm and tmpfs                0.5 GiB
  other charged memory              0.2 GiB
                                   --------
  cgroup memory.current             6.0 GiB

Nothing is inconsistent here. The numbers are measuring different things.
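
One way to see this gap directly, assuming cgroup v2 and running inside the container, is to compare the sum of per-process RSS with what the cgroup is charged:

# Sum of RSS across all processes visible in the container (ps reports KiB).
ps -e -o rss= | awk '{kb += $1} END {printf "sum of RSS:     %.1f MiB\n", kb/1024}'
# What the cgroup is actually charged (bytes).
awk '{printf "memory.current: %.1f MiB\n", $1/1048576}' /sys/fs/cgroup/memory.current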

Why the Sum of RSS Can Also Be Misleading

Now take a different example.

Three worker processes each map the same 500 MiB shared library and shared memory segment:

worker A RSS = 1.2 GiB
worker B RSS = 1.2 GiB
worker C RSS = 1.2 GiB
sum of RSS   = 3.6 GiB

But a large chunk of those resident pages is shared, so the actual total memory impact may be much lower than 3.6 GiB.

That is why, when you need fair attribution across many processes, PSS is often a better metric than RSS.
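
As a rough sketch, you can compute PSS per process yourself by summing the Pss lines in smaps (smem does essentially this for you):

# PSS per process, largest first; reading other processes' smaps may
# require root inside the container.
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
  awk -v p="$pid" '/^Pss:/ {kb += $2} END {if (kb) printf "%10d kB  pid %s\n", kb, p}' "/proc/$pid/smaps" 2>/dev/null
done | sort -rn | head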

A Practical Container Example

Imagine this machine:

Host RAM: 64 GiB

One Kubernetes pod on it has:

Memory limit: 8 GiB

Inside that pod, a Java main process and a small sidecar are running, and the cgroup is also charged for page cache and tmpfs.

The picture looks like this:

Host RAM = 64 GiB

  +----------------------------------------------------------+
  | Host memory                                              |
  |                                                          |
  |  other workloads                          30.0 GiB       |
  |  this pod's cgroup usage                   7.3 GiB       |
  |  other page cache / kernel                10.7 GiB       |
  |  still available                          16.0 GiB       |
  +----------------------------------------------------------+

  Pod cgroup limit = 8.0 GiB

  +----------------------------------------------------------+
  | Pod memory charged to cgroup                             |
  |                                                          |
  |  Java RSS                                4.5 GiB         |
  |  sidecar RSS                             0.2 GiB         |
  |  page cache                              2.0 GiB         |
  |  tmpfs / shm                             0.6 GiB         |
  |                                          -------         |
  |  total charged                           7.3 GiB         |
  +----------------------------------------------------------+

This pod is close to its own cap even though the machine still has plenty of RAM available.

That is a common production pattern: a pod is OOM-killed at its own cgroup limit while the node still has plenty of available RAM.

The opposite can also happen: every pod stays comfortably under its own limit, yet the node as a whole runs out of memory because the workloads together outgrow the host.

What Usually Triggers OOM in Containers

In containerized systems, the most common failure is not “RSS crossed some magic line.” The more typical story is:

  1. The cgroup’s charged memory keeps growing.
  2. The kernel tries reclaim.
  3. Reclaim is insufficient.
  4. The cgroup exceeds its hard boundary.
  5. The kernel OOM logic kills one or more processes in that cgroup.

So when debugging a container OOM, check the cgroup numbers first, not only the main process RSS.
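
Two quick checks for that, assuming cgroup v2: the cgroup's own OOM counters and the kernel log.

# "oom_kill" counts processes killed because this cgroup hit memory.max.
cat /sys/fs/cgroup/memory.events
# Kernel log entries for OOM kills (host-wide view).
dmesg | grep -iE 'oom|killed process' | tail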

What to Check in Real Incidents

If the question is “Is the machine under memory pressure?”, check host-level RAM:

free -h
vmstat 1

If the question is “Is this container close to its allowed cap?”, check cgroup usage:

cat /sys/fs/cgroup/memory.current
cat /sys/fs/cgroup/memory.max
cat /sys/fs/cgroup/memory.stat

If the question is “Which process inside the container is large?”, check RSS:

ps -e -o pid,rss,comm --sort=-rss | head
top

If the question is “Why doesn’t the sum of processes line up?”, look at shared memory and fair-share tools:

smem -r
cat /proc/<pid>/smaps

A Better Debugging Sequence

A practical order is:

  1. Start with the cgroup limit and current usage.
  2. Check whether page cache, tmpfs, or shared memory is large.
  3. Then inspect per-process RSS.
  4. If multi-process accounting still looks strange, inspect PSS or smaps.
  5. Finally, compare with host-level RAM to see whether this is only a container problem or a node-wide problem.

This order prevents a common mistake: staring at one big process RSS number and assuming it fully explains a container OOM.
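
Put together, a first-pass triage covering the first three steps might look like the sketch below (cgroup v2 paths assumed, run inside the container):

#!/usr/bin/env bash
# Step 1: limit vs current usage for this cgroup.
echo "== cgroup cap and usage =="
cat /sys/fs/cgroup/memory.max /sys/fs/cgroup/memory.current

# Step 2: large non-anonymous charges (page cache, tmpfs/shm).
echo "== page cache / shmem charged to the cgroup =="
grep -E '^(file|shmem) ' /sys/fs/cgroup/memory.stat

# Step 3: biggest processes by RSS.
echo "== largest processes by RSS =="
ps -e -o pid,rss,comm --sort=-rss | head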

Final Mental Model

Use this compact model:

RAM            = capacity of the whole house
cgroup limit   = the maximum space one apartment may occupy
RSS            = the floor space currently occupied by one person

That analogy is imperfect, but it is good enough to keep the scopes straight: the house is the whole host, the apartment boundary is the cgroup, and the person is a single process.

Once you separate scope, most memory dashboards become much easier to read.

References

  1. Linux proc_pid_status man page
  2. Linux proc man page
  3. Linux kernel cgroup v2 admin guide
  4. Linux free man page
  5. Linux ps man page