Why Does kube-state-metrics Fail Application Triage?
Once you confirm a container was OOMKilled, your first instinct is to check the dashboard. You pull up container_memory_usage_bytes, watch the line climb to the limit, then drop. The graph confirms the kill but explains nothing about why memory grew.
That is the fundamental problem with standard Kubernetes metrics for memory triage: they describe the cgroup boundary, not what is happening inside it.
What container_memory_usage_bytes Actually Measures
container_memory_usage_bytes, exposed by cAdvisor and scraped by most Prometheus setups, reports the total memory charged to a container's cgroup. Under cgroup v2, this includes anonymous memory (heap, stack, mmap regions), kernel memory attributed to the container, and page cache from file I/O, all collapsed into a single number.
kube-state-metrics sits one level above this. It surfaces pod metadata: resource requests and limits, restart counts, container states. It is useful for alerting that something happened. It is not a diagnostic tool.
The Internal Allocation Blind Spot
A running process organizes memory into distinct regions: heap (dynamically allocated), stack (per-thread, for function calls), native or off-heap memory (allocated directly via mmap or malloc, often by native libraries), and shared code segments.
Standard exporters have no visibility into this breakdown. They sit outside the process and see only what the kernel reports at the cgroup boundary. Whether your heap has fragmented, whether a thread pool is growing unbounded, or whether a C extension is allocating outside the language runtime: none of that is visible from outside.
Why This Matters in Practice
This gap drives a common triage failure: memory climbs, the engineer raises the container limit, and the problem returns in days. The limit was never the issue; the leak was. But because the tooling only surfaced cgroup totals, the investigation stopped at the wrong layer.
Standard metrics are well suited for detecting that memory is growing, triggering alerts when usage approaches limits, and correlating OOMKill events across pods. They are not suited for identifying which component inside the process is allocating, distinguishing heap growth from page cache accumulation, or confirming whether memory was freed or just marked for reuse.
Answering those questions requires moving inside the process, which is what the following sections address, starting with the two most common runtimes in Kubernetes environments.
Go Memory Management in Containers: RSS Inflation and Fragmentation
Go is the dominant language for Kubernetes workloads; the control plane itself is written in it, as are most operators and sidecars. Its memory model also behaves in ways that routinely mislead engineers during incident triage.
The core issue: Go can report that it freed memory while the kernel and the cgroup counter still show it as consumed.
How Go Manages Memory
Go uses its own allocator between the application and the OS. When the garbage collector (GC) finds objects unreachable, it frees them internally but does not immediately return pages to the OS. A background scavenger releases idle pages later via madvise. With MADV_DONTNEED (the Linux default since Go 1.16), RSS drops only once the scavenger gets to those pages; with MADV_FREE (the default in Go 1.12 through 1.15), the kernel may keep them resident until it comes under memory pressure.
The result: Go's internal heap stats can show significantly less memory than the RSS the kernel reports and the cgroup counts against your limit. This is expected behavior, but in containers with hard limits, the gap is where OOMKills happen.
The GOGC Problem in Containers
GOGC=100 (the default) means GC triggers when the heap has roughly doubled from the live size left after the previous collection. In a container with a tight memory limit, this can fire too late.
Example: live heap stabilizes at 200 MB. With default GOGC, the next GC triggers at 400 MB. If your container limit is 350 MB, the cgroup limit is breached before Go decides to collect.
Go 1.19 introduced GOMEMLIMIT to address exactly this. It gives the runtime a soft ceiling it actively tries to stay under by running GC more aggressively as memory approaches the value. In production, set GOMEMLIMIT roughly 10–15% below your container memory limit to leave headroom for non-heap usage.
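The arithmetic behind the example above can be sketched in a few lines of Python. This is a minimal illustration that assumes the simplified GOGC model described here (next collection at live heap × (1 + GOGC/100)). The function names and the 10% headroom default are hypothetical choices for illustration, not part of the Go runtime.

```python
MIB = 1024 * 1024

def next_gc_target(live_heap_bytes: int, gogc: int = 100) -> int:
    """Approximate heap size at which the next GC cycle starts (simplified model)."""
    return live_heap_bytes * (100 + gogc) // 100

def suggested_gomemlimit(container_limit_bytes: int, headroom_pct: int = 10) -> int:
    """Soft limit placed below the container limit (headroom_pct is an assumption)."""
    return container_limit_bytes * (100 - headroom_pct) // 100

live = 200 * MIB    # live heap after the last collection
limit = 350 * MIB   # container memory limit

target = next_gc_target(live)
print(f"next GC at ~{target // MIB} MiB, container limit {limit // MIB} MiB")
print("limit breached before GC:", target > limit)
print(f"GOMEMLIMIT={suggested_gomemlimit(limit) // MIB}MiB")
```

The last line prints a value in the unit-suffixed form that the GOMEMLIMIT environment variable accepts. Treat the headroom percentage as a starting point to tune against observed non-heap usage.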
RSS Inflation Beyond the Heap
Even with correct GC tuning, RSS can run well above live heap for a few reasons:
Allocator fragmentation: Go's allocator organizes memory into size classes. Mixed allocation patterns leave gaps between live objects. Those pages remain resident even when logically unused.
Goroutine stacks: Each goroutine starts with a small stack (2 KB minimum in current Go releases) that grows as needed. A service holding 10,000 goroutines whose stacks have grown to tens of kilobytes each carries hundreds of megabytes outside the heap, all of which counts toward cgroup usage.
CGO allocations: Memory allocated inside C code linked via CGO is invisible to the Go GC. It does not get collected and still counts toward your container's memory limit.
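To put numbers on the goroutine stack contribution, here is a small back-of-the-envelope sketch. The average stack sizes are assumed values for illustration; real figures come from the runtime's stack metrics for your service.

```python
KIB = 1024
MIB = 1024 * KIB

def goroutine_stack_bytes(goroutines: int, avg_stack_kib: int) -> int:
    """Total stack memory for a given goroutine count and assumed average stack size."""
    return goroutines * avg_stack_kib * KIB

# Assumed average stack sizes: small, moderately grown, heavily grown.
for avg_kib in (2, 8, 32):
    total = goroutine_stack_bytes(10_000, avg_kib)
    print(f"10,000 goroutines x {avg_kib} KiB = {total / MIB:.1f} MiB")
```

The total scales linearly with both goroutine count and stack depth, which is why a goroutine leak combined with deep call chains shows up in RSS long before it shows up in heap metrics.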
Reading the Signals
When a Go service is OOMKilled, compare go_memstats_heap_inuse_bytes against container RSS. A large gap points to fragmentation, goroutine stack growth, or CGO. Heap and RSS growing in lockstep with no GC relief usually indicates a true leak: an unbounded cache, a goroutine leak, or missing CGO cleanup.
The key mental model: RSS is the OS view, heap stats are Go's view, and the cgroup limit is the hard wall. All three can show different numbers simultaneously. Understanding why they diverge is how the actual problem becomes visible.
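The comparison between Go's view and the OS view can be expressed as a simple check. This is a rough sketch: the 40% gap threshold is an arbitrary assumption for illustration, and the function names are hypothetical. The inputs would be go_memstats_heap_inuse_bytes and the container's RSS sampled at the same time.

```python
MIB = 1024 * 1024

def heap_rss_gap(heap_inuse_bytes: int, rss_bytes: int) -> float:
    """Fraction of RSS that is not explained by Go's in-use heap."""
    if rss_bytes <= 0:
        raise ValueError("rss_bytes must be positive")
    return max(0, rss_bytes - heap_inuse_bytes) / rss_bytes

def triage_hint(heap_inuse_bytes: int, rss_bytes: int, gap_threshold: float = 0.4) -> str:
    """Map the gap to a first direction for the investigation (threshold is assumed)."""
    gap = heap_rss_gap(heap_inuse_bytes, rss_bytes)
    if gap >= gap_threshold:
        return f"gap {gap:.0%}: look at fragmentation, goroutine stacks, or CGO"
    return f"gap {gap:.0%}: heap tracks RSS, look for a leak in Go-managed memory"

print(triage_hint(150 * MIB, 400 * MIB))
print(triage_hint(380 * MIB, 400 * MIB))
```

A single sample only gives a direction. The trend of the gap over several scrape intervals is what separates steady fragmentation from a growing leak.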
JVM Containment Realities: Beyond Heap (Native and Metaspace Leakage)
Java in containers has a complicated history. Early JVM versions were not container-aware; they read memory limits from the host, not the cgroup, and sized heap accordingly. Modern JDK versions (17+) handle this correctly by default, using -XX:+UseContainerSupport (enabled by default since JDK 10) to detect cgroup limits and set heap sizes relative to them.
But correct heap sizing is only part of the problem. The more common reason Java workloads get OOMKilled in 2026 is memory that lives entirely outside the heap: memory that -Xmx does not control and that most engineers do not account for when sizing containers.
The JVM Memory Model Inside a Container
When you set -Xmx512m, you are capping only the Java heap, the region where your objects live. The JVM process consumes considerably more memory than that. Inside a container's total cgroup limit, you are actually dealing with:
- Java Heap: controlled by -Xmx
- Metaspace: class metadata and method bytecode. No hard limit by default. (Interned strings have lived on the Java heap since Java 7.)
- Thread stacks: each thread gets a native stack, sized by -Xss (default 512 KB–1 MB depending on JDK and platform)
- Direct ByteBuffers: off-heap memory allocated explicitly by application or library code, controlled by -XX:MaxDirectMemorySize
- JIT compiled code cache: native code generated by the JIT compiler, controlled by -XX:ReservedCodeCacheSize
- GC overhead: internal GC data structures, card tables, and remembered sets, which scale with heap size
- Native libraries: memory allocated by JNI code or native libraries outside JVM control entirely
A container running with -Xmx512m and a 768 MB limit is not leaving 256 MB of headroom; it is hoping that all the above fits in 256 MB. For many workloads, it does not.
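A quick way to sanity-check a container limit is to add up the regions listed above. The sketch below does that for one set of numbers. Every value passed in is an assumed figure for illustration, and the thread stack term is an upper bound because stack pages only become resident once touched.

```python
MIB = 1024 * 1024

def jvm_budget(heap_mib, metaspace_mib, threads, stack_kib,
               direct_mib, code_cache_mib, gc_overhead_pct):
    """Rough upper-bound estimate of JVM memory by region, in bytes."""
    heap = heap_mib * MIB
    return {
        "heap": heap,
        "metaspace": metaspace_mib * MIB,
        "thread_stacks": threads * stack_kib * 1024,
        "direct_buffers": direct_mib * MIB,
        "code_cache": code_cache_mib * MIB,
        "gc_overhead": heap * gc_overhead_pct // 100,
    }

# Hypothetical service: -Xmx512m in a 768 MiB container.
regions = jvm_budget(heap_mib=512, metaspace_mib=96, threads=100, stack_kib=1024,
                     direct_mib=64, code_cache_mib=48, gc_overhead_pct=10)
total = sum(regions.values())
limit = 768 * MIB

for name, size in regions.items():
    print(f"{name:>15}: {size / MIB:7.1f} MiB")
print(f"{'total':>15}: {total / MIB:7.1f} MiB (limit {limit // MIB} MiB)")
print("fits in container:", total <= limit)
```

With these assumed inputs the non-heap regions alone exceed the 256 MiB left above -Xmx, which is the situation described above.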
Metaspace Leaks
Metaspace stores class metadata and is allocated off-heap. Unlike the old PermGen (removed in Java 8), Metaspace grows dynamically without a default cap. In applications that generate classes at runtime (frameworks using reflection heavily, dynamic proxies, bytecode generation via libraries like Byte Buddy, CGLIB, or ASM), Metaspace can grow unbounded.
The failure mode is slow and hard to spot. The heap stays within -Xmx, GC runs normally, and heap metrics look clean. Meanwhile, Metaspace is quietly accumulating class metadata from dynamically generated classes that are never unloaded because their classloaders remain reachable.
Set -XX:MaxMetaspaceSize explicitly in production. A value between 256 MB and 512 MB is reasonable for most services. Without it, the JVM will consume as much native memory as the OS allows until the cgroup disagrees.
Monitor with jcmd <pid> VM.native_memory (the JVM must be started with -XX:NativeMemoryTracking=summary for this to work), or expose jvm_memory_used_bytes{area="nonheap", id="Metaspace"} via Micrometer.
Direct Buffer Leaks
DirectByteBuffer allocates memory outside the Java heap via malloc. It is used extensively by NIO, Netty, Kafka clients, and many other networking libraries. The memory is released when the buffer is garbage collected, but since it lives off-heap, GC pressure does not reflect the true memory cost: the heap stays small while native memory grows.
Leaks here typically occur when direct buffers are allocated faster than they are collected, or when they are held in long-lived references. -XX:MaxDirectMemorySize caps this allocation, and when the cap is hit, a java.lang.OutOfMemoryError: Direct buffer memory is thrown, which is actually preferable to a silent OOMKill, because at least you get a signal.
In practice, many teams do not set this flag. The default cap is then roughly equal to the maximum heap size, so direct buffers alone can add close to another -Xmx of native memory that the container limit never budgeted for. Combined with a Netty-heavy workload under load, this is a reliable path to an OOMKill with a heap that looks completely healthy.
Thread Stack Contribution
Each JVM thread reserves a native stack of roughly 512 KB to 1 MB by default. This is not heap; it is native memory, and the stack pages a thread actually touches count toward your cgroup total. A service under load with 500 active threads reserves 250–500 MB for stacks alone, outside the heap entirely, and deep call chains make a large share of that resident.
Thread pool sizing and stack size (-Xss) should both be explicitly configured. Frameworks that create threads per request (older servlet containers, some blocking RPC frameworks) are particularly susceptible to this under traffic spikes.
Memory Pressure During GC and JIT
Two transient but significant memory events that cause OOMKills in otherwise stable Java services:
GC promotion spikes: During a full GC, the collector may temporarily hold two copies of surviving objects while moving them between regions (particularly in G1 and ZGC). For a short window, memory usage can spike 20–40% above steady-state. If your container limit has no headroom for this, the OOM happens during collection, precisely when you least expect it.
JIT compilation bursts: When a new deployment starts or traffic suddenly shifts to new code paths, the JIT compiler generates native code for newly hot methods. This temporarily inflates the code cache. On startup especially, JIT activity can push memory noticeably above the steady-state baseline.
Both are transient by nature. But transient means they show up as sudden spikes on the cgroup counter, which is all the kernel needs to trigger an OOM event.
What to Monitor
For Java workloads, heap metrics alone are insufficient. A complete picture requires:
| Memory Region | How to Observe |
| --- | --- |
| Java Heap | jvm_memory_used_bytes{area="heap"} |
| Metaspace | jvm_memory_used_bytes{area="nonheap", id="Metaspace"} |
| Direct Buffers | jvm_buffer_memory_used_bytes{id="direct"} |
| Thread count | jvm_threads_live_threads |
| Native total | jcmd <pid> VM.native_memory summary |
The process-level RSS from /proc/<pid>/status combined with VM.native_memory output gives you the most accurate picture of what the JVM is actually consuming versus what the cgroup is charging.
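As a rough sketch of that comparison, the snippet below extracts VmRSS from the text of /proc/<pid>/status and subtracts the committed total reported by Native Memory Tracking. The helper names are hypothetical, the sample text and numbers are made up for illustration, and the NMT figure is passed in by hand rather than parsed from jcmd output.

```python
from pathlib import Path

def parse_vm_rss_kib(status_text: str) -> int:
    """Extract VmRSS (in KiB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    raise ValueError("VmRSS not found")

def read_vm_rss_kib(pid: int) -> int:
    """Read VmRSS for a live process (Linux only)."""
    return parse_vm_rss_kib(Path(f"/proc/{pid}/status").read_text())

def untracked_kib(rss_kib: int, nmt_committed_kib: int) -> int:
    """RSS that the JVM's own accounting does not explain."""
    return rss_kib - nmt_committed_kib

# Made-up sample in the layout of /proc/<pid>/status.
sample_status = "Name:\tjava\nVmPeak:\t 4096000 kB\nVmRSS:\t  734003 kB\nThreads:\t212\n"
rss = parse_vm_rss_kib(sample_status)
print(f"RSS: {rss} KiB, not explained by NMT: {untracked_kib(rss, 650000)} KiB")
```

A large positive remainder suggests memory that the JVM does not track itself, such as allocations made by native libraries through JNI.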
eBPF: The Senior SRE's Tool for Internal Diagnostics
Profiling a process from inside a pod is invasive: attaching a profiler often means restarts, modified images, or elevated permissions. eBPF changes this. It allows you to attach observability logic to a running kernel without modifying the application, stopping the process, or loading kernel modules.
For memory diagnostics specifically, eBPF closes the gap that runtime metrics leave open: it lets you observe allocations at the system call level, in real time, on a live production process.
What eBPF Is and Is Not
eBPF (extended Berkeley Packet Filter) is a kernel subsystem that allows small, sandboxed programs to run inside the Linux kernel in response to events such as system calls, tracepoints, kprobes, and uprobes. These programs are verified before execution, preventing them from crashing the kernel.
For memory diagnostics, the relevant events are:
- malloc / free calls in userspace (via uprobes on libc)
- mmap / munmap system calls
- brk system calls (heap expansion)
- Kernel memory allocation functions (kmalloc, kfree)
eBPF programs can capture the call stack at the moment of allocation, record the size, and track whether the corresponding free ever arrives. This is how you go from "something is leaking" to "this specific function on this call path is leaking."
Using BCC memleak in Production
BCC (BPF Compiler Collection) provides memleak, a ready-made tool that tracks outstanding allocations: memory that was allocated but not freed within a sampling window.
Basic usage against a running process:
```bash
# Attach to a specific PID; -a also lists each outstanding allocation,
# and the trailing 5 is the report interval in seconds
memleak -p <pid> -a 5
```
Output shows allocation stacks with their cumulative unreleased bytes:
```
[11:04:32] Top 10 stacks with outstanding allocations:
    96 bytes in 2 allocations from stack
        alloc_buffer+0x18 [myservice]
        process_request+0x74 [myservice]
        handle_connection+0x1a2 [myservice]
        [libpthread-2.31.so]
```
This output tells you that alloc_buffer, called from process_request, has 96 bytes outstanding across 2 allocations. A single report cannot show growth, so run this over several intervals and watch which stacks keep accumulating: those are your leaks.
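Comparing reports across intervals is easy to automate once the stacks and byte counts are extracted. The sketch below assumes you have already reduced each memleak report to a mapping from a stack label to outstanding bytes. That parsing step, the labels, and the numbers are hypothetical and not part of BCC.

```python
def growing_stacks(snapshots):
    """Return (stack, growth_bytes) for stacks whose outstanding bytes
    strictly increase across every consecutive snapshot."""
    if len(snapshots) < 2:
        return []
    # Only stacks present in every snapshot can be compared end to end.
    common = set(snapshots[0])
    for snap in snapshots[1:]:
        common &= set(snap)
    result = []
    for stack in common:
        series = [snap[stack] for snap in snapshots]
        if all(later > earlier for earlier, later in zip(series, series[1:])):
            result.append((stack, series[-1] - series[0]))
    return sorted(result, key=lambda item: item[1], reverse=True)

# Hypothetical data: three reports taken five seconds apart.
snapshots = [
    {"alloc_buffer<-process_request": 96, "load_config<-main": 512},
    {"alloc_buffer<-process_request": 192, "load_config<-main": 512},
    {"alloc_buffer<-process_request": 480, "load_config<-main": 256},
]
for stack, growth in growing_stacks(snapshots):
    print(f"{stack}: +{growth} bytes outstanding")
```

Stacks that hold a steady amount, such as a configuration loaded once at startup, drop out of the result, leaving only the paths that keep accumulating.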
Production safety considerations:
- memleak has overhead proportional to allocation rate. On services with extremely high malloc frequency, use -z <min_bytes> to filter small allocations and reduce noise.
- Use -s <sample_rate> to trace only every Nth allocation when overhead matters, and -T <count> to limit how many top stacks are printed per report.
- Attach and detach cleanly: BCC tools remove their probes on exit.
- Avoid running on a node already under memory pressure. eBPF programs and BCC compilation itself consume memory.
Using bpftrace for Targeted Allocation Tracing
bpftrace provides a higher-level scripting interface for writing custom eBPF probes. For memory work, it is useful when you have a hypothesis: you suspect a specific library or code path and want to confirm it without waiting for memleak to surface it.
Trace all malloc calls over 1 MB in a specific process:
```bash
bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc
/pid == <target_pid>/
{
  if (arg0 > 1048576) {
    printf("Large malloc: %d bytes\n Stack: %s\n", arg0, ustack);
  }
}'
```
This fires on every malloc call in the target process, filters for allocations above 1 MB, and prints the userspace call stack at the moment of allocation. You see exactly where large allocations originate without touching the application.
For tracking mmap calls (common in Go, JVM, and native library allocations):
```bash
bpftrace -e '
tracepoint:syscalls:sys_enter_mmap
/pid == <target_pid>/
{
  printf("mmap: len=%d, prot=%d\n %s\n", args->len, args->prot, ustack);
}'
```
Reading eBPF Flame Graphs for Memory
When you have captured allocation stacks over a period of time, collapsing them into a flame graph gives you an immediate visual of where memory is being allocated. Tools like FlameGraph (Brendan Gregg's toolkit) can process folded stack output from bpftrace or perf.
In a memory flame graph, width represents cumulative bytes allocated from that code path, not CPU time. A wide bar near the top of a call chain that is not matched by corresponding frees is your leak. It is significantly faster to read than scrolling through raw stack traces, particularly in services with many concurrent allocation paths.
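Flame graph tooling expects "folded" input: one line per unique stack, frames separated by semicolons from root to leaf, followed by a space and a value. The sketch below converts allocation stacks into that form, using bytes as the value. It assumes the stacks were captured leaf-first, as in the memleak output shown earlier. The input data and function name are hypothetical.

```python
from collections import Counter

def fold_stacks(stacks):
    """Collapse (frames_leaf_first, nbytes) pairs into folded-stack lines,
    summing bytes for identical stacks."""
    folded = Counter()
    for frames_leaf_first, nbytes in stacks:
        # Folded format lists the root frame first, so reverse the order.
        key = ";".join(reversed(frames_leaf_first))
        folded[key] += nbytes
    return [f"{stack} {total}" for stack, total in sorted(folded.items())]

# Hypothetical captured allocations.
stacks = [
    (["alloc_buffer", "process_request", "handle_connection"], 96),
    (["alloc_buffer", "process_request", "handle_connection"], 64),
    (["read_config", "main"], 32),
]
for line in fold_stacks(stacks):
    print(line)
```

Writing these lines to a file and passing it to flamegraph.pl from the FlameGraph toolkit renders the graph, with bar width proportional to the byte totals.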
When to Reach for eBPF
eBPF is not the first tool in the investigation. It requires kernel version 4.9+ (ideally 5.8+ for full BTF support), root privileges or, on newer kernels, CAP_BPF, and familiarity with the tooling. It also adds overhead, which matters on nodes already under pressure.
Reach for it when:
- Runtime metrics show memory growing but cannot point to the cause
- The process is a black box: no profiling endpoint, no debug flags
- The leak appears in native code or a library where application-level instrumentation has no reach
- You need to confirm a hypothesis with direct evidence rather than inference
What eBPF gives you that nothing else does is attribution at the allocation site: not a metric showing that memory grew, but the exact call stack that allocated the memory that never came back. That distinction is the difference between guessing and knowing.
Page Cache Accounting Under Cgroup v2
This section is specifically about application behavior and how your service's own file I/O accumulates memory inside its cgroup boundary. Node-level page cache dynamics and memory pressure propagation were covered in Part 1. The focus here is narrower: why an application with low heap usage can still get OOMKilled because of what it writes or reads.
How Cgroup v2 Accounts for Page Cache
Under cgroup v2, page cache is included in a container's memory accounting by default. When your application reads from disk or writes to a file, the kernel stores that data in page cache and charges it to the cgroup that triggered the I/O.
This is different from anonymous memory (heap, stack). Page cache is reclaimable: the kernel can drop it under pressure without data loss. But "reclaimable" does not mean "free." The kernel reclaims page cache on a best-effort basis. Under certain conditions, it may not reclaim fast enough before the cgroup limit is hit, and the OOM killer fires.
The key cgroup v2 memory metric to understand here is memory.current, which includes both anonymous memory and page cache. A container can have 200 MB of live heap, 300 MB of page cache from log writes, and a 512 MB limit, and still get killed, even though active working memory is well within bounds.
The Application Patterns That Trigger This
Not all applications accumulate page cache equally. The ones that reliably hit this in production share a common trait: high-throughput, continuous file I/O within the container boundary.
Log-heavy applications: Services that write structured logs at high volume (debug logging left on in production, audit logs, access logs from high-traffic endpoints) accumulate page cache rapidly. The log file sits on a container-local path or a mounted volume. Every write goes through the page cache. If the application writes faster than the kernel flushes dirty pages to disk, page cache grows inside the cgroup.
Local file processing: Batch jobs or data pipelines that read large files, process them, and write output locally keep both the input and output in page cache simultaneously. A job reading a 400 MB input file and writing a 300 MB output, inside a container with a 1 GB limit, can be consuming 700 MB in page cache before the application heap is even considered.
Excessive fsync avoidance: Applications that buffer writes and flush infrequently let dirty page cache accumulate. The kernel has dirty page limits at the system level, but within a cgroup, the application's own dirty pages count against its limit before the system-level writeback kicks in.
Why the OOMKill Looks Confusing
The symptom that makes this pattern hard to diagnose: heap metrics look normal. go_memstats_heap_inuse_bytes is fine. JVM heap usage is within bounds. The application is not leaking. But container_memory_usage_bytes is at the limit, and the pod is killed.
The signal to look for is the gap between memory.current and the anon field in the cgroup's memory.stat file. The anon field reports only anonymous memory: heap, stack, and private mmap regions. If memory.current significantly exceeds anon, page cache is the most likely difference.
From inside the container or via kubectl exec:
```bash
cat /sys/fs/cgroup/memory.stat | grep -E 'anon|file|cache'
```
Key fields:
- anon: anonymous memory (heap, stack, mmap private)
- file: page cache (file-backed memory, includes logs and reads)
- active_file / inactive_file: file cache split by recency; inactive_file is the most readily reclaimable
If file is large and inactive_file is a significant portion of it, you have reclaimable page cache sitting at your cgroup boundary. The kernel should reclaim this under pressure, but if I/O is continuous and the container limit is tight, the reclaim path may not keep pace with the write rate.
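The same check can be scripted. The sketch below parses memory.stat text into a dictionary and reports how much of the anon plus file total is page cache. The sample values are made up to mirror the 200 MB heap / 300 MB page cache scenario, and the function names are hypothetical.

```python
def parse_memory_stat(text: str) -> dict:
    """Parse cgroup v2 memory.stat content ('key value' per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats

def page_cache_share(stats: dict) -> float:
    """Fraction of anon + file memory that is file-backed page cache."""
    total = stats["anon"] + stats["file"]
    return stats["file"] / total if total else 0.0

# Made-up sample; on a live system read /sys/fs/cgroup/memory.stat instead.
sample = """anon 209715200
file 314572800
active_file 62914560
inactive_file 251658240
"""
stats = parse_memory_stat(sample)
print(f"page cache share: {page_cache_share(stats):.0%}")
print(f"inactive share of file: {stats['inactive_file'] / stats['file']:.0%}")
```

A high page cache share with mostly inactive_file points at I/O patterns rather than a heap leak, which changes where the investigation should go next.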
Mitigations in Production
Set memory.high in cgroup v2: This is a soft throttling limit below memory.max. When memory.current exceeds memory.high, the kernel puts the cgroup under heavy reclaim pressure (page cache included) and throttles its processes instead of killing them. This gives the kernel time to clean up before the hard limit is hit. Kubernetes does not expose memory.high in the pod spec; unless your cluster enables the MemoryQoS feature gate, which sets it on your behalf, the practical equivalent is configuring memory limits with enough gap above the working set that the kernel has room to act.
Use O_DIRECT for large sequential reads: Applications doing bulk file reads that do not benefit from caching can open files with O_DIRECT to bypass the page cache entirely. Not always appropriate, but effective for batch workloads reading large input files once.
Control log verbosity at runtime: Debug logging in production is a common culprit. Structured logging frameworks that support runtime log level changes (without restart) let you dial back verbosity under pressure without redeployment.
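Mount log paths on ephemeral volumes with size limits: Writing logs to an emptyDir volume with a sizeLimit keeps them outside the main container's writable layer and bounds how much log data can accumulate. Writes to a disk-backed emptyDir still pass through the page cache and are charged to the writing container's cgroup, so pair this with log rotation or reduced verbosity rather than relying on it alone.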
The broader takeaway: if your application does significant file I/O, heap sizing alone is not enough. You need to account for the page cache footprint your I/O pattern generates, and either size the container limit to accommodate it or actively manage I/O behavior to reduce it.
Conclusion
Part 2 has been about moving past the metric that tells you memory grew, to the evidence that tells you why.
Standard observability stops at the cgroup boundary. Go creates a gap between what the runtime considers freed and what the kernel still holds. The JVM carries significant memory outside the heap that -Xmx never touches. eBPF attributes allocations to the exact call path responsible. Page cache from your own application's I/O can trigger an OOMKill even when heap looks perfectly healthy.
The pattern across all of it is the same: stop adjusting limits based on graphs, and start profiling based on what the process is actually doing.
Up next: Part 3, Prevention and Platform Architecture
You have detected the kill. You have diagnosed the code. Part 3 moves to the decisions that stop it from recurring: VPA realities, QoS classes as architectural drivers, Karpenter interactions, and whether your workload design is the reason your memory limits keep getting challenged in the first place.

