Kubernetes CPU Throttling: CFS Quotas and Latency Fixes

Subhendu Nayak

Why Kubernetes CPU Throttling Happens Even When CPU Usage Is Low


CPU utilization tells you how much CPU a container used on average. It does not tell you whether the container was throttled. Those are different measurements, and confusing them produces incidents that look impossible on a dashboard.

The alert fires. p99 latency is elevated. You open the CPU graph and every pod sits comfortably below its limit. Nothing looks overloaded. The spike settles after a few minutes, the ticket gets closed as a transient, and the same pattern returns three days later.

In many production environments, the explanation is Kubernetes CPU throttling caused by the Linux CFS quota mechanism. Understanding how it works changes how you interpret every CPU metric in a cluster.

What CPU limits actually control

When you set resources.limits.cpu on a container, you are not capping average utilization. You are telling the Linux kernel how much CPU time the container's cgroup may consume within a fixed enforcement window.

The kernel enforces this through the Completely Fair Scheduler (CFS). Two parameters in the cgroup define the constraint (the names below are the cgroup v1 ones):

  • cpu.cfs_period_us: the length of the enforcement window in microseconds. The default value is 100,000, which is 100ms.
  • cpu.cfs_quota_us: the total CPU time the cgroup is permitted to consume within that window.

Kubernetes calculates the quota directly from your limit. A CPU limit of 500m (0.5 CPU) translates to a quota of 50,000 microseconds: the container may use 50ms of CPU time per 100ms period. The moment it exhausts that budget, the kernel parks the container until the next period begins. That forced pause is throttling.
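
The conversion is simple enough to sketch in a few lines of Python. This illustrates the arithmetic only and is not the kubelet's actual code; the function names are made up for this example, and the parser handles only plain values such as 500m, 0.5, or 2.

```python
CFS_PERIOD_US = 100_000  # default enforcement window: 100ms


def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "500m", "0.5" or "2" to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def cfs_quota_us(cpu_limit: str, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time (microseconds) the cgroup may consume in each period."""
    return parse_cpu_millicores(cpu_limit) * period_us // 1000


print(cfs_quota_us("500m"))  # 50000 -> 50ms of CPU time per 100ms period
print(cfs_quota_us("2"))     # 200000 -> two full CPUs' worth per period
```

A limit above one CPU yields a quota larger than the period itself, which only makes sense because the quota is total CPU time summed across all of the container's threads.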

CPU requests work differently. A request (resources.requests.cpu: 200m) tells the scheduler how much CPU to reserve for the pod on a node, and the kernel uses it as a relative weight when containers compete for CPU. Requests and limits together determine the pod's QoS class. Pods in which every container has CPU and memory requests equal to its limits run as Guaranteed. Pods that set at least one request or limit but fall short of that bar run as Burstable. Pods with no requests or limits at all run as BestEffort. QoS class affects eviction order under node resource pressure. It does not affect whether the container gets throttled at runtime. Only the CPU limit does that.

The CFS scheduler and the 100ms enforcement window

The 100ms default window is short. Short enough that the distinction between average utilization and per-window demand becomes practically important.

A container that needs 80ms of uninterrupted CPU time to complete a unit of work, but holds a 50ms quota per period, will exhaust its budget partway through. The kernel parks it until the next CFS period begins. The wait is not a full 100ms. It is the time remaining in the current period when the quota ran out. A single busy thread cannot spend 50ms of quota in less than 50ms of wall time, so on its own it waits at most 50ms. The quota is shared by every thread in the container, though, so parallel threads drain it faster: five busy threads consume a 50ms quota in 10ms of wall time and then wait 90ms. If the quota runs out 90ms into the period, the wait is only 10ms. The worst case is a heavily multi-threaded container that burns its quota right at the start of a period and waits nearly the full 100ms before the next window opens.
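
The wait arithmetic can be sketched as follows. The helper names are hypothetical, and the model assumes the threads stay busy continuously from the given starting offset. It also relies on the fact that the quota is drained by all runnable threads in the cgroup at once, which is how a 50ms quota can be gone 10ms into a period.

```python
PERIOD_MS = 100.0


def exhaustion_offset_ms(quota_ms: float, busy_threads: int,
                         start_ms: float = 0.0) -> float:
    """Offset into the period at which the quota runs out, assuming
    `busy_threads` threads run without pause from `start_ms`."""
    return start_ms + quota_ms / busy_threads


def throttle_wait_ms(exhausted_at_ms: float,
                     period_ms: float = PERIOD_MS) -> float:
    """Time the cgroup stays parked until the next period begins.
    Returns 0.0 if the period ends before the quota is used up."""
    return max(0.0, period_ms - exhausted_at_ms)


for threads in (1, 5):
    gone_at = exhaustion_offset_ms(quota_ms=50.0, busy_threads=threads)
    print(threads, gone_at, throttle_wait_ms(gone_at))
# 1 thread:  quota gone at 50ms, parked for 50ms
# 5 threads: quota gone at 10ms, parked for 90ms
```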

That waiting time does not appear in utilization graphs. It appears in latency.

The workloads that trigger this pattern most often are JVM-based services during garbage collection, Go runtimes during concurrent GC phases, and any service handling burst traffic where short-lived demand spikes within a single 100ms window. These applications are not continuously busy. Their CPU demand is spiky at sub-second resolution, which is exactly what average utilization metrics fail to capture.

Why low average utilization and high throttle rates coexist

This is the part that catches most engineers off guard the first time they encounter it.

Average CPU utilization is measured over windows far longer than 100ms. A Prometheus scrape interval of 15 or 30 seconds, or a Grafana panel averaging over 5 minutes, tells you nothing about what happened inside a single CFS enforcement period.

Consider an illustrative scenario. A container has a 500m CPU limit, giving it 50ms of CPU time per 100ms window. For most windows, the container is nearly idle. Once every few seconds, it handles a batch of requests and its CPU demand spikes to the equivalent of 0.8 CPU for a brief interval within one period. The container uses its 50ms quota and gets parked. Any request that arrived during that burst waits for the next CFS period to open before the container can resume.

Measured over 5 minutes, average utilization is low. Measured at the CFS window level, every period in which the burst ran into the quota counts as throttled. A user-facing request that arrived at the wrong moment absorbed forced waiting that no utilization graph would show.
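
A toy simulation makes the gap between the two measurements concrete. Everything in it is an assumption chosen for illustration: a single-threaded container, a burst that lasts five consecutive periods (longer than the single-period burst described above) every three seconds, 2ms of background demand per period otherwise, and unfinished work that simply queues for the next period.

```python
def simulate(demand_ms, quota_ms=50.0, period_ms=100.0):
    """Return (average CPUs used, fraction of periods throttled).

    demand_ms holds the new CPU work arriving in each period. Work that
    does not fit in the quota carries over to the next period.
    """
    backlog = 0.0
    used_total = 0.0
    throttled_periods = 0
    for demand in demand_ms:
        pending = backlog + demand
        used = min(pending, quota_ms)
        if pending > quota_ms:      # quota ran out with work still waiting
            throttled_periods += 1
        backlog = pending - used
        used_total += used
    periods = len(demand_ms)
    return used_total / (periods * period_ms), throttled_periods / periods


# One 3-second cycle: 5 bursty periods (0.8 CPU), then 25 quiet ones.
cycle = [80.0] * 5 + [2.0] * 25
demand = cycle * 100                # 3000 periods = 5 minutes

avg_cpus, throttle_ratio = simulate(demand)
print(f"average usage: {avg_cpus:.2f} CPUs ({avg_cpus / 0.5:.0%} of the limit)")
print(f"throttled periods: {throttle_ratio:.1%}")
# average usage: 0.15 CPUs (30% of the limit)
# throttled periods: 26.7%
```

In this model the container averages less than a third of its limit, yet more than a quarter of its periods are throttled, because the backlog from each burst keeps hitting the quota for several periods after the burst ends.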

This pattern also explains why reducing CPU limits on pods with low average utilization can make latency significantly worse. The average headroom is irrelevant. The per-window quota is what the kernel enforces.

What throttling looks like from the application side

The effect depends on what the container was doing when the kernel parked it.

For synchronous request handlers, throttling adds directly to response time. A handler that needs 60ms of CPU work against a 50ms quota, and starts at the beginning of a period, runs for 50ms, waits 50ms for the next period, then finishes the last 10ms: roughly 110ms of wall time for 60ms of work.

For async workers and queue consumers, throttling slows throughput without producing obvious errors. Message-processing rate drops and queue depth grows. The signal is lag, not failures.

For garbage-collected runtimes, throttling during GC extends the pause. Take a GC cycle that requires 90ms of CPU time against a 50ms quota. The container uses the first 50ms of its quota, then waits for the next CFS period to begin before consuming the remaining 40ms. Depending on where in the current period the quota ran out, the added wait could be a few milliseconds or close to the full period. In a scenario where the GC begins at the start of a period, the 90ms of CPU work takes around 140ms of wall time. Heap pressure and connection timeouts can follow.
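
The wall-time figures in this section can be reproduced with a small model. It is a sketch under stated assumptions, not a description of the kernel's implementation: one thread, a full quota available when the work starts, nothing else running in the cgroup, and unused quota that does not carry over between periods. The function name and parameters are made up for this example.

```python
def wall_time_ms(cpu_work_ms, quota_ms=50.0, period_ms=100.0,
                 start_offset_ms=0.0):
    """Wall time needed to finish `cpu_work_ms` of single-threaded work.

    start_offset_ms is how far into a period the work begins
    (0 <= start_offset_ms < period_ms). quota_ms must be positive.
    """
    remaining = cpu_work_ms
    now = start_offset_ms
    period_end = period_ms
    while True:
        # Run until the work is done, the quota is spent, or the period ends.
        run = min(remaining, quota_ms, period_end - now)
        now += run
        remaining -= run
        if remaining <= 0:
            return now - start_offset_ms
        # Parked (or period over): resume at the next boundary, fresh quota.
        now = period_end
        period_end += period_ms


print(wall_time_ms(60.0))                        # 110.0 (request handler)
print(wall_time_ms(90.0))                        # 140.0 (GC cycle)
print(wall_time_ms(60.0, start_offset_ms=50.0))  # 60.0 (no forced wait)
```

The last line shows why the penalty is hard to predict: the same 60ms of work incurs no wait when it happens to start halfway through a period, because the quota refreshes just as it would have run out.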

None of these appear as high CPU utilization. They appear as elevated p99 and p999 latency, increased timeout rates, and growing queue backlogs.

Finding throttling in Prometheus

cAdvisor, which is built into the kubelet on every Kubernetes node, exposes the CFS metrics needed to confirm throttling. In most Kubernetes environments, these metrics are available by default, though exact availability can vary by kubelet version, container runtime, and the specifics of managed distributions. Three metrics are most relevant.

container_cpu_cfs_periods_total counts the total number of elapsed CFS enforcement periods for the container.

container_cpu_cfs_throttled_periods_total counts the periods during which the container was throttled at least once.

container_cpu_usage_seconds_total records the actual CPU time consumed by the container. It is useful alongside the throttle metrics to correlate actual usage with throttling frequency.

The throttle ratio is the key derived signal:

promql
rate(container_cpu_cfs_throttled_periods_total[5m])
/
rate(container_cpu_cfs_periods_total[5m])

This gives the fraction of CFS windows in which the container experienced throttling. Many platform teams use 20 to 25 percent as an initial investigation threshold, not as an industry standard but as a practical starting point. Above 50 percent, throttling is almost certainly a contributing factor to any latency degradation you are seeing.
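
For readers who want to see what the query computes, here is the same ratio worked out by hand from two readings of the counters. The sample values are invented, and the class and function names are hypothetical. Prometheus's rate() also handles counter resets and extrapolation, which this sketch does not attempt.

```python
from typing import NamedTuple, Optional


class CfsSample(NamedTuple):
    periods: float            # container_cpu_cfs_periods_total
    throttled_periods: float  # container_cpu_cfs_throttled_periods_total


def throttle_ratio(earlier: CfsSample, later: CfsSample) -> Optional[float]:
    """Fraction of elapsed CFS periods that were throttled between two
    samples, or None if no periods elapsed or a counter went backwards."""
    d_periods = later.periods - earlier.periods
    d_throttled = later.throttled_periods - earlier.throttled_periods
    if d_periods <= 0 or d_throttled < 0:
        return None
    return d_throttled / d_periods


# Invented readings taken five minutes apart.
before = CfsSample(periods=12_000, throttled_periods=900)
after = CfsSample(periods=15_000, throttled_periods=1_700)

ratio = throttle_ratio(before, after)
print(f"{ratio:.3f}")  # 0.267, just above the 25 percent threshold
```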

A working alert rule:

yaml
- alert: ContainerCPUThrottling
  expr: |
    rate(container_cpu_cfs_throttled_periods_total[5m])
    /
    rate(container_cpu_cfs_periods_total[5m])
    > 0.25
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "CPU throttling above 25% on {{ $labels.container }}"
    description: "Container {{ $labels.container }} in pod {{ $labels.pod }} is throttled in more than 25% of CFS periods."

A note on cgroup v2. Kubernetes support for cgroup v2 has been generally available since version 1.25, and cgroup v2 is the default on Ubuntu 22.04, Amazon Linux 2023, and Fedora 31 and later. As of the Kubernetes 1.31 through 1.34 range that covers most production support windows in mid-2026, cgroup v2 is effectively standard on new cluster deployments. The CFS quota mechanism works the same way under cgroup v2, although the interface files differ: the quota and period live together in cpu.max, and the kernel's throttle counters are in cpu.stat. The throttle ratio calculation and alert above apply to both cgroup v1 and v2 environments.
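
If you want to cross-check the Prometheus numbers on a cgroup v2 node, the kernel's own files carry the same information: cpu.max holds the quota and period, and cpu.stat holds the throttle counters. The parser below is a minimal sketch. The function names are made up, the sample contents are invented, and it deliberately works on text rather than paths, since the location of a container's cgroup directory depends on the runtime and cgroup driver.

```python
def parse_cpu_max(text: str):
    """Parse cpu.max ("<quota|max> <period>", in microseconds).
    Returns (quota_us, period_us); quota_us is None when unlimited."""
    quota, period = text.split()
    return (None if quota == "max" else int(quota)), int(period)


def parse_cpu_stat(text: str) -> dict:
    """Parse cpu.stat's "key value" lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


# Invented file contents for a container with a 500m limit.
cpu_max = "50000 100000\n"
cpu_stat = """usage_usec 45000000
user_usec 30000000
system_usec 15000000
nr_periods 3000
nr_throttled 800
throttled_usec 27000000
"""

quota_us, period_us = parse_cpu_max(cpu_max)
stats = parse_cpu_stat(cpu_stat)
print(quota_us / period_us)                        # 0.5 CPUs
print(stats["nr_throttled"] / stats["nr_periods"])
```

A container with no CPU limit shows "max" in place of the quota, which is why the parser returns None in that case.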

Four ways to address it, and what each trades away

There is no single correct fix. The right approach depends on the workload, the acceptable cost, and the operational constraints of the cluster.

Raise the CPU limit

The most direct option. If the container's burst demand exceeds the per-window quota, raising the limit gives it more CPU time per period. A container whose workload periodically requires 80ms of CPU time per 100ms CFS period but holds a 50ms quota can absorb that burst without waiting across multiple windows once the limit is raised.
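
As a rough sizing aid, the 80ms example works out as below. This is back-of-the-envelope arithmetic with hypothetical helper names. It assumes you already know the CPU time a burst needs and that the burst starts with a full quota, and it ignores any headroom you would add in practice.

```python
import math


def min_limit_millicores(burst_cpu_ms: float, period_ms: float = 100.0) -> int:
    """Smallest limit whose per-period quota covers the burst in one period."""
    return math.ceil(burst_cpu_ms * 1000 / period_ms)


def periods_to_finish(burst_cpu_ms: float, limit_millicores: int,
                      period_ms: float = 100.0) -> int:
    """Number of CFS periods the burst is spread across at a given limit."""
    quota_ms = limit_millicores * period_ms / 1000
    return math.ceil(burst_cpu_ms / quota_ms)


print(min_limit_millicores(80))    # 800 -> a limit of 800m
print(periods_to_finish(80, 500))  # 2: the burst spills into a second period
print(periods_to_finish(80, 800))  # 1: the burst fits in a single period
```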

The cost is resource allocation. Higher limits affect bin packing on nodes, and if requests are set near the limit (a common practice for Guaranteed QoS), they drive up pod resource cost directly. Right-sizing requires monitoring actual burst patterns, not just averages.

Remove the CPU limit entirely

Kubernetes allows containers with no CPU limit. On a node with available capacity, the container can burst freely. The kernel imposes no CFS quota, and throttling disappears for workloads with spiky demand. Members of Kubernetes SIG Node have documented scenarios where removing CPU limits is the appropriate choice for latency-sensitive workloads.

The risks are real and worth stating clearly. Without a limit, multiple containers on the same node can compete for CPU during concurrent burst periods. In multi-tenant clusters, the absence of limits undermines cost governance and fairness guarantees. One noisy workload can degrade neighbors on the same node. Removing limits without continuous monitoring of node-level CPU saturation is not advisable, and many organizations that run shared infrastructure retain limits precisely because of these concerns. This is not a universally recommended approach; it is a viable option in clusters where the implications are understood and managed.

Use the Vertical Pod Autoscaler (VPA)

VPA analyzes historical CPU and memory usage and recommends, or in Auto mode automatically sets, requests and limits based on observed behavior. For workloads where burst demand is consistent but hard to predict manually, VPA removes the guesswork from limit sizing.

The main operational limitation is that VPA in Auto mode evicts and recreates pods to apply new values. For latency-sensitive services, that means planned disruption. Many teams run VPA in Off mode to collect recommendations and apply changes during maintenance windows.

Consider the CPU Manager static policy for performance-critical workloads

For pods running as Guaranteed QoS class, Kubernetes offers a CPU Manager static policy that pins a container to CPUs reserved for its exclusive use. This eliminates CPU scheduling jitter caused by the container sharing CPU time with other processes on the same core. It does not remove the CFS quota enforcement, but it does reduce the variability that makes throttling harder to predict.

CPU Manager static policy is configured at the kubelet level (cpuManagerPolicy: static) and applies only to Guaranteed pods with integer CPU requests. It is most useful for workloads where consistent low latency matters more than efficient resource packing.
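
The eligibility rule is easy to get wrong in manifests (1500m does not qualify, 2000m does), so a quick check can help during review. The sketch below only encodes the rule as stated above. The function names are hypothetical, it takes the pod's QoS class as an input rather than deriving it, and it is not how the kubelet itself makes the decision.

```python
def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "2", "2000m" or "1.5" to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def eligible_for_exclusive_cpus(qos_class: str, cpu_request: str,
                                cpu_limit: str) -> bool:
    """True if a container meets the static policy's stated conditions:
    Guaranteed pod, request equal to limit, whole number of CPUs."""
    if qos_class != "Guaranteed":
        return False
    request = cpu_millicores(cpu_request)
    limit = cpu_millicores(cpu_limit)
    return request == limit and request >= 1000 and request % 1000 == 0


print(eligible_for_exclusive_cpus("Guaranteed", "2", "2"))          # True
print(eligible_for_exclusive_cpus("Guaranteed", "1500m", "1500m"))  # False
print(eligible_for_exclusive_cpus("Burstable", "2", "4"))           # False
```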

Confirm before changing anything

The debugging sequence matters as much as the fix.

Confirm throttling with the Prometheus metrics before adjusting limits. A latency spike with a throttle ratio below 0.10 points elsewhere: a slow downstream dependency, connection pool exhaustion, or memory pressure. Raising CPU limits will not help those cases.

When you do raise a limit, do it in one step and monitor the throttle ratio and p99 latency together. If the ratio drops and latency improves, the diagnosis was correct. If neither changes meaningfully, look elsewhere.

Takeaway

CPU utilization and CPU throttling measure different things. Utilization is an average across long windows. Throttling is an event that happens within a 100ms enforcement period, and it is invisible to most standard dashboards.

The container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total metrics are the most direct way to confirm it from Prometheus. Because cAdvisor already exposes them, adding the throttle ratio to standard observability costs nothing to instrument, and it occasionally explains incidents that have no other obvious cause.
