Karpenter delivers the biggest savings in clusters with 50+ nodes and unpredictable workloads. For smaller, stable clusters with fewer than 20 nodes, the migration effort often outweighs the savings.
It starts on a normal Tuesday morning. An alert fires because the AWS bill has jumped over the last three days.
The team checks the numbers. Traffic is flat. No new features were released. Yet the main EKS cluster is suddenly costing much more.
After investigating, they find the culprit: a small change to a background worker deployment added a strict pod anti-affinity rule. The kube-scheduler obeyed the rule, spreading pods across separate nodes. The Cluster Autoscaler responded by launching dozens of new nodes.
The pods used only about 10% of each node's capacity, leaving the remaining 90% idle. The applications ran perfectly, but the company was paying for a huge amount of unused infrastructure.
This is a common Kubernetes problem. The biggest cost issue is often not the resources applications need, but the unused capacity trapped inside poorly utilized nodes.
That's why Karpenter is attracting so much attention. By improving how capacity is provisioned and consolidated, it can reduce this waste significantly. The question is: when do the savings become large enough to justify the migration effort?
Why EKS Managed Node Groups Waste Money
To understand why Karpenter saves money, we first have to look at why the old way wastes it. Each EKS Managed Node Group is backed by an EC2 Auto Scaling Group, which in practice is usually a fixed pool of identical servers. That model was designed for simple, predictable setups, not for modern container workloads whose shape changes constantly.
The Problem with Fixed Server Sizes
When an engineer sets up an EKS Managed Node Group, they usually pick one specific server size. For example, they might pick a large server with 4 CPUs and 16 GB of memory. Every single server that gets added to this group will be exactly that size.
If your website gets a spike in traffic, Kubernetes asks for more pods. If your current servers are full, the Cluster Autoscaler steps in. The problem is that the Cluster Autoscaler can only choose from the node groups you configured earlier. It does not dynamically search across AWS for the most cost-efficient server size for those new pods. Instead, it launches another server from one of the predefined node groups.
Buying Too Much Space
Imagine three small pods need to run. Together, they only need 1.5 CPUs and 3 GB of memory.
The Cluster Autoscaler buys a brand-new large server (4 CPUs and 16 GB of memory) just for them. The three pods go onto the new server, and the app handles the traffic spike perfectly.
But financially, this is terrible. You just bought a large server for a very small job. The pods use 1.5 of 4 CPUs and 3 of 16 GB of memory, so about 62% of the CPU and 81% of the memory sit idle. You pay the full monthly price for the server, but roughly 60% of that money buys nothing. Multiply this mistake across 100 or 200 servers and you end up with a massive cloud bill.
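The arithmetic behind that waste is easy to check. The short Python sketch below uses only the numbers from the example above (three pods needing 1.5 CPUs and 3 GB in total, on a 4 CPU / 16 GB server). The function name `idle_fraction` is made up for illustration.

```python
def idle_fraction(requested: float, capacity: float) -> float:
    """Return the share of a node's capacity that nothing is using."""
    return 1 - requested / capacity


cpu_idle = idle_fraction(requested=1.5, capacity=4)   # CPUs
mem_idle = idle_fraction(requested=3, capacity=16)    # GB of memory

print(f"Idle CPU:    {cpu_idle:.2%}")
print(f"Idle memory: {mem_idle:.2%}")
```

CPU is the better-used resource here, and it is still more than 60% idle.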
The Waiting Game Penalty
The old Cluster Autoscaler can also be slow. Depending on the cluster setup, it may take several minutes to launch a new server and get it ready.
For busy websites, a 4-minute wait is too long. The website will crash or slow down while waiting for the server. To stop this from happening, engineers run extra, empty servers all the time, just so they have space ready immediately. This "safety buffer" is basically a very expensive monthly insurance policy.
The Hidden Cost Nobody Notices: The Kubernetes Tetris Problem
Kubernetes scheduling is like playing a never-ending game of Tetris. Pods are the blocks, and servers are the board. When blocks drop in messy ways over time, you get awkward gaps.
Imagine you have three servers. Over a week, pods come and go. Eventually, Server A is about 62% full, Server B is 50% full, and Server C is about 62% full.
On paper, you have plenty of free CPU space left. A manager thinks you have room for more apps.
Suddenly, a large database pod needs to start. It needs a big chunk of CPU. Even though you have enough free space in total, the space is chopped up across three different servers. The large pod cannot fit into any of the small gaps.
So, what does the Cluster Autoscaler do? If the large pod cannot fit onto any existing node, it scales one of the available node groups and launches another server.
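Now you are paying for four big servers, and those messy gaps on the first three servers stay empty. The Cluster Autoscaler does not repack them: it only removes a node when that node's utilization drops below a threshold (50% by default) and its pods fit elsewhere. Poorly packed servers that sit at or above that threshold keep running, draining your budget every single hour.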
| Node Name | Total CPU Capacity | Allocated CPU | Idle CPU | Node Utilization |
|---|---|---|---|---|
| Node A | 4 vCPU | 2.5 vCPU | 1.5 vCPU | 62.5% |
| Node B | 4 vCPU | 2.0 vCPU | 2.0 vCPU | 50.0% |
| Node C | 4 vCPU | 2.5 vCPU | 1.5 vCPU | 62.5% |
| Cluster Total | 12 vCPU | 7.0 vCPU | 5.0 vCPU | 58.3% |
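The table makes the fragmentation problem easy to verify. The sketch below takes the idle CPU figures from the table and checks whether a single pod can be placed. The 3 vCPU pod size is an assumption for illustration, since the example only says the pod needs "a big chunk of CPU", and the check ignores memory and every other scheduling rule.

```python
# Idle vCPU per node, taken from the table above.
idle_cpu = {"Node A": 1.5, "Node B": 2.0, "Node C": 1.5}


def fits_on_some_node(pod_cpu: float, idle: dict[str, float]) -> bool:
    """A pod is schedulable only if one single node has enough free CPU."""
    return any(free >= pod_cpu for free in idle.values())


total_idle = sum(idle_cpu.values())
largest_gap = max(idle_cpu.values())

# Hypothetical pod size, not a figure from the article.
large_pod_cpu = 3.0

print(f"Total idle CPU:     {total_idle} vCPU")
print(f"Largest single gap: {largest_gap} vCPU")
print(f"Pod fits somewhere: {fits_on_some_node(large_pod_cpu, idle_cpu)}")
```

The cluster has 5 idle vCPUs in total, but no single node has more than 2 free, so the pod stays pending until a fourth server is added.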
How Karpenter Fixes the Money Leak
Karpenter is an open-source tool created by AWS. It completely changes how Kubernetes provisions servers by throwing away the old idea of using fixed, identical node groups.
Provisioning the Exact Right Size
If pending pods need 1.5 vCPUs, Karpenter evaluates a wide range of available AWS instance types and picks one that fits the workload efficiently. By choosing a close-fitting size at provisioning time instead of a predefined one, Karpenter avoids much of the waste of the old method.
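To make the idea concrete, here is a deliberately simplified sketch of "pick the cheapest server that fits". It is not Karpenter's actual algorithm, which also weighs things like availability, system overhead, and scheduling constraints. The catalog below is hypothetical: the names, sizes, and prices are invented placeholders, not real AWS instance types or prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpu: float
    memory_gb: float
    hourly_price: float  # placeholder value, not a real AWS price


# Hypothetical catalog, invented for illustration only.
CATALOG = [
    InstanceType("tiny", 1, 2, 0.02),
    InstanceType("small", 2, 4, 0.04),
    InstanceType("medium", 2, 8, 0.05),
    InstanceType("large", 4, 16, 0.10),
]


def cheapest_fit(cpu_needed, memory_needed_gb, catalog):
    """Return the cheapest instance type that can hold the pending pods, or None."""
    candidates = [
        it for it in catalog
        if it.vcpu >= cpu_needed and it.memory_gb >= memory_needed_gb
    ]
    return min(candidates, key=lambda it: it.hourly_price, default=None)


# The three small pods from earlier: 1.5 CPUs and 3 GB in total.
choice = cheapest_fit(cpu_needed=1.5, memory_needed_gb=3, catalog=CATALOG)
print(choice.name)
```

With this made-up catalog the pods land on the "small" type rather than the "large" one a fixed node group would have launched.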
Active Cleanup (Consolidation)
Karpenter constantly watches the cluster's servers. In the three-server example above, if scheduling rules allow it, Karpenter works out that it can safely move the pods from Server B into the free space on Server A and Server C.
It moves the pods carefully (respecting Pod Disruption Budgets), then shuts down Server B. This leaves the cluster tightly packed, requiring fewer servers. When Grafana Labs migrated to Karpenter, their idle capacity ratios dropped by an average of 50%.
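A rough way to picture the consolidation check is a bin-packing test: can every pod on one node be placed into the free space of the others? The sketch below uses a first-fit-decreasing pass over CPU only. It is an illustration, not Karpenter's real logic, which also accounts for memory, Pod Disruption Budgets, affinity rules, and price. The individual pod sizes on Node B are assumptions; the table only gives the 2.0 vCPU total.

```python
def can_drain(pods_cpu, free_cpu):
    """First-fit-decreasing check: can every pod be placed on some other node?

    Returns a {node: [pod sizes]} placement, or None if any pod has nowhere to go.
    """
    free = dict(free_cpu)  # copy so the caller's data is not modified
    placement = {node: [] for node in free}
    for pod in sorted(pods_cpu, reverse=True):
        target = next((n for n, cpu in free.items() if cpu >= pod), None)
        if target is None:
            return None
        free[target] -= pod
        placement[target].append(pod)
    return placement


# Free CPU on the nodes that would remain, from the table above.
free_elsewhere = {"Node A": 1.5, "Node C": 1.5}

# Hypothetical split of Node B's 2.0 vCPU into three pods.
print(can_drain([1.0, 0.5, 0.5], free_elsewhere))

# The same 2.0 vCPU as one single pod cannot be moved.
print(can_drain([2.0], free_elsewhere))
```

The second call shows why consolidation depends on pod shapes as well as totals: one 2.0 vCPU pod has no home, so Node B would have to stay.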
Faster Provisioning Means Less Overprovisioning
Because Karpenter can react much faster to changing demand, many teams can reduce the amount of expensive standby capacity they keep running.
The Real Cost Savings: High-Scale Clusters
So, is Karpenter worth the hard work to set up? We have to look at the real numbers.
The money you save depends entirely on how messy your cluster is right now. A good way to guess your savings is this simple formula:
Monthly Savings = Current Monthly Compute Spend × Expected Waste Reduction
Across many real-world examples, companies usually see their compute costs drop by 20% to 40% in the first few months. For really messy setups, the savings can be over 50%.
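As a quick worked example of the formula, the sketch below applies the 20% and 40% ends of that range to a monthly compute spend. The 50,000 USD figure is a hypothetical input, not a number from any of the companies mentioned.

```python
def monthly_savings(current_monthly_compute_spend: float,
                    expected_waste_reduction: float) -> float:
    """Monthly Savings = Current Monthly Compute Spend x Expected Waste Reduction."""
    return current_monthly_compute_spend * expected_waste_reduction


spend = 50_000  # hypothetical monthly compute spend in USD

for reduction in (0.20, 0.40):
    saved = monthly_savings(spend, reduction)
    print(f"{reduction:.0%} reduction -> ${saved:,.0f} saved per month")
```

Swap in your own compute line item from the AWS bill to get a first estimate.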
Expected Savings by Cluster Size
Here is what you can realistically expect when moving from the old way to Karpenter:
| Cluster Size | Typical Waste Today | Expected Saving | Payback Period |
|---|---|---|---|
| Small (under 20 nodes) | 10–20% wasted space | 5–10% bill reduction | 2+ years — not worth it |
| Medium (around 50 nodes) | 25–35% wasted space | 20–30% bill reduction | 6–8 months |
| Large (100+ nodes) | 35–50% wasted space | 30–40% bill reduction | A few weeks |
| Huge (200+ nodes) | 40–60%+ wasted space | 40–50%+ bill reduction | Almost immediate |
In poorly optimized 100 node clusters, it is not unusual to find the equivalent of dozens of servers worth of unused capacity.
When you have hundreds of servers, fixing this saves a massive amount of money. For example, Salesforce moved over 1,000 EKS clusters to Karpenter. They saw cost savings right away, and the move cut their manual server management work by 80%.
Spot Instances (The Cheap Servers)
The biggest way to save money on AWS is using "Spot Instances." Spot Instances are spare AWS capacity that AWS sells at significant discounts, sometimes as much as 90% off On-Demand pricing.
The Old Problem with Spot Servers
If the old Autoscaler relies on one specific type of Spot server, and AWS runs out of that type, your cluster gets stuck. Your apps slow down and things break. Engineers try to fix this by writing complicated rules, but it requires constant babysitting.
How Karpenter Makes Spot Servers Easy
You just tell Karpenter it is allowed to use a wide variety of server types. If one type of Spot server runs out, Karpenter quickly launches a different type instead. Many teams configure Karpenter to fall back to regular On-Demand servers when Spot capacity becomes unavailable so applications stay online.
Karpenter also improves Spot usage by spreading workloads across many different Spot capacity pools. This gives it more options when capacity changes and helps reduce the risk of interruptions.
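For a sense of what "telling Karpenter" looks like, here is a minimal sketch that builds a NodePool manifest as a Python dictionary and prints it as JSON. It assumes the `karpenter.sh/v1` API and the AWS provider's `EC2NodeClass`. The names `general-purpose` and `default` are hypothetical, and the referenced node class is assumed to exist already. Check the field names against the Karpenter version you run before using anything like this.

```python
import json

node_pool = {
    "apiVersion": "karpenter.sh/v1",  # assumes the v1 NodePool API
    "kind": "NodePool",
    "metadata": {"name": "general-purpose"},  # hypothetical name
    "spec": {
        "template": {
            "spec": {
                # Hypothetical EC2NodeClass, assumed to exist in the cluster.
                "nodeClassRef": {
                    "group": "karpenter.k8s.aws",
                    "kind": "EC2NodeClass",
                    "name": "default",
                },
                "requirements": [
                    # Allow both Spot and On-Demand capacity.
                    {
                        "key": "karpenter.sh/capacity-type",
                        "operator": "In",
                        "values": ["spot", "on-demand"],
                    },
                    # Allow several instance families so there are many pools to draw from.
                    {
                        "key": "karpenter.k8s.aws/instance-category",
                        "operator": "In",
                        "values": ["c", "m", "r"],
                    },
                ],
            }
        }
    },
}

print(json.dumps(node_pool, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON as well as YAML. The fewer restrictions in `requirements`, the more capacity pools Karpenter has to choose from.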
Real companies use this to save huge amounts of money. A tech company called Grover moved to Karpenter and significantly expanded its use of Spot capacity across production workloads. Another company called Tinybird used Karpenter's Spot features to cut their testing costs by an amazing 90%.
The Hard Work Nobody Talks About
To know whether Karpenter is really worth it, you must count the engineering hours required to install it. The migration requires expensive senior engineers, deep testing, and careful planning.
The Migration Effort
- IAM Roles: Karpenter needs powerful permissions to launch and delete AWS servers. Engineers must build, test, and audit these security roles very carefully.
- Network Redesign: Because Karpenter creates and removes nodes more dynamically, engineers must check subnet sizing, IP availability, and ENI limits before turning Karpenter on.
- Pod Disruption Budgets: Because Karpenter actively moves pods around to save money, it can accidentally crash applications if proper protections are not in place. Every app needs Pod Disruption Budgets (PDBs) telling Karpenter how many pods must stay running at all times.
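As an illustration of the last item, the sketch below generates a Pod Disruption Budget manifest using the standard Kubernetes `policy/v1` API. The helper name, the `checkout` app label, and the choice of `minAvailable: 2` are all hypothetical; the right number depends on how many replicas each app runs.

```python
import json


def pod_disruption_budget(app_label: str, min_available: int) -> dict:
    """Build a PDB that keeps at least `min_available` pods of an app running."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{app_label}-pdb"},
        "spec": {
            "minAvailable": min_available,
            # Must match the labels on the app's pods.
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


# Hypothetical app: keep at least 2 "checkout" pods up during node drains.
pdb = pod_disruption_budget("checkout", min_available=2)
print(json.dumps(pdb, indent=2))
```

A budget like this only protects an app that has more replicas than `minAvailable`; with exactly 2 replicas it would block every voluntary drain.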
The Two-Week Plan
In the second week, the team starts moving the real web traffic over. Finally, during a safe, off-peak maintenance window, they turn on the consolidation feature and slowly drain and delete the old Cluster Autoscaler node groups.
Counting the Cost
The big financial question is: How fast do the AWS savings pay for the high cost of a migration? You must count the migration expenses, the senior engineering effort, and the weeks spent on stability testing.
The Break-Even Point
The Break-Even Point is the moment when your accumulated cost savings from Karpenter are equal to the total one-time cost of migration (engineering hours, testing, etc.). The goal is to reach this point as quickly as possible. We use this simple methodology:
Break-Even Point = Total Migration Cost / Monthly Savings.
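Here is a small worked example of that formula. The 100 hours sits inside the 80 to 120 hour range given in the FAQ below. The hourly rate and the monthly savings figure are hypothetical inputs chosen only to show the calculation.

```python
def break_even_months(total_migration_cost: float, monthly_savings: float) -> float:
    """Break-Even Point = Total Migration Cost / Monthly Savings (in months)."""
    if monthly_savings <= 0:
        raise ValueError("monthly savings must be positive to ever break even")
    return total_migration_cost / monthly_savings


engineering_hours = 100   # within the 80 to 120 hour range quoted in the FAQ
hourly_rate = 150         # hypothetical loaded cost per engineering hour, USD
migration_cost = engineering_hours * hourly_rate

savings_per_month = 10_000  # hypothetical monthly savings, USD

months = break_even_months(migration_cost, savings_per_month)
print(f"Migration cost: ${migration_cost:,}")
print(f"Break-even after {months:.1f} months")
```

If the same migration only saved 500 USD a month, the break-even point would stretch to 30 months, which is the situation small clusters tend to be in.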
When Karpenter is NOT Worth It
- Small Clusters (Under 20 Nodes): As the math above shows, small clusters do not spend enough money to matter. Engineering time is better spent building better apps instead of wrestling with Karpenter.
- Boring, Stable Clusters: If a cluster just runs an internal tool and requires exactly 10 servers every day of the year, Karpenter won't help much. Karpenter is good at handling fast changes. If nothing ever changes, Karpenter cannot save money.
- Company Already Bought Fixed AWS Plans: If a company signed a 3-year contract with AWS for a specific type of server (Standard Reserved Instances or EC2 Instance Savings Plans), the budget is already locked in. Even if Karpenter turns those servers off, AWS still charges for the commitment. (Note: flexible "Compute Savings Plans" work well with Karpenter, but strict "Standard RIs" cancel out most of the benefit.)
- EKS Auto Mode is Better for Greenfield: In late 2024, AWS released "EKS Auto Mode." It is a managed service that automates much of the infrastructure management work that teams traditionally used Karpenter for. (However, if there is an older, complex cluster with custom CNI configs, Auto Mode is very hard to retrofit).
When Karpenter is a Financial Necessity
- Big Clusters (100+ Nodes): At this size, messy, underutilized servers are practically guaranteed. A 100-node cluster running half-empty is burning massive amounts of company money. Deciding not to migrate is deciding to waste cash.
- Super Fast, Churning Workloads: If developers run thousands of tiny automated tests every day that start and stop in minutes (CI/CD jobs), the old Autoscaler cannot keep up. Karpenter is built for this. It brings up servers quickly and removes them soon after they empty out. This is exactly how Tinybird cut their testing costs by 90%.
- Management Demands Spot Instances: If management requires aggressive Spot adoption to reduce costs, Karpenter is often one of the easiest and most operationally efficient ways to do it safely. Its ability to switch quickly between dozens of server types makes it far easier to stay online when AWS suddenly takes Spot servers away.
Simple Decision Checklist
If you are facing pressure to cut your cloud bill, use this simple checklist. If you answer "Yes" to more than three of these, Karpenter is worth the hard work.
| Check This | Team Situation |
|---|---|
| Does the cluster have more than 40 active servers? | Yes / No |
| Is cluster memory or CPU often less than 65% full? | Yes / No |
| Do apps slow down because the old Autoscaler takes several minutes to add servers? | Yes / No |
| Are there jobs (like batch processing or CI/CD) that start and stop rapidly all day? | Yes / No |
| Is there an executive order to use more cheap Spot servers? | Yes / No |
| Is the AWS bill for raw compute power painfully high? | Yes / No |
If you answered Yes to 4 or more: Your current setup is most likely giving free money to AWS. The Karpenter migration will pay for itself.
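If you want to run the checklist across several clusters, it can be written as a tiny scoring function. This is only a convenience sketch of the rule above (4 or more "Yes" answers); the question wording is shortened and the function name is made up.

```python
CHECKLIST = [
    "More than 40 active servers?",
    "CPU or memory often less than 65% full?",
    "Apps slow down while the old Autoscaler adds servers?",
    "Jobs that start and stop rapidly all day?",
    "Executive order to use more Spot servers?",
    "AWS compute bill painfully high?",
]


def karpenter_worth_it(answers: list[bool]) -> bool:
    """Apply the checklist rule: 4 or more 'Yes' answers."""
    if len(answers) != len(CHECKLIST):
        raise ValueError(f"expected {len(CHECKLIST)} answers, got {len(answers)}")
    return sum(answers) >= 4


# Hypothetical cluster: yes to everything except the Spot and CI/CD questions.
example = [True, True, True, False, False, True]
print(karpenter_worth_it(example))
```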
Frequently Asked Questions
How does Karpenter actually save money?
It fixes underutilized servers in three ways. First, it provisions a server sized to the workload (workload-driven sizing) instead of defaulting to massive ones. Second, Karpenter consolidation actively moves pods around to empty out half-used servers and turns them off. Third, it makes Spot instances much safer to use by automatically diversifying across server types. Most teams see a 20% to 40% drop in their bills.
Is Karpenter actually cheaper than Cluster Autoscaler?
Usually, yes, because the two tools scale in different ways. The old Cluster Autoscaler can only scale the node groups that operators have already configured (node-group-driven scaling), which often leaves large amounts of empty space. Karpenter, on the other hand, looks at the resources pending pods need and launches servers whose shape and size match the active workloads.
What cluster size benefits most from Karpenter?
For clusters with more than 50 servers, it pays for itself fast. With over 100 servers, the massive savings pay back the setup cost in just a few weeks. If there are under 20 servers, the financial return is usually not worth the engineering headache.
Does Karpenter support Spot Instances out of the box?
Yes, and it is much easier than legacy setups. The old way requires writing complicated mixed-instance policies in Auto Scaling Groups. With Karpenter, teams simply list Spot as an allowed capacity type in the NodePool, and Karpenter handles the rest. Many teams also configure it to fall back to On-Demand capacity if Spot capacity runs out.
Can all old Managed Node Groups be deleted?
Almost completely. For safety, one extremely small group of regular servers should be kept just to run the Karpenter background process itself. This stops Karpenter from accidentally deleting the server it is running on during a consolidation loop. All other workloads can be fully handed over to Karpenter.
How long does a Cluster Autoscaler to Karpenter migration typically take?
For a real production cluster, expect it to take about two weeks and around 80 to 120 hours of senior engineering time. Even with experienced engineers available, teams must carefully fix security permissions (IAM/IRSA), check network limits (VPC IP math and ENI limits), write Pod Disruption Budgets, and move workloads over slowly to avoid breaking anything.






