Beyond Cluster Autoscaler: Optimizing Kubernetes Costs with Karpenter and Spot Instances
Slash your AWS EKS bill and improve scaling speed. Learn how to migrate from Cluster Autoscaler to Karpenter and leverage Spot Instances effectively.
For many CTOs and engineering leads, the monthly cloud bill is a source of recurring anxiety. While Kubernetes (K8s) has revolutionized container orchestration, it has also introduced a complex layer of resource management that, if left unchecked, can bleed budget rapidly. The default scaling mechanism for years—the Kubernetes Cluster Autoscaler (CA)—is robust, but it was designed for an era of static node groups and slower scaling requirements.
In the fast-paced world of AI workloads, microservices, and fluctuating traffic, the traditional Cluster Autoscaler often results in over-provisioning and wasted compute resources. Enter Karpenter, an open-source node provisioning project built for AWS (and expanding elsewhere). When combined with the strategic use of Spot Instances, Karpenter doesn't just shave a few percentage points off your bill; it can fundamentally change the economics of your infrastructure.
In this post, we will explore why the industry is moving beyond the standard Cluster Autoscaler, how Karpenter changes the game with "groupless" autoscaling, and how to safely leverage Spot Instances to reduce compute costs by up to 90% without sacrificing reliability.
The Bottleneck: Why Cluster Autoscaler Falls Short
To understand the solution, we must first diagnose the problem. The traditional Kubernetes Cluster Autoscaler works by interfacing with AWS Auto Scaling Groups (ASGs). When a pod fails to schedule due to insufficient resources, the CA detects this pending state and increases the DesiredCapacity of a specific ASG.
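For illustration, here is a minimal pod spec that would trigger this flow (the name and image are hypothetical). Once no existing node can satisfy its requests, the pod sits in the Pending state until CA grows a node group:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-web            # hypothetical workload
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:
          cpu: "500m"       # 0.5 vCPU: enough to go Pending on a full cluster
          memory: "512Mi"
```

When the scheduler marks this pod unschedulable, CA raises the DesiredCapacity of whichever node group it decides can host it, and the pod waits for the new node to boot and join the cluster.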
While functional, this approach introduces several inefficiencies:
- The "Bin Packing" Problem: CA is constrained by the instance types defined in your node groups. If you have a node group of
m5.2xlargeinstances, but your pending pod only needs 0.5 vCPU, CA spins up a massive node, leaving the rest of that capacity wasted until other pods fill it. - Slow Spin-Up Times: The chain of command is long. K8s talks to CA, CA talks to the ASG, the ASG launches an EC2 instance, the instance boots, joins the cluster, and finally, the scheduler places the pod. This latency can take several minutes—an eternity during a traffic spike.
- Complex Management: To optimize for different workloads (GPU, High Memory, General Purpose), DevOps teams often end up managing dozens of distinct node groups. This creates operational overhead and fragmentation.
The Cluster Autoscaler simulates scheduling to determine if a node group expansion will help. Karpenter bypasses this simulation and simply asks the cloud provider for exactly what is needed.
For organizations running high-performance workloads or those looking to optimize cloud spend, these limitations are no longer just annoyances; they are financial liabilities.
Enter Karpenter: Just-in-Time Provisioning
Karpenter takes a radically different approach. It bypasses Auto Scaling Groups entirely. Instead, it acts as a direct operator between your Kubernetes cluster and the EC2 Fleet API. It observes the aggregate resource requests of your unschedulable pods and decides in milliseconds exactly which compute resources to provision.
Here is why this is a paradigm shift for Kubernetes economics:
- Groupless Autoscaling: You no longer define specific node groups for specific instance types. You define constraints (e.g., "I need x86 architecture"), and Karpenter selects the cheapest instance type that fits the pending pods. If you have a massive pod, it picks a large instance. If you have a few small pods, it picks a small instance.
- Rapid Response: By cutting out the ASG logic, Karpenter can bind pods to nodes the moment the node is created (via a feature called proactive binding), drastically reducing the time from "pending" to "running."
- Consolidation (De-provisioning): This is perhaps Karpenter's superpower. It actively watches for underutilized nodes. If it sees a node that is only 20% utilized, and those pods can fit on other existing nodes or a smaller, cheaper new node, Karpenter will automatically cordon, drain, and terminate the expensive node to consolidate the workload.
By treating the cloud as a fluid pool of resources rather than rigid groups of servers, Karpenter ensures you are paying only for the compute you actually use, minimizing the "slack" space in your cluster.
The Secret Weapon: Strategic Spot Instances
The combination of Karpenter and Amazon EC2 Spot Instances is where the ROI becomes undeniable. Spot Instances allow you to use spare EC2 capacity for up to 90% off the On-Demand price. However, the catch has always been that AWS can reclaim these instances with a 2-minute warning.
Historically, CTOs have been hesitant to run production workloads on Spot due to this volatility. Karpenter mitigates these risks significantly through intelligent diversity and rapid handling.
The Price-Capacity-Optimized Strategy
Karpenter uses the price-capacity-optimized allocation strategy. When it requests Spot instances, it doesn't just look for the cheapest option; it looks for the instance pools that have the deepest capacity (lowest chance of interruption) while still being cost-effective.
Furthermore, because Karpenter launches nodes so quickly, if a Spot instance is reclaimed, Karpenter can replace it almost instantly. For stateless applications, batch processing, or AI training jobs, this is ideal.
Here is a basic example of a Karpenter NodePool configuration that leverages Spot instances:
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-optimized
spec:
  template:
    spec:
      nodeClassRef:
        name: default   # assumes an EC2NodeClass named "default" exists
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
```

In this configuration, we tell Karpenter: "Give us Spot instances from the C, M, or R families, ensuring they are newer generations, and aggressively consolidate them when they are underutilized." This simple configuration can replace dozens of complex ASG definitions.
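To see the NodePool in action, you might deploy a simple stateless workload; as soon as its pods go Pending, Karpenter launches the cheapest Spot capacity that satisfies the requirements above. The Deployment below is a hypothetical sketch, not part of the Karpenter setup itself:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spot-demo            # hypothetical workload name
spec:
  replicas: 10
  selector:
    matchLabels:
      app: spot-demo
  template:
    metadata:
      labels:
        app: spot-demo
    spec:
      containers:
        - name: app
          image: nginx:1.25
          resources:
            requests:
              cpu: "500m"    # aggregate: 5 vCPU across 10 replicas
              memory: "256Mi"
```

Karpenter sums these requests and provisions a node (or nodes) sized to fit them, rather than a fixed instance type from a predefined node group.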
Implementing the Shift: Best Practices for Nohatek Clients
Migrating from Cluster Autoscaler to Karpenter is not a "rip and replace" operation; it can be done incrementally. However, to maximize the benefits, we recommend the following strategic steps:
- Tagging is Crucial: Ensure your subnets and security groups are properly tagged so Karpenter can discover them. Without this, the provisioner cannot launch nodes (a sketch of such an EC2NodeClass follows this list).
- Handle Interruptions Gracefully: Ensure your application handles SIGTERM signals correctly. When a Spot instance is reclaimed, your pods have at most two minutes to shut down. Implementing PodDisruptionBudgets is essential to ensure high availability during Karpenter's consolidation actions (a sample budget follows this list).
- Split Workloads: We often advise a hybrid approach. Keep your critical control plane components and stateful databases on On-Demand instances (managed by a small NodePool, sketched at the end of this list), and offload your stateless microservices and batch jobs to the Spot-backed NodePools.
- Monitor the Savings: Use tools like Kubecost or AWS Cost Explorer. You should see a distinct drop in EC2 costs, but an increase in the number of instance types used. This diversity is a sign of a healthy, optimized Karpenter cluster.
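For the tagging step, Karpenter discovers networking resources through selector terms on an EC2NodeClass. Here is a minimal sketch, assuming your subnets and security groups carry a karpenter.sh/discovery tag whose value is your cluster name (replace my-cluster and the IAM role name, which are placeholders):

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2                       # Amazon Linux 2 AMIs
  role: KarpenterNodeRole-my-cluster   # hypothetical node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```

This is the EC2NodeClass that the NodePool's nodeClassRef points to; if the tags are missing or mismatched, Karpenter finds no subnets and silently fails to launch nodes.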
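For graceful interruptions, a PodDisruptionBudget caps how many replicas Karpenter's consolidation (or a Spot reclaim drain) may evict at once. A minimal sketch for the hypothetical spot-demo Deployment shown earlier:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: spot-demo-pdb
spec:
  minAvailable: "80%"        # keep at least 8 of 10 replicas running during drains
  selector:
    matchLabels:
      app: spot-demo
```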
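And for splitting workloads, critical components can be pinned to a small On-Demand pool while everything else falls through to Spot. A sketch, assuming the same "default" EC2NodeClass:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand-critical
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  limits:
    cpu: 64                          # keep the expensive pool small
  disruption:
    consolidationPolicy: WhenEmpty   # avoid shuffling critical pods
```

Critical pods then target this pool with a nodeSelector on karpenter.sh/capacity-type: on-demand, while unpinned workloads land on the cheaper Spot-backed NodePool.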
By automating the selection of compute resources, you free your DevOps team from the toil of capacity planning. They no longer need to guess how many nodes you will need next Black Friday; the system will decide in real-time.
The era of static infrastructure planning is fading. In a modern cloud environment, agility and cost-efficiency are two sides of the same coin. Karpenter represents the next generation of Kubernetes autoscaling—smarter, faster, and significantly cheaper when paired with Spot instances.
While the transition involves architectural considerations regarding statelessness and fault tolerance, the financial upside is too significant to ignore. At Nohatek, we help organizations navigate these cloud-native transitions, ensuring that your infrastructure isn't just keeping the lights on, but actively contributing to your bottom line.
Ready to optimize your Kubernetes footprint? Contact the Nohatek cloud engineering team today for an infrastructure audit.