Stop Waiting for Nodes: Mastering Just-in-Time Provisioning on EKS with Karpenter and Spot Instances
Slash EKS costs and eliminate pending pods. Discover how Karpenter's just-in-time provisioning and AWS Spot Instances revolutionize Kubernetes scaling.
For any organization running Kubernetes at scale, the "Pending" pod status is the stuff of nightmares. It represents latency, wasted potential, and ultimately, a poor user experience. Traditionally, Amazon Elastic Kubernetes Service (EKS) relied on the Cluster Autoscaler—a robust, albeit rigid, tool tied to AWS Auto Scaling Groups (ASGs). While effective, it often feels like trying to steer a cruise ship when you need the agility of a speedboat.
In the modern cloud-native landscape, agility and cost-efficiency are paramount. CTOs and DevOps leads are constantly balancing the need for high availability with the mandate to reduce cloud spend. This is where the convergence of Karpenter and AWS Spot Instances creates a paradigm shift.
At Nohatek, we have helped numerous clients migrate from static node groups to dynamic, intent-based scaling. In this post, we will explore how Karpenter’s just-in-time provisioning allows you to stop waiting for nodes, and how integrating Spot Instances can reduce your compute costs by up to 90% without sacrificing reliability.
The Friction of Traditional Autoscaling
To understand the solution, we must first diagnose the problem. The standard Kubernetes Cluster Autoscaler (CAS) works by adjusting the size of AWS Auto Scaling Groups. When a pod fails to schedule due to resource constraints, CAS instructs the ASG to spin up a new node.
However, this approach introduces significant friction:
- Latency: Bootstrapping a node via an ASG can take several minutes. During this time, your application is stalling.
- Constraint Rigidity: ASGs are typically homogeneous. If you have an ASG for m5.large instances, but a pod requires a massive amount of memory, CAS cannot simply provision a memory-optimized node unless you have pre-configured a specific node group for it.
- Over-provisioning: To mitigate slow scaling, teams often over-provision their clusters, paying for idle compute capacity just to provide a safety buffer.
"The Cluster Autoscaler was designed for a world where infrastructure changed slowly. Karpenter is designed for a world where infrastructure is ephemeral."
This rigidity forces engineers to manage infrastructure rather than applications. You end up managing dozens of node groups, complex labeling strategies, and taints/tolerations just to get workloads to land where they fit.
Enter Karpenter: Groupless, Just-in-Time Provisioning
Karpenter is an open-source node provisioning project built for Kubernetes. Unlike the Cluster Autoscaler, Karpenter bypasses the Auto Scaling Group entirely. It sits within your cluster and observes the Kubernetes API server for unschedulable pods.
When Karpenter sees a pending pod, it evaluates the resource requirements (CPU, memory, volume topology) and scheduling constraints (affinity, anti-affinity) in real time. It then calls the AWS EC2 Fleet API directly to provision exactly the right node for that workload.
Here is why this is revolutionary for EKS environments:
- Speed: By skipping the ASG abstraction, Karpenter can bind a pod to a node in seconds, not minutes.
- Bin Packing: Karpenter is an expert at "Tetris." It looks at the aggregate resource requests of all pending pods and calculates the most efficient instance type to house them. If you have a batch of small jobs, it might pick a large instance to host them all. If you have a massive AI training job, it picks a GPU-optimized instance.
- Consolidation: Karpenter isn't just about scaling up; it's aggressive about scaling down. It constantly looks for underutilized nodes, evicts the pods to other nodes with spare capacity, and terminates the empty node to stop the billing meter.
This "groupless" approach simplifies operations significantly. You define NodePools (formerly Provisioners) using YAML, allowing you to set high-level constraints without managing individual servers.
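Consolidation can also be tuned per workload. As a sketch, a pod that should never be voluntarily moved (for example, a long-running batch job) can opt out of consolidation with the `karpenter.sh/do-not-disrupt` annotation; the pod name and image below are illustrative, and the annotation assumes the v1beta1 APIs:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: long-batch-job            # hypothetical workload name
  annotations:
    # Tells Karpenter not to voluntarily evict this pod during consolidation
    karpenter.sh/do-not-disrupt: "true"
spec:
  containers:
    - name: worker
      image: batch-worker:latest  # placeholder image
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
```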
The Spot Instance Strategy: High Availability at Low Cost
The technical agility of Karpenter unlocks the financial power of AWS Spot Instances. Spot Instances are spare EC2 capacity available at up to a 90% discount compared to On-Demand prices. The catch? AWS can reclaim them with a two-minute warning.
Traditionally, using Spot with EKS was risky. If AWS reclaimed a node, the ASG might be slow to replace it, causing downtime. Karpenter changes this dynamic entirely.
Handling Interruption Gracefully
Karpenter natively handles Spot Interruption Notifications. When AWS sends a reclamation signal, Karpenter immediately begins cordoning the node and provisioning a replacement before the instance is terminated. This "brake-and-turn" capability makes Spot viable for a much wider range of workloads, including microservices and API servers.
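For this to work end to end, your pods must also shut down cleanly within the two-minute warning. A minimal sketch of a Spot-friendly Deployment (the service name, image, and timings are illustrative assumptions):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server                # hypothetical service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      # Must complete well inside the ~120s Spot reclamation window
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: example/api:latest   # placeholder image
          lifecycle:
            preStop:
              exec:
                # Brief pause so load balancers can deregister before SIGTERM
                command: ["sleep", "10"]
```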
Price-Capacity-Optimized Strategy
To maximize availability, you shouldn't bet on a single instance type. Karpenter allows you to define a wide array of acceptable instance families. Below is an example of a NodePool configuration that prioritizes Spot instances but allows for diversification:
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
      nodeClassRef:
        name: default
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
```

In this configuration, Karpenter is instructed to look for Spot capacity across the Compute-Optimized (c), General Purpose (m), and Memory-Optimized (r) families. If the cheapest option is out of stock, it instantly moves to the next cheapest option, ensuring your pods always have a place to run.
Implementation: Best Practices for Production
While Karpenter simplifies scaling, implementing it in a production environment requires a strategic approach. At Nohatek, we recommend a phased rollout to ensure stability.
1. Split Your Workloads
Don't migrate everything at once. Create a separate NodePool for stateless, fault-tolerant workloads (like batch processing or CI/CD runners) to test Spot reliability. Keep your critical, stateful databases on On-Demand instances or a separate, more conservative NodePool initially.
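One way to enforce this split is a second, more conservative NodePool pinned to On-Demand capacity, which critical workloads target via a label. A sketch, assuming the v1beta1 APIs; the pool name and `workload-tier` label are illustrative:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand-critical        # illustrative name
spec:
  template:
    metadata:
      labels:
        workload-tier: critical   # critical pods select this via nodeSelector
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        name: default
  disruption:
    # More conservative than the Spot pool: only reclaim fully empty nodes
    consolidationPolicy: WhenEmpty
```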
2. Refine Your Pod Disruption Budgets (PDBs)
Because Karpenter is aggressive about consolidation (deleting nodes to save money), your application must be resilient to voluntary disruptions. Ensure you have PDBs configured so that Karpenter knows how many replicas must remain online during a reshuffling event.
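A minimal PDB sketch for a stateless service (the name and selector are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb            # illustrative name
spec:
  # Karpenter's consolidation honors this: at least 2 replicas stay online
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server
```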
3. Tagging for Cost Allocation
Karpenter dynamically creates tags for the EC2 instances it launches. Ensure your EC2NodeClass configuration includes the necessary Cost Center or Project tags. This is crucial for CTOs to visualize the savings in AWS Cost Explorer and attribute them to specific teams.
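A sketch of an EC2NodeClass carrying those tags, assuming the v1beta1 API; the role name, discovery tag, and tag values are placeholders for your own environment:

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: KarpenterNodeRole-my-cluster       # assumed node IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster # assumed discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  tags:
    # Applied to every instance Karpenter launches from this class
    CostCenter: platform-engineering       # illustrative values
    Project: eks-spot-migration
```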
Security Note: Karpenter requires permissions to launch instances. We recommend using IRSA (IAM Roles for Service Accounts) to adhere to the principle of least privilege, ensuring the controller only has the permissions it needs to manage the fleet.
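With IRSA, the Karpenter controller's ServiceAccount is annotated with a narrowly scoped IAM role. A sketch (the account ID and role name are placeholders):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: karpenter
  namespace: karpenter
  annotations:
    # Binds the controller pods to a scoped IAM role via IRSA
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/KarpenterControllerRole
```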
The combination of EKS, Karpenter, and Spot Instances represents the maturity of cloud-native infrastructure. It moves us away from static, manual capacity planning toward a truly elastic, intent-based model. For businesses, this translates to faster innovation cycles and significantly reduced cloud overhead.
However, configuring the interplay between bin-packing algorithms, interruption handling, and high-availability requirements can be complex. If you are looking to optimize your EKS clusters or need guidance on implementing just-in-time provisioning, Nohatek is here to help.
Don't let your infrastructure be the bottleneck. Contact our cloud engineering team today to schedule an architecture review and start scaling smarter.