Cost optimization for Kubernetes on AWS using the right sizing means ensuring that your cloud resources (like virtual machines, storage, and computing power) are the right size for your needs—not too big or too small. Optimization helps save money and improve efficiency.
In simple terms, imagine you have a car rental business. If you rent a big truck to deliver a small package, you are wasting money on fuel and space. But if you rent a tiny car when you need to deliver large furniture, you will struggle to fit everything in. Similarly, 'right-sizing' in Kubernetes is about adjusting the size of your resources so they match the workload perfectly—no more, no less.
The Quintype team thoroughly investigated a Kubernetes bin packing issue encountered by one of our technology clients. This blog outlines our approach to resolving the problem and ensuring the client's Kubernetes infrastructure operates at peak efficiency.
Kubernetes cost optimization is all about getting the best bang for your buck when running your applications in the cloud. It is like finding the perfect balance between using just enough resources for your apps to run smoothly without splurging on things you don't need.
By carefully managing these resources, organizations can achieve cost savings. "Right-sizing" your resources and allocating them for maximum performance and optimal efficiency is not as simple as it seems. When you "right size" your Kubernetes clusters on AWS:
You adjust the size of computing resources (like CPU and memory) so that they can handle your applications effectively without using excess resources.
You ensure you are not overpaying for resources sitting idle and not being used.
You avoid performance issues by ensuring adequate application capacity during peak usage.
It is about balancing cost and performance by tweaking your cloud setup to suit your needs. Let's get into this one by one.
The Bin Packing Problem in Kubernetes refers to the challenge of efficiently scheduling and placing containerized workloads (pods) onto a limited number of nodes (servers) to minimize resource wastage.
The problem is named after the classic "bin packing problem" in computer science, where the goal is to fit items of varying sizes into the fewest bins without exceeding the bin capacity.
Cost Optimization Challenge
Kubernetes can dynamically scale resources, which makes cost estimation difficult. Various factors influence costs, including the number of pods, nodes, load balancing, storage, and data transfer. This complexity can make it difficult to predict or control costs effectively.
Dynamic Resource Allocation & Over-Provisioning
Kubernetes dynamically adjusts resources based on demand, which often results in overprovisioning without proper configuration. Brands end up paying for more CPU, memory, or storage than they need, leading to increased costs.
Idle Resources
Underutilized or idle resources, such as EC2 instances with low CPU or memory usage, incur unnecessary costs. Kubernetes may leave nodes underutilized when they cannot efficiently pack workloads, contributing to EC2 costs without corresponding utilization.
EKS Control Plane Costs
Amazon Elastic Kubernetes Service (EKS) charges a flat rate per EKS cluster for control plane operations, which can add up if multiple clusters are deployed, even if they are not handling high workloads.
High Network Costs
Networking costs can become significant in distributed Kubernetes applications, including data transfer between different AWS services or regions and traffic routing through Elastic Load Balancers (ELB).
The Quintype team thoroughly investigated the Kubernetes bin packing issue faced by one of our technology clients. We found that the cluster nodes were not efficiently utilized due to suboptimal pod placement, leading to fragmented resource allocation.
We identified a mismatch between requests and limits that led to ineffective utilization and instability of the clusters.
TL;DR
Requests and Limits
A resource request is the amount of CPU and memory a container asks the scheduler to reserve when it is placed on a node.
A resource limit is the maximum amount of CPU and memory a container can use while running. It acts as a ceiling to prevent containers from consuming more resources than intended.
A request/limit mismatch occurs when requests and limits are not aligned with actual usage. It shows up as either over-provisioning or under-provisioning (see the sketch after the two cases below).
Over-Provisioning: Setting resource requests that are too high relative to actual usage. Kubernetes may schedule fewer containers per node than it can handle, leading to wasted resources. This results in more nodes being provisioned than necessary, increasing cloud costs.
Under-Provisioning: Setting resource requests too low. Kubernetes might pack too many containers on a single node, leading to CPU or memory contention. As a result, the node could become unstable or unresponsive, causing containers to crash or get evicted.
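In Kubernetes, both values are declared per container in the pod spec. A minimal sketch, where the pod name, image, and numbers are illustrative rather than taken from the client's workloads:

apiVersion: v1
kind: Pod
metadata:
  name: app                      # hypothetical pod name
spec:
  containers:
    - name: app
      image: nginx:1.25          # placeholder image
      resources:
        requests:
          cpu: "250m"            # what the scheduler reserves on the node
          memory: "256Mi"
        limits:
          cpu: "500m"            # hard ceiling; CPU beyond this is throttled
          memory: "512Mi"        # exceeding this gets the container OOM-killed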
Because of this mismatch, the client faced excessive cloud costs and operational inefficiencies. To address this, our team implemented a structured four-step solution to optimize resource distribution, achieve better node utilization, and reduce overall expenses.
Root Cause Investigation
The Quintype team analyzed Kubernetes scheduling behaviors using kubectl, Prometheus, and CloudWatch to identify node resource fragmentation and inadequate pod placement patterns leading to suboptimal utilization.
Data Profiling and Analysis
Profiled CPU and memory usage trends with PromQL and metrics server data, uncovering mismatches between resource requests and usage that triggered unnecessary node provisioning.
Optimization Strategy Development
The Quintype team designed an optimization plan by recalibrating pod resource requests/limits, refining autoscaler policies, and implementing custom scheduling configurations (pod affinity/anti-affinity and taints/tolerations) to achieve efficient bin packing, as sketched below.
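For readers less familiar with those primitives, here is a minimal sketch of what such scheduling constraints look like; the worker deployment, the app=web selector, and the dedicated=batch taint are hypothetical, not the client's actual names:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker                           # hypothetical workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      affinity:
        podAffinity:                     # prefer co-locating with web pods to pack nodes tighter
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: web
                topologyKey: kubernetes.io/hostname
      tolerations:                       # allow scheduling onto a tainted, dedicated node group
        - key: "dedicated"
          operator: "Equal"
          value: "batch"
          effect: "NoSchedule"
      containers:
        - name: worker
          image: busybox:1.36            # placeholder image
          command: ["sleep", "infinity"]
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"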
Execution and Validation
We deployed optimized configurations using Helm and manifest updates, re-engineered node pools, and continuously monitored improvements with Prometheus and Grafana, resulting in better resource utilization and reduced cloud costs.
The team optimized resources systematically to address the Kubernetes bin packing issue. This consists of a series of technical steps to identify resource inefficiencies, fine-tune configurations, and implement automated monitoring for sustained efficiency.
Assessing Resource Usage
Right-Sizing Strategies for Maximum Efficiency
Optimizing Compute Resources on AWS
Automating Resource Usage Monitoring Using CAST AI
To fully understand the cluster's resource utilization, we initiated an in-depth profiling process using Kubernetes-native monitoring tools like Prometheus and the Metrics Server.
The process involved collecting real-time CPU, memory, and storage metrics at multiple levels—container, pod, and node—to get a granular view of resource consumption.
The data collection phase focused on identifying patterns and trends, such as average utilization, peak usage, and idle resource availability.
We analyzed these metrics to detect potential bottlenecks, resource contention points, and underutilized nodes contributing to inefficient bin packing and suboptimal pod scheduling.
Next, we visualized the gathered data using custom dashboards in Grafana to highlight areas of concern, such as nodes constantly running under 20% capacity or pods consuming more than 80% of their allocated resources. By correlating the usage patterns with pod resource requests and limits, we identified discrepancies that led to resource wastage.
For example, we observed that many pods had excessively high resource requests compared to their actual usage, causing the scheduler to unnecessarily spread them across multiple nodes.
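A sketch of the kind of request-versus-usage comparison behind that finding, written as a Prometheus recording rule; it assumes cAdvisor and kube-state-metrics metrics are available, and exact metric names can vary with the kube-state-metrics version:

groups:
  - name: resource-efficiency
    rules:
      - record: namespace_pod:cpu_request_utilization:ratio
        # CPU actually used over the last 5 minutes divided by the CPU requested, per pod;
        # values far below 1 point at over-requested workloads
        expr: |
          sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
          /
          sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})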
This deep profiling provided actionable insights, forming the foundation for the subsequent optimization steps.
After thoroughly assessing the cluster's resource usage, the next step was refining pod resource requests and limits based on data-driven insights to achieve optimal node utilization.
The right-sizing process aligns each pod's CPU and memory requests with its actual consumption patterns to ensure that resources are allocated efficiently, thereby reducing node fragmentation and overall cluster costs.
In Step 2, we used the profiling data collected in Step 1 to identify pods with excessive resource requests compared to their real-world utilization. Such over-provisioning happens when resource requests are set far higher than actual usage, causing Kubernetes to reserve more capacity than the workload ever needs.
For example, we observed that certain web application pods had CPU requests set to 1000m (1 core) while their actual peak usage never exceeded 200m.
Due to this over-provisioning, the scheduler was unable to place additional pods on the same node, resulting in wasted capacity and forcing the cluster to spin up new nodes. This increased infrastructure costs and led to inefficient bin packing, where nodes were left underutilized.
To address this, we reduced the CPU request for these pods from 1000m to 300m, aligning it more closely with the actual usage pattern while providing a safety margin for unexpected spikes. Additionally, we lowered the memory request from 2Gi to 1Gi based on observed peak consumption.
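In manifest terms, the change amounted to a container resources block roughly like the following; the requests reflect the tuning described above, while the limit values shown here are purely illustrative:

resources:
  requests:
    cpu: "300m"          # was 1000m; aligned to observed peak (~200m) plus headroom
    memory: "1Gi"        # was 2Gi; aligned to observed peak memory consumption
  limits:
    cpu: "600m"          # illustrative ceiling, kept above the request
    memory: "1536Mi"     # illustrative ceiling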
The changes mentioned above allowed the Kubernetes scheduler to pack more pods onto each node, effectively utilizing available resources and reducing the number of nodes required to run the workloads.
By implementing this right-sizing strategy across multiple services, we achieved a more balanced distribution of resources across the cluster, resulting in a 40% reduction in EC2 node count and a 55% improvement in overall resource utilization.
Furthermore, we established dynamic resource allocation policies using Kubernetes LimitRange and ResourceQuota objects to enforce consistent resource allocation standards across namespaces. This ensured that pods would not over-request resources in the future, maintaining the efficiency gains achieved through right-sizing.
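A minimal sketch of those guardrails, assuming a hypothetical namespace called web and illustrative defaults:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: web                 # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:            # applied when a container omits its own requests
        cpu: "200m"
        memory: "256Mi"
      default:                   # applied when a container omits its own limits
        cpu: "500m"
        memory: "512Mi"
      max:                       # upper bound a single container may claim
        cpu: "2"
        memory: "4Gi"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
  namespace: web
spec:
  hard:
    requests.cpu: "10"           # total CPU the namespace may request
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"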
With the new limits and requests in place, the cluster was better equipped to handle variations in workload demand, minimizing resource contention and enhancing application stability.
Once the resource requests and limits were fine-tuned, the next step was to optimize the underlying compute infrastructure on AWS to enhance node utilization and minimize cloud costs.
This step primarily focused on reconfiguring node pools by selecting appropriate EC2 instance types that better matched the workload requirements and tuning the Cluster Autoscaler and Horizontal Pod Autoscaler (HPA) settings to ensure dynamic scaling aligned with resource consumption patterns.
We analyzed the existing node pool configuration and discovered that the client used a uniform set of general-purpose t3.medium instances for all workloads, regardless of their unique resource requirements.
This led to inefficient node utilization: nodes packed with memory-intensive pods ran out of memory while their CPU sat idle, whereas nodes running CPU-heavy workloads exhausted CPU while memory went unused.
To address this imbalance, we categorized workloads based on their resource usage profiles and introduced a mix of EC2 instance types—such as c5.large for CPU-bound services and r5.large for memory-bound services—to better align node capacity with pod demands.
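One way to declare such a mix, sketched in eksctl's ClusterConfig format; the cluster name, region, sizes, and labels are assumptions rather than the client's actual values:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: client-cluster           # hypothetical cluster name
  region: us-east-1              # assumed region
nodeGroups:
  - name: cpu-bound
    instanceType: c5.large       # compute-optimized pool for CPU-heavy services
    minSize: 2
    maxSize: 10
    labels:
      workload-profile: cpu
  - name: memory-bound
    instanceType: r5.large       # memory-optimized pool for memory-heavy services
    minSize: 2
    maxSize: 10
    labels:
      workload-profile: memory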
With the optimized instance types in place, we adjusted the Cluster Autoscaler settings to handle resource scaling more effectively. We configured the Autoscaler to prioritize scaling up the node groups that yield the highest packing density, thereby minimizing the number of nodes required to accommodate new pods.
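That behavior is driven by the Cluster Autoscaler's startup flags. A sketch of the relevant container args, with illustrative values rather than the client's exact settings; the least-waste expander is one way to express the packing-density preference:

# excerpt from a cluster-autoscaler Deployment spec
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0   # illustrative version
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --expander=least-waste                    # pick the node group that leaves the least idle capacity
      - --scale-down-utilization-threshold=0.5    # nodes below 50% utilization become scale-down candidates
      - --scale-down-unneeded-time=10m            # how long a node must stay unneeded before removal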
Additionally, we refined the HPA thresholds to ensure that pods scaled more accurately in response to CPU and memory usage, preventing unnecessary node provisioning during transient workload spikes.
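A minimal HPA sketch for a hypothetical web Deployment, scaling on average CPU utilization; the 70% target is illustrative:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                      # hypothetical deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # illustrative threshold, tuned per service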
To ensure sustained efficiency and stability of the optimized Kubernetes infrastructure, we implemented automated monitoring solutions to track resource usage in real time and respond proactively to any anomalies.
This involved deploying a comprehensive monitoring stack using Prometheus for data collection, Grafana for visualization, and custom alerting rules to detect deviations from expected resource consumption patterns.
First, Prometheus was configured to scrape metrics from the Kubernetes API, nodes, and individual pods at regular intervals. This setup provided granular visibility into resource utilization, such as CPU and memory consumption, at different aggregation levels.
We also integrated node-exporter and kube-state-metrics to capture additional cluster health information, including node status, pod lifecycle events, and resource requests versus limits.
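A trimmed sketch of what such a scrape configuration can look like; the job names, monitoring namespace, and discovery details are assumptions, not the client's exact config:

scrape_configs:
  - job_name: kubelet-cadvisor                 # per-container CPU/memory via the kubelet
    kubernetes_sd_configs:
      - role: node
    scheme: https
    metrics_path: /metrics/cadvisor
    authorization:
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      insecure_skip_verify: true               # sketch only; use the cluster CA in practice
  - job_name: kube-state-metrics               # requests vs limits, pod lifecycle, node status
    static_configs:
      - targets: ["kube-state-metrics.monitoring.svc:8080"]
  - job_name: node-exporter                    # host-level CPU, memory, and disk metrics
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: keep
        regex: node-exporter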
With this robust data pipeline, we could generate detailed metrics on resource usage, node availability, and workload distribution across the cluster.
Next, we created custom dashboards in Grafana to visualize the collected metrics. This enabled real-time monitoring of key performance indicators such as average CPU usage, memory utilization, and node saturation levels.
These dashboards were designed with intuitive charts, heatmaps, and trend lines that highlighted potential resource bottlenecks, underutilized nodes, and spikes in resource consumption.
For example, we set up a dashboard to monitor node utilization efficiency. The dashboard displayed the percentage of allocated versus actual usage for each node, making it easy to spot any resource imbalances.
To automate response mechanisms, we defined alerting rules in Prometheus to identify anomalies or deviations from expected behavior. These alerts were configured to trigger when predefined thresholds were crossed, such as CPU utilization exceeding 80% or memory usage falling below 30% for a sustained period.
For example, an alert was set up to detect pods exceeding their resource limits, which could indicate the need for further right-sizing. Upon triggering an alert, notifications were sent to the client's DevOps team through Slack and email, providing detailed information on the affected resources and suggested corrective actions.
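A sketch of two such rules; the thresholds mirror the examples above, but the exact PromQL expressions and rule names are illustrative:

groups:
  - name: capacity-alerts
    rules:
      - alert: NodeCPUHigh
        expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.80
        for: 10m                                # sustained, not a transient spike
        labels:
          severity: warning
        annotations:
          summary: "Node CPU above 80% for 10 minutes"
      - alert: PodNearMemoryLimit
        expr: |
          max by (namespace, pod) (container_memory_working_set_bytes{container!=""})
          /
          max by (namespace, pod) (kube_pod_container_resource_limits{resource="memory"})
          > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod memory usage above 95% of its limit"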
Additionally, we implemented automated scaling policies based on these alerts. For instance, if multiple nodes were detected as underutilized over a defined period, the Cluster Autoscaler would automatically scale down nodes to optimize cost and capacity. Conversely, if CPU or memory consumption on nodes spiked due to increased workload demand, the Horizontal Pod Autoscaler would scale out the pods to maintain application performance without manual intervention.
Deploying this automated monitoring and alerting framework ensured continuous visibility and control over the cluster's resource usage. The proactive alerting system enabled the client to respond to issues before they impacted performance, while automated scaling adjustments maintained optimal resource allocation.
This approach stabilized the Kubernetes environment and preserved the cost savings and performance improvements achieved in previous optimization steps, establishing a resilient and self-sustaining infrastructure.
With some smart tweaks and targeted optimizations, we transformed the client's Kubernetes cluster from a costly, underutilized setup into a lean, high-performing environment—saving them over 20% on AWS bills and improving efficiency across the board.
Are you tired of watching your AWS costs skyrocket while your Kubernetes cluster feels like a money pit? Let us help you reduce those costs and make your infrastructure work smarter, not harder. Reach out, and let's turn your Kubernetes headaches into cost-saving wins.