August 22, 2022
August 2, 2023
We will explore the key role of autoscaling in optimizing performance and cost within Kubernetes, a popular container orchestration platform. Specifically, we'll delve into two critical autoscalers—Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA)—and shed light on their functionalities, benefits, and limitations. This is based on a talk I gave at our annual conference, autocon/22. You can view the video here.
In today's digital landscape, ensuring application scalability, reliability, and high availability is crucial for companies aiming to meet the demands of ever-increasing traffic. However, many organizations find themselves facing uncertainties about whether their current infrastructure can handle exponential growth. Fluctuating app traffic patterns throughout the day, coupled with the occasional need for substantial batch processing, present challenges that require careful resource management. To effectively meet these dynamic traffic demands, companies often consider allocating additional resources. However, this approach can lead to excessive spending during non-busy hours, resulting in resource wastage. Nevertheless, this wastage is a preferable alternative to experiencing detrimental outages that can tarnish a company's reputation.
According to a report by Gartner, global cloud spending is projected to surpass $482 billion in 2022. As cloud adoption continues to rise, so does the issue of wastage. Flexera's 2021 State of the Cloud Report reveals that, on average, companies waste approximately 35% of their cloud expenditure. Additionally, a study conducted by Datadog highlights that the median Kubernetes deployment utilizes only 20 to 30% of requested CPU and 30 to 40% of requested memory. This not only results in financial losses but also poses potential security risks. So, why do organizations tend to overprovision? Most commonly, they do so out of concern about performance issues and a desire to avoid disappointing their users.
As many of you may have already guessed, the solution for optimizing cost and performance lies in autoscaling. According to AWS, autoscaling is the primary pillar for achieving this goal. It monitors your applications and automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost.
In Kubernetes, there are two approaches to scaling. First, you can scale at the node level, which involves adding more nodes for horizontal scaling or changing the instance types for vertical scaling. Second, there is pod-level scaling, which can be achieved using the Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA). With HPA, you can increase the number of pods, while VPA allows you to add more resources to the pods.
The Horizontal Pod Autoscaler (HPA) (see image below) adjusts the number of pods dynamically based on computational needs. By monitoring metrics and updating pod replicas within the deployment or replication controller, the HPA adds or removes pods according to traffic demand. It primarily focuses on CPU and memory metrics and can accommodate custom metrics.
A pod can be visualized as an application instance, consisting of multiple containers that function as a cohesive unit. The Horizontal Pod Autoscaler (HPA) plays a crucial role in managing these pods. When there is an increase in computational needs due to traffic demand, the HPA adds pods dynamically. Conversely, it removes pods when there is a decrease in resource usage. Although the HPA primarily relies on CPU and memory metrics, it can also accommodate certain custom metrics.
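To make the description above more concrete, here is a minimal sketch of what an HPA definition might look like, built as a Python dictionary and printed as JSON (which `kubectl apply -f` accepts alongside YAML). The helper name `build_hpa_manifest`, the deployment name `web`, and the 70% CPU target are illustrative assumptions, not values from the talk.

```python
import json


def build_hpa_manifest(deployment_name, min_replicas, max_replicas, cpu_target_percent):
    """Return an autoscaling/v2 HorizontalPodAutoscaler manifest as a dict.

    All argument values are chosen by the caller; nothing here is a
    recommended setting.
    """
    if not 1 <= min_replicas <= max_replicas:
        raise ValueError("need 1 <= min_replicas <= max_replicas")
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment_name}-hpa"},
        "spec": {
            # The workload whose replica count the HPA adjusts.
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment_name,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            # Scale on average CPU utilization relative to the pods' requests.
            "metrics": [
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {
                            "type": "Utilization",
                            "averageUtilization": cpu_target_percent,
                        },
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical deployment "web": 2 to 10 replicas, 70% CPU target.
    print(json.dumps(build_hpa_manifest("web", 2, 10, 70), indent=2))
```

Because utilization is measured against each pod's CPU request, a manifest like this only behaves sensibly when the target deployment declares CPU requests.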
Let's dive deeper into how the Horizontal Pod Autoscaler operates. The HPA checks metrics on a fixed interval: 15 seconds by default in current Kubernetes releases (older releases used 30 seconds). Once a predefined threshold is crossed, it updates the number of replicas in the deployment or replication controller, which adds or removes pods. A cooldown period of approximately three to five minutes follows each scale-up or scale-down operation. However, the HPA's effectiveness depends on the resources available in the cluster. If there is insufficient capacity, the additional pods cannot be scheduled and remain in a pending state.
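The scaling decision itself can be sketched in a few lines. The Kubernetes documentation describes the core rule as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is close enough to 1. The function below is a simplified illustration of that rule only; the function name, the default bounds, and the 10% tolerance default are assumptions for the example, and the real controller also handles missing metrics, pod readiness, and stabilization.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA rule: scale replicas in proportion to metric / target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within the tolerance band, leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, current_metric=90, target_metric=60))
```

With 4 pods at 90% against a 60% target, the ratio is 1.5, so the sketch asks for 6 replicas; at 63% it stays at 4 because the ratio falls inside the tolerance band.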
To address limitations related to resource availability within the cluster, the HPA can be complemented with a cluster autoscaler. The cluster autoscaler adds more nodes and resources to the cluster, ensuring effective pod scheduling and optimal resource utilization. (see image below)
Let's delve into how the cluster autoscaler operates. It periodically examines the status of pending pods, checking every 10 seconds. When it detects a pod in a pending state, it initiates communication with the cloud provider. The cloud provider then attempts to allocate a node to accommodate the pending pod. Once the node is successfully allocated, it becomes part of your cluster and becomes capable of hosting pods. Subsequently, the Kubernetes scheduler assigns the pending pods to the newly added node.
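As a rough illustration of the "pending pods lead to new nodes" step, the sketch below estimates how many identical nodes would be needed to host a set of pending pods, using a first-fit-decreasing packing on CPU and memory requests. This is a deliberate simplification and not the cluster autoscaler's actual algorithm, which simulates the scheduler and considers node groups, taints, and affinity. The `PodRequest` type and `nodes_needed` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodRequest:
    name: str
    cpu_millicores: int
    memory_mib: int


def nodes_needed(pending, node_cpu_millicores, node_memory_mib):
    """Estimate new nodes required for pending pods (first-fit decreasing)."""
    free = []  # remaining [cpu, memory] on each hypothetical new node
    ordered = sorted(pending,
                     key=lambda p: (p.cpu_millicores, p.memory_mib),
                     reverse=True)
    for pod in ordered:
        if (pod.cpu_millicores > node_cpu_millicores
                or pod.memory_mib > node_memory_mib):
            raise ValueError(f"{pod.name} does not fit on an empty node")
        for slot in free:
            # Place the pod on the first node with room for both resources.
            if slot[0] >= pod.cpu_millicores and slot[1] >= pod.memory_mib:
                slot[0] -= pod.cpu_millicores
                slot[1] -= pod.memory_mib
                break
        else:
            # No existing node fits, so "request" one more from the provider.
            free.append([node_cpu_millicores - pod.cpu_millicores,
                         node_memory_mib - pod.memory_mib])
    return len(free)


if __name__ == "__main__":
    pending = [
        PodRequest("api-1", 1500, 1024),
        PodRequest("api-2", 1500, 1024),
        PodRequest("worker-1", 500, 512),
        PodRequest("worker-2", 500, 512),
    ]
    print(nodes_needed(pending, node_cpu_millicores=2000, node_memory_mib=4096))
```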
While HPA and cluster autoscaler are effective for horizontal scaling, not all applications can easily scale in that manner. Stateful applications like databases often face challenges when it comes to horizontal scaling, as adding new pods can be complex. These applications often require techniques like sharding or read-only replicas. However, it is possible to improve their performance by adjusting CPU and memory resources. (see image below)
This is where the Vertical Pod Autoscaler (VPA) comes into play. VPA focuses on scaling the resource requests and limits of individual pods. If an application is underutilizing its allocated resources, VPA will scale down the resource requests. Conversely, if an application is overutilizing its resources, VPA will scale up the allocations.
Let's explore how the Vertical Pod Autoscaler functions. VPA monitors metrics at regular intervals, typically every 10 seconds. When a predefined threshold is reached, VPA updates the resource specifications in the deployment or replica controller. However, it's important to note that Kubernetes does not currently support in-place replacement of resources, which means that pods may need to be restarted when adjustments are made. Similar to HPA, VPA also incorporates a cooldown period of three to five minutes for scaling up and scaling down operations.
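One way to picture how a vertical autoscaler might turn observed usage into a new request is a percentile plus a safety margin. The sketch below is an assumption-laden illustration, not the VPA recommender itself, which is more sophisticated: the nearest-rank percentile, the 90th-percentile default, the 15% margin, and the function name `recommend_request` are all choices made for this example.

```python
def recommend_request(usage_samples, percentile=90, safety_margin=0.15, floor=0.0):
    """Suggest a resource request from usage samples (same unit in and out).

    percentile is an integer from 1 to 100; floor is a minimum recommendation.
    """
    if not usage_samples:
        raise ValueError("need at least one usage sample")
    if not 1 <= percentile <= 100:
        raise ValueError("percentile must be between 1 and 100")
    ordered = sorted(usage_samples)
    # Nearest-rank percentile using integer ceiling division.
    rank = max(1, -(-percentile * len(ordered) // 100))
    base = ordered[rank - 1]
    # Add headroom, then enforce the minimum.
    return max(floor, base * (1 + safety_margin))


if __name__ == "__main__":
    # Hypothetical memory usage in MB, including one spike.
    samples = [100, 120, 110, 300, 130, 125, 115, 105, 140, 135]
    print(recommend_request(samples))
    # A small application is lifted to a configured minimum.
    print(recommend_request([50, 60], floor=250))
```

Using a high percentile rather than the maximum keeps a single spike (300 in the sample data) from inflating the request, while the margin leaves room for ordinary variation.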
Determining the appropriate use cases for HPA and VPA is a key consideration. HPA proves most effective for stateless applications that can scale horizontally with ease. It is also well-suited for applications that experience regular fluctuations in demand, such as seasonal variations. However, to maximize cost savings during off-peak hours, it is advisable to utilize HPA alongside a cluster autoscaler. This combination allows for efficient resource allocation and monetary optimization. (see image below)
On the other hand, VPA serves as the ideal solution for stateful applications, particularly those involving databases. VPA's functionality is relatively new in the Kubernetes ecosystem, and many companies opt to employ it in recommendation mode rather than auto-update mode. This approach enables organizations to gain insights into an application's resource utilization profile without making automatic adjustments.
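For readers who want to try recommendation mode, here is a minimal sketch of a VPA object with automatic updates turned off, again built as a Python dictionary and printed as JSON. It assumes the VPA custom resource definitions are installed in the cluster; the helper name `build_vpa_manifest` and the deployment name `orders-db` are hypothetical.

```python
import json

# Update modes this sketch accepts; "Off" only produces recommendations.
UPDATE_MODES = {"Off", "Initial", "Recreate", "Auto"}


def build_vpa_manifest(deployment_name, update_mode="Off"):
    """Return a VerticalPodAutoscaler manifest as a dict."""
    if update_mode not in UPDATE_MODES:
        raise ValueError(f"unsupported update mode: {update_mode}")
    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": f"{deployment_name}-vpa"},
        "spec": {
            # The workload whose pods VPA observes.
            "targetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment_name,
            },
            # "Off" means: compute recommendations, never change pods.
            "updatePolicy": {"updateMode": update_mode},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_vpa_manifest("orders-db"), indent=2))
```

With the mode set to "Off", the recommendations should appear in the object's status (for example via `kubectl describe vpa`), so they can be reviewed before anyone applies them by hand.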
By discerning the specific characteristics and requirements of your applications, you can determine whether HPA or VPA is the more suitable choice, ensuring optimal scalability and resource management.
While HPA and VPA may appear to be promising solutions, the reality is that only a limited number of companies are utilizing them in their production environments. This is primarily due to various limitations associated with these Kubernetes autoscalers.
One significant limitation is the considerable overhead involved in configuring HPA and VPA correctly. This process entails identifying the appropriate autoscaling metrics from a plethora of options. Additionally, benchmarking and profiling are necessary to determine the optimal configuration values for your code. Fine-tuning the metric server refresh intervals, HPA refresh intervals, and VPA refresh intervals is also essential. Moreover, careful consideration must be given to cluster capacity, node architecture, and size. All of this work needs to be repeated for each iteration of the code, which places a significant burden on administrators and users.
Another issue arises if something goes wrong during scaling, such as an increase in error rates. Unfortunately, HPA and VPA lack reinforcement learning capabilities, meaning they will continue to scale up or down even in the presence of increased error rates.
Furthermore, most companies are aware of their application's seasonality and desire proactive scaling to optimize efficiency. However, Kubernetes does not currently offer predictive scaling capabilities, leaving companies without a means to anticipate workload fluctuations.
It is worth noting that HPA and VPA should not be used together on the same CPU or memory metrics, as they may conflict with each other's scaling behavior. Additionally, there are specific limitations for each autoscaler. VPA necessitates the presence of at least two healthy pod replicas, while HPA cannot scale down to zero. Moreover, VPA requires a minimum memory allocation of 250 MB by default, making it unsuitable for small applications. Furthermore, VPA cannot be used for standalone pods that lack an owner, that is, pods deployed directly rather than as part of a deployment or a stateful set. (see image below)
Like HPA and VPA, cluster autoscaler also has its share of limitations. Firstly, there is limited support for the open-source version of cluster autoscaler, as it does not cover regional instance groups. Additionally, there are restrictions on the maximum number of nodes it can handle concurrently. Currently, it can handle up to a thousand nodes, with each node accommodating a maximum of 30 pods. Moreover, the cooldown period for scale-down is relatively long, set at 10 minutes. When associated with a pod node selector, scaling up is not performed. Scaling down is restricted to a maximum of 10 nodes at a time, and the CPU usage must be at least 20% for scaling down to occur.
Furthermore, there are challenges related to pods (as shown in the image below). If a pod is covered by a pod disruption budget or uses local storage, the cluster autoscaler will not scale down the node that hosts it.
In summary, scaling is crucial for optimizing performance and cost, and Kubernetes offers a solid foundation for it. However, there is significant overhead for administrators in terms of choosing autoscaling metrics, architecting cluster capacity, and benchmarking applications for configuration decisions. This overhead is repeated for each iteration of the code, posing a considerable burden. Therefore, there is a need for a truly autonomous system that can address these challenges.
In the upcoming articles, we will be talking more about Kubernetes autoscaling, cluster autoscaling, scaling tools, and event-driven autoscaling. To get a preview, check out the autocon/22 video here.
Q: How do you configure HPA and VPA for stateful workloads?
A: It is recommended not to use HPA for stateful workloads. As for VPA, the configuration process is similar to other resources. However, it requires thorough baselining, including benchmarking the application, determining the right configuration, fine-tuning metric server refresh intervals, and optimizing node instance types and cluster capacity.
Q: What is KEDA and how does it relate to Kubernetes autoscaling?
A: KEDA (Kubernetes Event-driven Autoscaling) is a prominent tool for event-driven scaling. With KEDA, you can trigger scaling and manage HPAs. It supports multiple event sources such as SQS queues, Lambda events, and more. Based on these events, you can trigger HPA or create other scaling objects, even scaling from zero.
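To give a feel for what this looks like in practice, the sketch below builds a KEDA `ScaledObject` that scales a deployment on SQS queue depth. It assumes KEDA is installed in the cluster and that authentication to AWS is configured separately (omitted here). The helper name, the deployment name, the queue URL, the region, and the queue-length target are all placeholders, and the field names follow my understanding of the KEDA SQS scaler, so check them against the KEDA documentation for your version.

```python
import json


def build_sqs_scaled_object(deployment_name, queue_url, aws_region,
                            queue_length=5, min_replicas=0, max_replicas=20):
    """Return a KEDA ScaledObject manifest (as a dict) driven by an SQS queue."""
    if not 0 <= min_replicas <= max_replicas:
        raise ValueError("need 0 <= min_replicas <= max_replicas")
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{deployment_name}-scaler"},
        "spec": {
            "scaleTargetRef": {"name": deployment_name},
            # A minimum of 0 lets the workload scale to zero when idle.
            "minReplicaCount": min_replicas,
            "maxReplicaCount": max_replicas,
            "triggers": [
                {
                    "type": "aws-sqs-queue",
                    # Trigger metadata values are passed as strings.
                    "metadata": {
                        "queueURL": queue_url,
                        "queueLength": str(queue_length),
                        "awsRegion": aws_region,
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = build_sqs_scaled_object(
        "order-worker",  # hypothetical deployment
        "https://sqs.us-east-1.amazonaws.com/123456789012/orders",  # placeholder
        "us-east-1",
    )
    print(json.dumps(manifest, indent=2))
```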
Q: Apart from scaling, what other factors should operators consider to manage performance for Kubernetes deployments?
A: Setting appropriate CPU and memory requests and limits is crucial. For example, if you set a low request and a much higher limit for a batch job, the scheduler places the pod based on the request, so the node may not be able to provide the extra resources when the workload demands more. For CPU, this typically shows up as throttling; for memory, the pod can be terminated. To avoid this, you need to ensure you set the right requests and limits for your containers.
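The gap between requests and limits can be checked with simple arithmetic. The sketch below sums memory requests and limits for the containers on one node and reports whether the limits add up to more than the node can actually provide. The `ContainerMemory` type, the `memory_commitment` function, and all the numbers are hypothetical; real values would come from your pod specs and the node's allocatable capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerMemory:
    name: str
    request_mib: int
    limit_mib: int


def memory_commitment(containers, node_allocatable_mib):
    """Compare summed requests and limits against a node's allocatable memory."""
    if node_allocatable_mib <= 0:
        raise ValueError("node_allocatable_mib must be positive")
    requested = sum(c.request_mib for c in containers)
    limited = sum(c.limit_mib for c in containers)
    return {
        # Scheduling is based on requests, so this ratio decides placement.
        "requested_ratio": requested / node_allocatable_mib,
        # Limits above 1.0 mean the containers cannot all burst at once.
        "limit_ratio": limited / node_allocatable_mib,
        "overcommitted": limited > node_allocatable_mib,
    }


if __name__ == "__main__":
    containers = [
        ContainerMemory("batch-a", request_mib=512, limit_mib=2048),
        ContainerMemory("batch-b", request_mib=512, limit_mib=2048),
    ]
    print(memory_commitment(containers, node_allocatable_mib=3072))
```

In this example both containers fit comfortably by request (about a third of the node), yet their combined limits exceed the node's memory, which is the situation described in the answer above.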