Introducing AI-Powered Automated Rightsizing for Azure VMs

Summary

Azure VMs are poorly utilized. A large-scale analysis of Azure virtual machines showed average CPU utilization of 8.2%. Many VMs are oversized, leading to unnecessary costs. Doubling utilization, which is often possible, roughly halves costs for the same workload.
Other current solutions like Azure Advisor often (1) require manual effort, (2) do not use the full set of golden metrics to optimize applications (e.g., using utilization metrics only and not considering latency and errors), and (3) do not perform safety checks to validate that the change can be made.
Sedai’s Azure VM optimization finds the lowest cost Azure VM type subject to performance and reliability requirements. It can execute the change for the user, performing a series of safety checks to safely implement the change.
Early users of Sedai's optimization technology have seen reductions in cloud costs without impacting application performance.
Sedai offers both agentless and agent-based deployment options with pricing based on usage levels.

Introduction

In environments where applications are not suitable for microservices architectures, rightsizing, and in particular vertical scaling, becomes a critical strategy to achieve cost-effective operations while meeting performance requirements. This approach involves choosing the right virtual machine type based on the CPU and memory resources required. Rightsizing is a known best practice for Azure VM cost optimization but is hard to implement in practice.

The Rightsizing Problem

Less than 20% of Azure VM capacity is used

Analysis of the most recently released Azure VM usage dataset (available from Microsoft’s GitHub account here) shows that Azure VM users had an average utilization of just 8.2%, with 72% of users having an average utilization below 20% (see distribution below). A common pattern in the dataset was the selection of a small number of powerful but oversized VMs.

Source: “Using Virtual Machine Size Recommendation Algorithms to Reduce Cloud Cost”, March 2023

Below is an example from a Sedai Azure customer environment of a heavily underutilized instance, with just a few percent of CPU being used across a two-day period:

This is a customer-facing metric. Azure and other cloud providers achieve higher rates internally because they operate shared environments (AWS reported 65% utilization a few years ago). The pricing for shared instances reflects this, with dedicated instances costing 250% or more.
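To make the link between utilization and cost concrete, here is a minimal sketch. It assumes a fixed workload and a bill that scales linearly with provisioned capacity, which is a simplification. The function name and the $1,000 baseline are illustrative and do not come from the dataset.

```python
def cost_at_target_utilization(current_cost: float,
                               current_util: float,
                               target_util: float) -> float:
    """Estimate the bill if the same workload ran at a higher utilization.

    Assumes cost is proportional to provisioned capacity, so the capacity
    needed shrinks by the ratio current_util / target_util.
    """
    if not 0 < current_util <= target_util <= 1:
        raise ValueError("expected 0 < current_util <= target_util <= 1")
    return current_cost * current_util / target_util


baseline = 1000.0  # hypothetical monthly spend in dollars
doubled = cost_at_target_utilization(baseline, 0.082, 0.164)
print(f"{doubled:.2f}")  # doubling utilization from 8.2% halves the estimate
```

Real savings depend on which VM sizes are actually available, so the result is an upper-bound style estimate rather than a quote.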

Causes of Overprovisioning for Azure VMs

Developer Bias to Overprovision

Effective vertical scaling is also important because developers often default to overprovisioning Azure VM resources, opting for a simpler and quicker setup rather than conducting extensive testing across multiple instance types. This approach, while expedient, typically results in selecting VM configurations that exceed the application's actual requirements, leading to increased costs. The reluctance to engage in detailed testing stems from the time and complexity involved in evaluating each instance type's performance under different workloads. Consequently, developers lean towards a 'better safe than sorry' strategy. Although this reduces the risk of underperformance, it inefficiently raises cloud costs.

Bursty Workloads

Many virtual machine workloads have bursty traffic patterns, especially small databases, development environments, and low-traffic websites. For example, in the dev/test workload below, CPU utilization stays around 2-4% but surges twice to the 13-15% range.

Given that warmup periods may range from a few minutes to a few hours, horizontal scaling may not be viable. Finding a suitable burstable instance type may be the preferred approach. Without one, average utilization will remain low.
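One simple way to flag a workload like this is to compare its peak CPU to its average. The sketch below is illustrative only: the sample values mimic the pattern described above, and the function name and the 3x threshold are assumptions rather than anything Sedai or Azure defines.

```python
from statistics import mean


def burst_profile(cpu_percent: list[float], burst_ratio: float = 3.0) -> dict:
    """Summarize a CPU series and flag it as bursty when the peak is
    at least `burst_ratio` times the average."""
    avg = mean(cpu_percent)
    peak = max(cpu_percent)
    return {"average": avg, "peak": peak, "bursty": peak >= burst_ratio * avg}


# Hypothetical samples: a 2-4% baseline with two short surges.
samples = [3, 2, 4, 3, 15, 3, 2, 4, 13, 3, 2, 3]
profile = burst_profile(samples)
print(profile)
```

A workload flagged this way is a candidate for a burstable instance type, since sizing for the peak leaves most capacity idle.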

App Architecture Limits Horizontal Scaling

A high proportion of applications running on Azure run directly on virtual machines. These applications have not been replatformed to a microservice architecture such as Kubernetes (including Azure Kubernetes Service (AKS)), or serverless frameworks (Azure Functions). One key reason is that many of these applications do not benefit significantly from the horizontal scaling capabilities offered by microservices architectures. They may have architectural or design constraints that make such a transition complex or suboptimal, limiting their ability to efficiently use newer computing paradigms.

Complex Set of VM Choices

Vertical scaling is also complicated by the many types of VMs available. There are currently over 400 types of Azure Virtual Machine options offering varying Compute, Memory and other characteristics. Below is a chart showing the density of options based on vCPU and Memory size:

Asking an engineer to make the optimal choice across potentially hundreds or thousands of services can be challenging, especially if new code updates change the service's characteristics and then require a new determination of the right instance type.
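At its core, this choice is a constrained search: find the cheapest VM type that still covers the observed CPU and memory needs. The sketch below uses a made-up four-entry catalog. The names and prices are placeholders, not Azure SKUs or prices, and the 20% headroom is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VmType:
    name: str
    vcpus: int
    memory_gib: float
    hourly_cost: float


# Hypothetical catalog for illustration only.
CATALOG = [
    VmType("small", 2, 8, 0.10),
    VmType("medium", 4, 16, 0.20),
    VmType("large", 8, 32, 0.40),
    VmType("memory-heavy", 4, 32, 0.26),
]


def cheapest_fit(catalog: list[VmType], peak_vcpus: float,
                 peak_memory_gib: float, headroom: float = 0.2) -> Optional[VmType]:
    """Return the lowest-cost type covering peak usage plus headroom,
    or None when nothing in the catalog is large enough."""
    need_cpu = peak_vcpus * (1 + headroom)
    need_mem = peak_memory_gib * (1 + headroom)
    candidates = [v for v in catalog
                  if v.vcpus >= need_cpu and v.memory_gib >= need_mem]
    return min(candidates, key=lambda v: v.hourly_cost, default=None)


choice = cheapest_fit(CATALOG, peak_vcpus=1.5, peak_memory_gib=20)
print(choice.name if choice else "no fit")
```

With hundreds of real VM types and requirements that shift after each code release, the same search has to be rerun continuously, which is the part that is hard to do by hand.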

Importance of Vertical Scaling for Azure VMs

Vertical scaling is particularly advantageous for applications that require high-performance levels from single instances or have dependencies that complicate distribution across multiple servers. By optimizing the configuration of Azure VMs to align closely with actual workload requirements, organizations can ensure that their applications perform optimally without incurring unnecessary costs from overprovisioning. This method allows for more precise control over resource allocation, leading to enhanced performance and reduced expenditures.

Rightsizing Azure Virtual Machines with Sedai

Key Capabilities

Sedai’s Automated Optimization utilizes advanced AI technology to deeply comprehend Azure VM configurations and their impact on application cost and performance. This results in Azure VMs that are optimally sized and configured to meet the specific needs of applications without incurring unnecessary costs or performance issues. Key benefits include:

Cloud Cost Efficiency: Azure VM costs can be reduced by up to 30% or more through optimized resource allocation.
Performance Improvement: Enhance customer-facing services with up to 25% better latency, ensuring a smoother user experience.
Reduced Operations Effort: Time to rightsize VMs is reduced by up to 90%.

Sedai’s Automated Optimization uses AI that understands Azure VM configurations and how they affect application cost and performance. The result is VMs that are sized and configured for the specific needs of their applications, without excess cost or underperformance.

How It Works

Our AI-driven platform continuously analyzes your Azure VMs to detect inefficiencies. It then autonomously implements optimizations, adjusting resources in real-time without requiring manual intervention.

The Sedai platform operates on a simple yet effective process: Discover, Recommend, Validate, Execute, and Learn:

Discover: Sedai first discovers your Azure VM infrastructure and application pattern, going through three steps:
- Identifying the app boundary by looking at traffic patterns (e.g., virtual machines that share a common load balancer) or at virtual machine tags. A set of virtual machines doing the same task and expected to behave similarly can be termed an application. This definition means that a collective action can be taken on all the instances of the app.
- Standardizing metrics for optimization. In a heterogeneous fleet, a service may use Node Exporter for Linux, or WMI Exporter or Windows exporter for Windows. It is important that the metrics are labeled correctly such that the system can precisely identify the metrics of a specific application.
- Identifying golden signals to drive optimization. Finding the right signal to listen to can seem like finding a needle in a haystack. Sedai will look for the best golden metrics (latency, error, saturation, and throughput of an application) so that this information can be used in algorithms and machine learning systems such that a recommendation can be generated.
Recommend: Sedai then recommends optimal settings based on deep insights into service behavior and dependencies. Recommendations may be reviewed and applied manually, or applied automatically, based on user settings.
Validate: Before any change is made, Sedai validates it through a sequence of checks so that it can be performed safely in the customer environment:
- Safety check: For each proposed action, Sedai asks whether it can be performed on that application without risk. If the signal is green, it moves on to the next check.
- Timing check: Sedai checks whether now is the right time to apply the action, or whether there is a later, preferred time to execute it on this application.
Execute: Once we have a go-ahead for these validation steps, Sedai goes ahead and performs the action.
Learn: After performing the action, we need to figure out if the app is healthy. Updates are also tracked with a full audit trail of changes made to the infrastructure. This step is also important because this allows us to close our learning loop and use this information for further actions.
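The Validate, Execute, and Learn steps above can be sketched as a small gate around each change. This is not Sedai's implementation. The `App` fields, the 1% error-rate threshold, and the change-window rule are all assumptions chosen to keep the example self-contained, and the "resize" is just an attribute update.

```python
from dataclasses import dataclass, field
from datetime import time


@dataclass
class App:
    name: str
    vm_type: str
    error_rate: float            # current fraction of failed requests
    window_start: time           # start of the preferred change window
    window_end: time             # end of the preferred change window
    audit_log: list = field(default_factory=list)


def safety_check(app: App, max_error_rate: float = 0.01) -> bool:
    """Hypothetical rule: only act on an app that is healthy right now."""
    return app.error_rate <= max_error_rate


def timing_check(app: App, now: time) -> bool:
    """Hypothetical rule: only act inside the app's change window."""
    return app.window_start <= now <= app.window_end


def apply_recommendation(app: App, new_type: str, now: time,
                         observed_error_rate: float) -> str:
    if not safety_check(app):
        return "skipped"
    if not timing_check(app, now):
        return "deferred"
    previous = app.vm_type
    app.vm_type = new_type                   # Execute (stand-in for a real resize)
    healthy = observed_error_rate <= 0.01    # Learn: check health afterwards
    app.audit_log.append((previous, new_type, healthy))
    return "applied"


app = App("billing", "large", 0.001, time(1, 0), time(5, 0))
print(apply_recommendation(app, "medium", time(12, 0), 0.002))  # outside the window
print(apply_recommendation(app, "medium", time(2, 30), 0.002))  # inside the window
print(app.audit_log)
```

Recording the health outcome next to each change is what lets later decisions draw on earlier ones.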

These capabilities form part of Sedai’s overall Azure VM optimization approach which can be seen below:

Some of the key elements above are:

Access to Cloud APIs, which allows Sedai to identify and discover the components of your infrastructure. Sedai’s inference engine uses this information to build a topology, and from the topology it deduces the application.
Metrics exporter & Sedai Core. Once the application is known, Sedai’s metrics exporter takes the data from all the monitoring providers. With the application and its metrics together, Sedai's machine learning algorithms can generate optimization and remediation opportunities.
Execution engine. The recommendations are passed to the execution engine, which is integrated with the platform so that it can use cloud APIs to perform the actions on the cloud resources.
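As a rough illustration of how an application can be deduced from discovered infrastructure, the sketch below groups VMs by an `app` tag when present and otherwise by shared load balancer. The inventory format and field names are hypothetical and do not reflect the output of any Azure API or Sedai's inference engine.

```python
from collections import defaultdict


def group_into_apps(vms: list[dict]) -> dict[str, list[str]]:
    """Group VM names into applications.

    Preference order (an assumption): explicit 'app' tag, then shared
    load balancer, then the VM on its own.
    """
    apps = defaultdict(list)
    for vm in vms:
        key = vm.get("tags", {}).get("app") or vm.get("load_balancer") or vm["name"]
        apps[key].append(vm["name"])
    return dict(apps)


# Hypothetical inventory, as if assembled from discovery.
inventory = [
    {"name": "web-1", "load_balancer": "lb-web", "tags": {}},
    {"name": "web-2", "load_balancer": "lb-web", "tags": {}},
    {"name": "batch-1", "load_balancer": None, "tags": {"app": "reports"}},
    {"name": "batch-2", "load_balancer": None, "tags": {"app": "reports"}},
    {"name": "db-1", "load_balancer": None, "tags": {}},
]
apps = group_into_apps(inventory)
print(apps)
```

Once VMs are grouped this way, metrics and actions can be handled per application rather than per machine.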

Example Azure Rightsizing Cost Savings

Early adopters have seen significant improvements in both performance and cost efficiency. For instance, a healthcare company identified over $250k of annual savings, a 28% saving, in its dev/test environments through rightsizing using Sedai’s optimization.

Below is an example of the safety check process being performed during a VM resizing. In this case it took 11 steps. Most completed quickly, but stopping and restarting the VM took around 30 seconds each:

To gain insight into the state of your VM fleet, you can scan it at a glance to see where applications are over- or underprovisioned, and which are already optimized, based on Sedai’s findings. In the example below, 61% of the apps have been optimized (shown as green).

Pricing and Availability

The service is available now, with flexible pricing based on the scale of your Azure VM deployment. Request a demo to see how Sedai can help you rightsize your Azure VMs.

John Jamie

Published on

May 7, 2024

Last updated on

November 21, 2024
