Amazon ECS Optimization Challenges 

Published on
Last updated on

May 13, 2024


Summary

  • Overprovisioning is a significant issue in ECS, with about 65% of containers wasting at least 50% of their CPU and memory resources, leading to nearly a billion containers' worth of compute resources being wasted weekly.
  • Teams often overprovision due to uncertainty about their needs, aiming to ensure availability and performance during peak times or special cases, despite this leading to unnecessary costs.
  • Effective management of ECS and cloud costs overall can significantly affect a company's financial performance. For example, improving cloud cost efficiency by reducing overprovisioning could potentially enhance a company’s bottom-line profit by 17%.
  • ECS optimization also impacts revenue through application performance and availability. A mere 100ms increase in latency can equate to a significant revenue loss, highlighting the critical nature of performance in ECS services.
  • Amazon ECS offers various controls for optimization, including service rightsizing and spot instance usage. These, along with continuous adjustments to respond to changing traffic and application demands, present ongoing challenges and require complex solutions involving both manual and automated strategies.

Cost Challenges

ECS Overprovisioning is a Major Problem

Every year Datadog publishes a report on container usage; we’ll focus on the November 2023 edition. Datadog reports that 65% of containers waste at least 50% of their CPU and memory, which suggests that roughly half of all container resources go unused. So if two billion containers are spun up each week, almost a billion containers' worth of compute resources is wasted, according to the report.

Similar reports from other providers point to the same staggering level of overprovisioning. Most report waste of at least 50% of CPU and memory.

So the industry has a major problem: nearly everybody overprovisions their compute.

Reasons for Overprovisioning in Amazon ECS

Why do teams overprovision? Most don't know what to provision for, so they overestimate their needs. The logic is: “Let me provision double or triple the capacity I think I need so that things are OK when I hit my seasonal peaks and my special use cases.”

The three most common situations we see driving overprovisioning are:

  1. Developers solve for availability and performance, not cost, when setting configuration, in order to reduce the risk of deployment failure. They may add extra CPU, memory, and replicas. Engineering teams focus on releasing features rather than on what runs in production, so they play it safe, building in a standard way and overprovisioning.
  2. Application owners also default to overprovisioning to reduce performance and availability risks for the services under their control.
  3. Organizations may use “standard” or “t-shirt” compute capacity sizes for applications that have different needs. For example, a compute profile of 4 CPU & 8 GB memory might be used across 100 services, while another profile of 2 CPU & 32 GB memory is used across another 50 services.

Managing Discounts

Below are the top priorities of FinOps professionals in the 2024 State of FinOps survey. Reducing overprovisioning is the #1 priority, closely followed by managing discounts.

FinOps Professionals Priorities

Successful management of discounts can deliver double-digit percentage reductions in spend, but it is difficult to get right because of the need to avoid overcommitting to overall spend amounts and to specific resource types.

Impact of Cloud Costs on Company Financial Performance

Overprovisioning and discount management matter not only to engineering budgets but also to company-wide financial performance. Below is a simplified Profit & Loss (P&L) statement for a public security SaaS company. For every $100 million in revenue, approximately 8.6% was spent on cloud costs, which is at the high end. If the company could save 20-30% of that spend, bringing cloud costs down to around 6% of revenue, its bottom-line profit would increase by 17%.
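The arithmetic behind the 17% figure can be sketched as follows. The baseline profit margin is not stated in the text, so this snippet backs it out from the stated numbers; the resulting margin (about 15% of revenue) is an inference, not a figure taken from the P&L. The function names are hypothetical.

```python
def profit_uplift(cloud_share_before, cloud_share_after, profit_margin):
    """Relative profit increase when cloud cost falls.

    All arguments are fractions of revenue; the saving is assumed to
    flow straight through to bottom-line profit.
    """
    savings = cloud_share_before - cloud_share_after
    return savings / profit_margin


def implied_profit_margin(cloud_share_before, cloud_share_after, uplift):
    """Baseline profit margin implied by a given relative profit uplift."""
    return (cloud_share_before - cloud_share_after) / uplift


before, after = 0.086, 0.06          # cloud cost as a share of revenue (from the text)
savings_share = before - after       # 2.6 points of revenue, i.e. $2.6M per $100M
relative_cut = savings_share / before  # about 0.30, the top of the 20-30% range

# Inferred, not stated in the article: the margin that makes the uplift 17%.
margin = implied_profit_margin(before, after, 0.17)

print(f"cloud spend reduced by {relative_cut:.0%}")
print(f"implied baseline profit margin: {margin:.1%}")
print(f"profit uplift: {profit_uplift(before, after, margin):.0%}")
```

Note that reaching 6% of revenue requires a cut of roughly 30%; a 20% cut would leave cloud costs nearer 6.9% of revenue and give a smaller uplift.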

Revenue Impact of Performance and Availability

In addition to cost impacts, the effectiveness of ECS optimization can also affect revenue via the performance and availability of applications, if the ECS services play an important role in the end customer experience. The need to avoid outages is widely understood, but performance slowdowns can have as much impact as major outages. In an ECS context, latency can be a silent killer if small delays across hundreds of microservices add up to a material impact on user experience.

The example below shows that for a business unit with $100M in annual revenue, a 100ms slowdown sustained over the course of a year has an impact equivalent to an extended 88-hour outage.

Relative Impacts of Outages and Slowdowns

This example assumes that a 100ms slowdown for users translates to 1% lost revenue, an assumption based on an early Amazon.com finding. That finding and a series of others are shown below:

Findings on Latency Impact on Traffic & Revenue

The overall importance and timeframe of the impact will vary by business (e.g., ecommerce can see immediate drops in revenue, while SaaS impacts are slower and tied to contract cycles).
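The 88-hour equivalence can be reproduced with a short calculation. This is a minimal sketch that assumes revenue accrues evenly across the year and that a full outage loses all revenue for its duration; both are simplifications, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def equivalent_outage_hours(revenue_loss_fraction, hours=HOURS_PER_YEAR):
    """Hours of total outage that lose as much revenue as a slowdown
    costing `revenue_loss_fraction` of revenue over the same period."""
    return revenue_loss_fraction * hours


def annual_revenue_lost(annual_revenue, revenue_loss_fraction):
    """Revenue lost over a year at a constant fractional loss."""
    return annual_revenue * revenue_loss_fraction


# From the text: a 100ms slowdown is assumed to cost 1% of revenue.
loss_fraction = 0.01
hours = equivalent_outage_hours(loss_fraction)
lost = annual_revenue_lost(100_000_000, loss_fraction)

print(f"equivalent outage: {hours:.1f} hours")  # 1% of 8,760 hours, about 88
print(f"revenue lost: ${lost:,.0f}")
```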

ECS Optimization Challenges

Let's jump into some of the challenges that ECS users face:

  • Multiple goals to be addressed
  • Many controls to be optimized
  • Constant change in inputs 

Multiple Goals

ECS users typically need to balance three goals:

  • Cost: ensuring that ECS services deliver their functionality at the required availability and performance at the lowest cost. Cost efficiency is driven by both engineering optimization (e.g., rightsizing instances) and financial optimization (e.g., savings plans).
  • Performance: ensuring that the application meets latency requirements, e.g., that page load times for a web application stay under a given threshold so that end users do not experience delays.
  • Availability: ensuring that requests to ECS services can be served by the application. Historically, time-based metrics (uptime) were used, but in a microservice environment request-based metrics such as FCI rate (Failed Customer Interaction rate) can be more effective.

Optimizing each service with respect to all three objectives can be challenging. In this guide we will look at using SLOs to break the problem down: SLOs let us frame it as minimizing cost, subject to meeting performance and availability needs. We’ll also look at whether workloads can tolerate lower availability thresholds, which can allow the use of lower-cost spot instances.
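Framing the problem as "minimize cost, subject to SLOs" can be illustrated with a small selection routine. This is a sketch under stated assumptions, not a description of any particular tool: the `Candidate` and `Slo` types, the candidate configurations, and all cost, latency, and FCI numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    monthly_cost: float    # USD (hypothetical)
    p95_latency_ms: float  # measured or estimated for this configuration
    fci_rate: float        # failed customer interactions / total interactions


@dataclass(frozen=True)
class Slo:
    max_p95_latency_ms: float
    max_fci_rate: float


def cheapest_meeting_slo(candidates: Sequence[Candidate], slo: Slo) -> Optional[Candidate]:
    """Return the lowest-cost candidate that satisfies both SLO limits,
    or None if no candidate is feasible."""
    feasible = [
        c for c in candidates
        if c.p95_latency_ms <= slo.max_p95_latency_ms
        and c.fci_rate <= slo.max_fci_rate
    ]
    return min(feasible, key=lambda c: c.monthly_cost, default=None)


# Hypothetical configurations for one service.
CANDIDATES = [
    Candidate("2 vCPU x 6 tasks", 520.0, 120.0, 0.0005),
    Candidate("1 vCPU x 6 tasks", 260.0, 180.0, 0.0008),
    Candidate("1 vCPU x 3 tasks", 130.0, 310.0, 0.0040),
]

best = cheapest_meeting_slo(CANDIDATES, Slo(max_p95_latency_ms=250.0, max_fci_rate=0.001))
print(best.name if best else "no feasible configuration")
```

The cheapest configuration overall is rejected here because it breaches both limits, so the SLOs act as constraints and cost is the only quantity being minimized.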

Multiple Controls for Amazon ECS Optimization

Amazon provides multiple controls to optimize Amazon ECS for cost and performance:

  • Service Rightsizing: You can rightsize your ECS services either vertically (increasing or decreasing the amount of CPU and memory for a given task or service) or horizontally (modifying the number of tasks running).
  • Instance Rightsizing: You can adjust your cluster by managing your container instances if you use an EC2-backed cluster. Key controls include the number of instances, instance types, and Auto Scaling groups (ASGs).
  • Purchasing Commitments (excluding spot): You can use the discounted purchasing commitments that AWS offers, including Savings Plans and Reserved Instances.
  • Spot Instances: Spot is one of several pricing models available for Amazon ECS. It is a good fit for fault-tolerant use cases, most commonly development or pre-production environments, where the impact is minimal even if the environment is down for a few minutes. Even in production, you may be able to use Spot for stateless services if the application is fault-tolerant. Spot options include EC2 Spot and Fargate Spot.

It can be challenging to configure all these controls to meet our goals and to keep adjusting them. Later in the guide we’ll go through solutions to this challenge, including manual, automated, and autonomous approaches.
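As a rough illustration of the service rightsizing control, the sketch below derives a vertical (CPU size) and a horizontal (task count) recommendation from observed usage. The function names, the 30% headroom, the two-task minimum, and the per-task throughput figure are all assumptions made for this example. The CPU sizes listed are the task-level values Fargate accepted at the time of writing and should be checked against current AWS documentation; memory, which must be paired with a valid CPU size, is left out for brevity.

```python
import math

# Fargate task-level CPU sizes in CPU units (1024 units = 1 vCPU).
# Assumption: verify against current AWS documentation before relying on it.
FARGATE_CPU_UNITS = (256, 512, 1024, 2048, 4096, 8192, 16384)


def recommend_cpu_units(requested_units, peak_utilization, headroom=0.3):
    """Vertical rightsizing: smallest listed size covering peak usage plus headroom.

    peak_utilization is the fraction of the requested CPU actually used
    at peak (for example, the p95 over a lookback window).
    """
    needed = requested_units * peak_utilization * (1 + headroom)
    for size in FARGATE_CPU_UNITS:
        if size >= needed:
            return size
    return FARGATE_CPU_UNITS[-1]  # cap at the largest listed size


def recommend_task_count(peak_rps, rps_per_task, min_tasks=2):
    """Horizontal rightsizing: tasks needed to serve peak traffic,
    never going below a floor kept for availability."""
    return max(min_tasks, math.ceil(peak_rps / rps_per_task))


# A task requesting 4 vCPU that peaks at 35% utilization.
print(recommend_cpu_units(4096, 0.35))  # needs about 1,864 units, so 2048
# 900 requests/s at peak, with each task assumed to handle 120 requests/s.
print(recommend_task_count(900, 120))   # ceil(7.5) = 8
```

A real recommendation would also need to account for memory, startup behavior, and how utilization shifts once the task is resized, which is part of why these controls are hard to tune by hand.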

Constantly Changing Inputs 

We also need to track how various settings on these controls perform as application inputs change. These inputs include:

  • Traffic: We need to look at the amount of inbound traffic and its seasonality (e.g., across the course of the day, on different days of the week, etc.).
  • Releases: New versions can change application cost and performance. For example, adding a more advanced recommendation function may slow performance unless more compute is added.

Later in the guide we’ll look at how adaptation is possible under manual, automated, and autonomous approaches.

Large Set of Metrics Needed

To relate these inputs and controls to our goals, we need to look at a series of metrics, including:

  • CPU utilization and memory utilization (which drive cost)
  • Performance or latency

We now have all these complex combinations to look at, even to optimize only one application to achieve the best resource utilization. Many organizations have thousands of applications to manage, so the complexity grows significantly. Managing 100 microservices that are released weekly involves analyzing 7,200 combinations of metrics as shown below. This includes six performance metrics, various traffic patterns, and four monthly releases. Optimizing each service is a complex task that requires careful analysis and monitoring to ensure smooth operation.

Now imagine the challenge of optimizing a fleet of 2,000 microservices rather than just 100. Continuously optimizing a system of that size is a daunting task, and doing it by hand every day is nearly impossible for any human. This highlights the challenges and complexities involved in ongoing cost and performance optimization.
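The growth in combinations can be made concrete with a short calculation. The text gives 100 services, six performance metrics, and four monthly releases, but does not state the number of traffic patterns; three is assumed here because it is the value that reproduces the 7,200 figure. The function name is hypothetical.

```python
def metric_combinations(services, metrics, traffic_patterns, releases_per_month):
    """Number of service/metric/traffic/release combinations to analyze per month."""
    return services * metrics * traffic_patterns * releases_per_month


METRICS = 6           # performance metrics (from the text)
TRAFFIC_PATTERNS = 3  # assumption: inferred so that 100 services give 7,200
RELEASES = 4          # weekly releases, i.e. four per month (from the text)

print(metric_combinations(100, METRICS, TRAFFIC_PATTERNS, RELEASES))    # 7200
print(metric_combinations(2_000, METRICS, TRAFFIC_PATTERNS, RELEASES))  # 144000
```

The count scales linearly with fleet size, so a 2,000-service fleet has twenty times as many combinations to review each month as a 100-service fleet.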
