Autonomous Optimization of Amazon ECS at KnowBe4

Published on
Last updated on

April 14, 2024


Summary

  • KnowBe4 is the leading provider of security-awareness training and simulated phishing platforms used by over 34,000 organizations globally.
  • KnowBe4 faced an optimization challenge with their Amazon Elastic Container Service (ECS) services, leading them to adopt Sedai's autonomous optimization to reduce toil for engineers and improve efficiency.
  • KnowBe4 implemented a three-part approach (Crawl, Walk, Run) to gradually adopt autonomous optimization, resulting in significant cost savings and performance gains.
  • KnowBe4's autonomous journey has led to 98% of their services running autonomously, with a 27% cost reduction and over 1,100 autonomous actions in the past 3 months.

Introduction 

This article covers KnowBe4’s experience applying autonomous optimization (both cost optimization and performance optimization). It is based on part of the presentation “Mastering Autonomous Optimization for Amazon ECS” at autocon. The KnowBe4 portion was presented by Nate Singletary, Senior SRE at KnowBe4. You can see the full video here, and read the blog covering the Amazon ECS optimization strategies behind KnowBe4’s case in more detail here.

About KnowBe4

KnowBe4 provides the world's largest security-awareness training and simulated phishing platform. KnowBe4 is used by over 34,000 organizations globally and has the world's largest library of security-awareness training content. KnowBe4 ranks at the top of many "great places to work" lists, and was ranked number one in Energage's top workplaces in the USA.

KnowBe4’s Workloads

KnowBe4’s Amazon ECS workloads support a diverse suite of products ranging from security awareness tools to human detection response and more. KnowBe4 also integrates externally with their customers’ security stack to provide real-time coaching to their users in response to risky behavior. These products are spread across thousands of microservices, functions, and data stores, all deployed in AWS across several compute and data storage services, including Amazon ECS, as shown below:

KnowBe4's Tech Stack

KnowBe4’s platform architecture is straightforward. They commit all their code in GitLab, and their CI/CD runs in GitLab on GitLab runners. Their production workloads are deployed on AWS. Their monitoring, metrics, and alerting are in Datadog (which includes logs and metrics exported from Amazon CloudWatch). They don't use a wide range of vendors, and onboarding a new vendor is a rare event for the company.

KnowBe4’s ECS Optimization Challenge

KnowBe4 saw that they had an optimization “void” after their commit, deploy, and monitoring workflow:

KnowBe4 had ECS services running in AWS and wanted to ensure they were running efficiently. The challenge was knowing whether they were in fact running efficiently and, if they were not, how to react and fix the issue.

KnowBe4 was using engineers to fill that void. The engineers had to respond to feedback from the monitoring system, including:

  • a service may be running low on memory or CPU
  • a service may be peaking from a traffic perspective and experiencing performance issues
  • a service may be running too rich and be overprovisioned
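
The three kinds of feedback above can be sketched as a simple classification over utilization metrics. This is a minimal illustration only: the thresholds, the `ServiceUsage` type, and the finding names are hypothetical and are not taken from KnowBe4's monitoring setup or from Sedai.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
LOW_HEADROOM = 0.85      # peak utilization at or above this suggests too little CPU or memory
OVERPROVISIONED = 0.30   # peak utilization at or below this on both axes suggests waste


@dataclass
class ServiceUsage:
    name: str
    peak_cpu: float      # peak CPU use as a fraction of the reserved CPU
    peak_memory: float   # peak memory use as a fraction of the reserved memory


def classify(usage: ServiceUsage) -> list[str]:
    """Return the findings an engineer would otherwise read off a dashboard."""
    findings = []
    if usage.peak_cpu >= LOW_HEADROOM:
        findings.append("cpu_too_low")
    if usage.peak_memory >= LOW_HEADROOM:
        findings.append("memory_too_low")
    if usage.peak_cpu <= OVERPROVISIONED and usage.peak_memory <= OVERPROVISIONED:
        findings.append("overprovisioned")
    return findings or ["ok"]
```

Even with a check like this in place, someone still has to act on each finding, which is the manual loop described next.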

These issues could be impacting customer experience, or could mean KnowBe4 was missing out on hundreds of thousands of dollars of cost savings across several services.

So the question was how KnowBe4 should fix it. The KnowBe4 engineers had to commit code to update the ECS config, deploy it, and then wait for feedback to see whether they had rightsized correctly. This was a continuous process across many services - in fact, thousands of microservices and functions. The manual process and feedback cycle was not ideal.

KnowBe4 Uses Sedai to Fill the ECS Optimization Void

KnowBe4 decided to use Sedai’s autonomous optimization to fill this void. Sedai allowed them to reduce the toil on their engineers and automate the feedback loop of checking what impact a production change had on critical metrics.

The key drivers for KnowBe4 to move to an autonomous platform architecture were three-fold: 

  • Reducing complexity in managing cloud and reducing toil for their engineers
  • Keeping their cloud efficient on a continuous basis
  • Managing availability and performance at the highest levels for their customers

Reducing toil for KnowBe4’s engineers would also allow engineers to focus on the things they like to do - releasing new products and features.

KnowBe4 also wanted to make sure their workloads were running efficiently. That meant keeping up release velocity and keeping cost front of mind, while also ensuring their services were performant.

To achieve this, KnowBe4:

  • Ran their containerized workloads on ECS Fargate, so KnowBe4 doesn't need to worry about managing the cluster or the underlying host
  • Allowed Sedai to autonomously rightsize their services and adjust their auto scaling
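
To make "rightsizing" on Fargate concrete: a Fargate task can only use certain CPU and memory combinations, so any rightsizing step has to snap a desired size to an allowed one. The sketch below shows that step in isolation. It is not Sedai's algorithm; the table is an assumed subset of the Fargate task sizes documented by AWS (verify against the current AWS documentation before relying on it), and `smallest_fitting_size` is a hypothetical helper.

```python
from __future__ import annotations

# (cpu_units, allowed memory values in MiB), ordered from smallest to largest.
# Assumption: a subset of the Fargate task size table; larger sizes are omitted.
FARGATE_SIZES = [
    (256, [512, 1024, 2048]),
    (512, list(range(1024, 4096 + 1, 1024))),
    (1024, list(range(2048, 8192 + 1, 1024))),
    (2048, list(range(4096, 16384 + 1, 1024))),
    (4096, list(range(8192, 30720 + 1, 1024))),
]


def smallest_fitting_size(cpu_needed: int, memory_needed_mib: int) -> tuple[int, int] | None:
    """Return the first (cpu, memory) combination that covers both requirements."""
    for cpu, memories in FARGATE_SIZES:
        if cpu < cpu_needed:
            continue  # this CPU tier is too small
        for memory in memories:
            if memory >= memory_needed_mib:
                return cpu, memory
    return None  # nothing in this (partial) table is large enough
```

For example, a service that needs 256 CPU units but 3,000 MiB of memory does not fit the 256-unit tier in this table, so the helper moves up to `(512, 3072)`.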

How KnowBe4 Adopted Autonomous Optimization

To adopt autonomous optimization with Sedai, KnowBe4 took a three-part Crawl, Walk, Run approach as shown below:

Crawl 

In the first stage, Crawl, KnowBe4 set up the Sedai integration. They set an initial goal of achieving around 10% cost reduction.

At this stage, the integration allowed Sedai to analyze KnowBe4’s workloads, see where KnowBe4 might be overprovisioned, and identify the opportunities for cost reduction or performance gains.

KnowBe4 then enabled autonomous optimization on a set of services. They were not “diving off the deep end” at that stage, as these were low-risk services. KnowBe4’s goal with these services was to see how they reacted to the autonomous optimizations.

Walk

In the Walk stage, KnowBe4 had seen some evaluations. These showed opportunities for significant cost reduction and significant performance gains. KnowBe4 had also seen realized cost reduction and performance gains in the set of low-risk services they had enabled.

Impressed by these results, KnowBe4 decided to roll out Sedai more aggressively and turn on autonomous optimization for their flagship products.

Before turning on autonomous optimization, KnowBe4 created groups. KnowBe4 divided these groups by product and region and set cost and performance goals tailored to each product. Some products have services that are more latency tolerant, and KnowBe4 set more aggressive cost reduction goals for those services. Other services need to maintain the highest levels of availability and performance, and KnowBe4 is less aggressive with these. Once the groups were set up and the goals defined, KnowBe4 turned on autonomous optimization for them.
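
One way to picture the grouping is as a small table of per-group goals keyed by product and region. The sketch below is purely illustrative: the `OptimizationGroup` type, its field names, the product names, and the target values are all hypothetical and do not reflect Sedai's configuration format or KnowBe4's actual goals.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class OptimizationGroup:
    product: str
    region: str
    cost_reduction_target: float   # e.g. 0.25 means "aim for 25% lower cost"
    max_latency_increase: float    # tolerated latency regression, as a fraction


# Hypothetical groups: a latency-tolerant product gets a more aggressive cost goal.
GROUPS = [
    OptimizationGroup("product-a", "us-east-1", cost_reduction_target=0.25, max_latency_increase=0.10),
    OptimizationGroup("product-b", "us-east-1", cost_reduction_target=0.05, max_latency_increase=0.0),
]


def goal_for(groups: list[OptimizationGroup], product: str, region: str) -> OptimizationGroup | None:
    """Look up the goals that apply to a service's product and region."""
    for group in groups:
        if group.product == product and group.region == region:
            return group
    return None
```

Keeping the goals per group rather than per service means a new service inherits sensible limits from its product and region.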

Run

In the Run stage, KnowBe4 allowed Sedai to “take the wheel”. Services are autonomously optimized by default: if an engineer releases a service in ECS, it is automatically managed by Sedai.

Sedai is integrated across all of KnowBe4’s AWS accounts and manages services across all regions.

KnowBe4 is now working towards integrating Sedai into the CI/CD flow so that they will have a fully automated feedback loop.

Sedai has now filled KnowBe4’s optimization void.

Realized Savings and Highlights at KnowBe4

Below is an example of the opportunities KnowBe4 has seen in Sedai across a group of accounts. Sedai is projecting a 27% cost saving, or over $400,000 in cloud spend, which KnowBe4 considers a significant reduction in cloud cost.
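
As a rough sanity check, the two figures quoted for this group of accounts imply a baseline spend. The calculation below is back-of-the-envelope arithmetic, not a number reported by KnowBe4 or Sedai; it treats "over $400,000" as a lower bound and assumes both figures cover the same period, which the source does not state.

```python
projected_saving_fraction = 0.27   # 27% projected saving
projected_saving_usd = 400_000     # "over $400,000", so this is a lower bound

# saving = fraction * baseline  =>  baseline = saving / fraction
implied_baseline_usd = projected_saving_usd / projected_saving_fraction

print(f"Implied baseline spend: at least ${implied_baseline_usd:,.0f}")
```

Under these assumptions, the spend being analyzed for this group of accounts is on the order of $1.5 million.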

Below is an example of an individual cluster with a 36% potential saving.

Below is another example showing some of the realized savings at KnowBe4. This shows some of the Lambda services where KnowBe4 has not only reduced cost by 30%, but also reduced duration and increased performance by 86%.

Highlights from KnowBe4’s autonomous journey

KnowBe4’s highlights now include:

  • 98% of KnowBe4’s 9,491 services now run autonomously
  • 1,100+ autonomous actions in the past 3 months
  • 27% cost reduction, with 10% realized by August 2023

KnowBe4 is now integrating Sedai back into their CI/CD processes. Once that is complete, they will have a fully autonomous workflow, and their infrastructure as code (IaC) will remain the source of truth for KnowBe4’s configs.
