Cutting eCommerce Latency with Autonomous Optimization

Published on
Last updated on

June 17, 2024


Introduction

fabric is a modular, headless platform tailored for modern eCommerce applications, with a primary focus on Software-as-a-Service (SaaS) solutions. fabric's mission is to empower retailers to streamline their commerce operations across channels by creating custom functions from interconnected components. Rather than relying solely on prepackaged systems, fabric embraces a more flexible and adaptable approach to meet the evolving needs of its clients.

Prakash Muppirala, the head of Platform Solutions at fabric, spoke at the autocon conference about the problem fabric set out to solve and the paramount importance of performance in its operations. In the competitive world of eCommerce, every second counts, and delivering a seamless user experience is the key to success. Prakash also walked through the key challenges that stand in the way of that goal. With nearly a year of experience at fabric, he brings a wealth of knowledge and expertise to the table. You can watch the full video here.

The Impact of Latency and Performance Factors on eCommerce Success

You might have come across some of these numbers before when looking at the history of eCommerce. They are often cited around significant events such as Black Friday, Cyber Monday, or other peak seasons, by retailers such as Amazon and Walmart and by online traffic management platforms such as Akamai. Back in 2012, Amazon emphasized the importance of latency, stating that a 100-millisecond delay could significantly impact revenue. Initially, this may have been relevant only to Amazon, but it has since become a critical factor for all eCommerce businesses. With the COVID pandemic, the shift to online shopping, and the continuous growth of the online market, factors like velocity, availability, and scalability have become just as vital as raw performance.

Over the past decade, studies have consistently shown that the longer visitors wait for a website to load, the higher the bounce rate, which refers to the number of users leaving the site without taking any action. This directly affects conversion rates, brand perception, and revenue. Amazon has set the standards for rapid site delivery, especially when it comes to dynamic content and calling multiple services. Most of the top 20 retailers have adopted some form of microservices or high-availability environments across various channels to meet customer expectations.

Having a high level of autonomy is crucial when considering how all these services come together. The online experience has undergone significant changes in the last decade, largely influenced by Amazon's leadership in customer experience, navigation, pricing, shipping, and more. As Amazon paved the way for themselves and their marketplace merchants, other retailers had to catch up and find solutions. They not only had to address their business and functional problems but also tackle the operational and infrastructure challenges. Catching up with Amazon, who is also a leader in the cloud operating environment with AWS, proved to be extremely challenging for retailers.

Delivering Customer Satisfaction and Managing a Complex Infrastructure in eCommerce

In eCommerce, retailers rely heavily on their commerce platform: it is their secret sauce, their bread and butter, and fabric empowers them to operate successfully. fabric's key customers include renowned names such as McDonald's, GNC, Honest Company, NBF, and several others. It is crucial for fabric to establish and maintain the trust of these customers, especially considering the scale at which they operate. As noted in previous articles, customer satisfaction is of utmost importance to fabric. Highly satisfied customers, as shown on the slide, not only foster loyalty but also attract new customers to the platform.

In their present operating environment, they manage a substantial number of services, ranging from 300 to 500. When considering all of their customers collectively, this number reaches close to 10,000. The application code is written in various languages, including Java, Python, and JavaScript. They primarily operate in a serverless environment on AWS, although they are gradually transitioning towards ECS (Elastic Container Service) and EKS (Elastic Kubernetes Service) to achieve a hybrid infrastructure. Currently, their latency, specifically the 90th percentile (P90), stands at over 400 milliseconds.
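As a rough illustration of what a P90 figure means, the sketch below computes a nearest-rank percentile over a list of request latencies. The `percentile` helper and the sample latencies are hypothetical and are not taken from fabric's systems; monitoring tools may use slightly different interpolation rules.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies in milliseconds (illustrative only).
latencies_ms = [120, 135, 150, 180, 210, 260, 320, 390, 410, 950]
p90 = percentile(latencies_ms, 90)
print(f"P90 latency: {p90} ms")
```

A P90 of just over 400 ms means that nine out of ten requests finish within that time, while the slowest tenth take longer, which is why a single slow outlier (950 ms here) does not move the figure.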

Optimizing Funnel Performance and Operational Scalability for eCommerce Success

Over the past 24 months, they have experienced rapid growth as they onboarded more customers onto their platform. As you may be aware, in the eCommerce space, customers typically navigate through a series of steps on a website, starting from the homepage and moving through the search results page, product detail page, cart, checkout, and post-transaction management. When examining the typical pattern of an operating environment for such a process, they observe that around 80% of the traffic focuses on the top of the funnel, encompassing activities from the homepage to the cart page. This is where a lot of user activity and traffic occur, involving heavy read-only operations and frequent access to the same set of data, such as product information, inventory, and pricing.

On the other hand, the bottom of the funnel, which includes the shopping cart, checkout, and post-transaction management, needs to be highly performant. While the traffic volume is relatively low, the responsiveness of these processes is critical as it directly impacts conversion rates. During the initial launch of some of their customers, they noticed latency issues affecting shopping cart and checkout performance, which in turn affected conversion rates. Additionally, the post-transaction experience involves the orchestration of multiple microservices, including calls to product information, pricing, inventory, customer data, coupons, third-party services, loyalty management, tax, payments, shipping, address certification, and more. These services may be hosted externally, requiring effective coordination for optimal performance.

To ensure a highly performant shopping cart, it needs to be resilient and provide a response time of around 500 milliseconds to one second. Amazon, for instance, has taken this a step further with their one-click checkout feature, which seamlessly performs all the necessary actions in the background to enable a swift order placement. This level of functionality has become the norm in the industry. Now, the challenge lies in providing the same level of performance and functionality for every retailer.

While functional orchestration is essential for delivering a seamless customer experience, they recognize that the operational aspect is equally important. As a company, they need to ensure that their operating model is not only their core competence but also scalable enough to meet the needs and demands of their customers and retailers. Their zero-tolerance operating model puts the customer's needs first.

Key Challenges

When optimizing these services in real-time across a vast network of 10,000 services, a trade-off between cost and latency becomes crucial. While it is technically possible to keep all services running continuously for each customer, doing so would significantly impact costs. Therefore, striking a balance between cost efficiency and maintaining acceptable latency levels is important.

Another factor to consider is the dynamic nature of supply and demand in the retail environment. Seasonal fluctuations, such as back-to-school promotions or the high-volume demand during the Q4 period, including Black Friday, can significantly affect the workload on the system. Managing this fluctuating demand efficiently is essential to ensure optimal performance and customer satisfaction.

Lastly, ensuring high availability and handling timeouts is critical. Even in a robust operating environment, there will always be instances where certain applications or services degrade or time out. The ability to handle these situations gracefully and keep the core functionality running smoothly is vital. A relevant example is managing daily deals, which often involve high transaction volumes and require ensuring that inventory is available for customers to purchase. Mapping out and addressing such scenarios effectively is equally important.
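One common way to handle a degraded dependency gracefully is to bound each downstream call with a timeout and fall back to a safe default. The sketch below is a minimal illustration using Python's `asyncio`; the service functions, timeout values, and fallback payloads are all hypothetical and do not describe fabric's or Sedai's implementation.

```python
import asyncio


async def call_with_fallback(coro, timeout_s, fallback):
    """Await coro; return fallback if it does not finish within timeout_s."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


async def fetch_inventory(sku):
    # Hypothetical healthy downstream service.
    await asyncio.sleep(0.01)
    return {"sku": sku, "in_stock": True}


async def fetch_loyalty_points(customer_id):
    # Hypothetical degraded downstream service that responds too slowly.
    await asyncio.sleep(1.0)
    return {"customer_id": customer_id, "points": 1200}


async def build_cart_view(sku, customer_id):
    # Call both services concurrently; a slow loyalty lookup must not block the cart.
    inventory, loyalty = await asyncio.gather(
        call_with_fallback(fetch_inventory(sku), 0.2, None),
        call_with_fallback(
            fetch_loyalty_points(customer_id),
            0.2,
            {"customer_id": customer_id, "points": None},
        ),
    )
    return {"inventory": inventory, "loyalty": loyalty}


cart_view = asyncio.run(build_cart_view("SKU-1", "C-42"))
print(cart_view)
```

In this sketch the cart still renders with inventory data while the loyalty section is left empty, so the core purchase path keeps working when a non-critical service times out.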

When optimizing for latency versus cost, it is essential to find a happy medium that keeps latency low while keeping costs, specifically serverless costs, under control. Looking at typical eCommerce traffic patterns, they observed that the majority of traffic occurs within a specific timeframe. For example, in North America, around 12 to 15 hours of the day carry the highest traffic volume, while the rest of the day sees minimal traffic. These peak hours account for approximately 90% of total traffic for most retailers, with traffic dropping off significantly during off-hours.

To strike a balance between cost and speed, it becomes crucial to leverage a serverless environment during these peak hours, optimizing for high-volume periods. By doing so, they can efficiently manage costs while ensuring optimal performance for their customers.
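To make the peak-hours idea concrete, the sketch below finds the shortest contiguous block of hours that covers a given share of daily traffic. The function name and the hourly request counts are hypothetical; the numbers are chosen only so that a 14-hour window carries a little over 90% of requests, in line with the pattern described above.

```python
def smallest_peak_window(hourly_requests, share=0.9):
    """Return (start_hour, length) of the shortest contiguous window, allowed
    to wrap past midnight, that covers at least `share` of total requests."""
    n = len(hourly_requests)
    total = sum(hourly_requests)
    if total <= 0:
        raise ValueError("hourly_requests must contain some traffic")
    target = share * total
    for length in range(1, n + 1):
        for start in range(n):
            covered = sum(hourly_requests[(start + i) % n] for i in range(length))
            if covered >= target:
                return start, length
    return 0, n  # the full day always covers the total


# Hypothetical profile: 10 quiet hours, then 14 busy hours (illustrative only).
hourly = [10] * 10 + [100] * 14
start_hour, window_hours = smallest_peak_window(hourly)
print(f"Peak window starts at hour {start_hour} and lasts {window_hours} hours")
```

A window like this could inform when extra capacity is kept warm and when it is released, which is the cost-versus-speed balance the talk describes.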

Another crucial aspect is matching supply to demand. Considering the variability in traffic, whether it's based on the time of day or the seasonal fluctuations throughout the year, it becomes vital to avoid cold starts and ensure the capacity to handle high-volume periods. Additionally, accurately predicting the expected volume and appropriately sizing the infrastructure to meet that demand is equally important.
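A simple way to reason about sizing is Little's law, which says the average number of requests in flight equals the arrival rate multiplied by the average time each request takes. The sketch below applies it with a safety margin; the traffic figures and the headroom percentage are hypothetical, and the 400 ms duration is used only as an illustrative average.

```python
def required_concurrency(requests_per_second, avg_duration_ms, headroom_pct=20):
    """Estimate concurrent executions needed, using Little's law
    (in-flight requests = arrival rate x average duration) plus headroom.
    Integer arithmetic keeps the ceiling exact."""
    numerator = requests_per_second * avg_duration_ms * (100 + headroom_pct)
    denominator = 1000 * 100  # ms -> s, percent -> fraction
    return -(-numerator // denominator)  # ceiling division


# Hypothetical traffic levels (illustrative only).
peak = required_concurrency(500, 400)     # 500 req/s at 400 ms each
off_peak = required_concurrency(20, 400)  # 20 req/s at 400 ms each
print(f"peak: {peak} concurrent executions, off-peak: {off_peak}")
```

The gap between the two estimates shows why capacity sized for peak traffic is wasteful off-peak, and why capacity sized for off-peak traffic leads to cold starts when demand rises.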

In eCommerce environments, effectively managing the supply and demand dynamics is critical for maintaining optimal performance and delivering a seamless customer experience.

Comparing Options for Optimization

Let's explore potential solutions to address these challenges. They have mapped out several ideas in a matrix, considering different options for optimization. These options range from 100% manual approaches to semi-automated solutions and third-party semi-automation. Many technology companies initially embark on a "do it yourself" (DIY) approach, at least in the early stages of their journey. Alternatively, if autonomy is your core competence, you can choose to go fully autonomous with a DIY approach.

To evaluate these options, they consider factors such as velocity, safety, time to value, cost, focus on core competency, and supported platforms. As you can observe, as they shift from left to right on the matrix, there is a heavier emphasis on autonomy, as it has the potential to bring together all these factors effectively. This progression from manual to semi-automation and eventually to autonomy is a natural evolution for many startups, particularly in the technology sector. Typically, they start with manual processes, gradually automate certain aspects, involve third-party solutions to replace existing capabilities, and then build their own modules for the aspects they consider their "secret sauce."

Continuing on this journey allows companies to continually improve and optimize their operations while leveraging the strengths of both internal and external resources.

Unlocking Efficiency and Innovation through Autonomy in eCommerce Operations

When it comes to autonomy, it has consistently proven to be faster, better, and more cost-effective, especially for tasks that are outside of their core competence. In particular, if they consider the process from problem detection to recommending a solution, implementing the recovery, validating the solution, and finally launching it into production, autonomy plays a significant role. By embracing autonomy, they can streamline these processes and achieve notable efficiencies.

In a manual environment, executing these tasks can be time-consuming, whereas a semi-automated approach brings some savings. However, by embracing autonomy, they can truly focus on their core competence and tackle the real problem at hand. Autonomy allows them to leverage advanced technologies and automated systems to accelerate these processes, improve the quality of solutions, and achieve cost savings.

By prioritizing autonomy, they can ensure a faster, more efficient, and cost-effective approach that aligns with their core strengths and expertise.

The Journey of Autonomous

In August of last year, when they launched some of their initial customers, they observed a significant improvement in latency across a few specific services that they had prioritized. Instead of implementing these changes across their entire operating environment, they focused on the most critical aspect, which is shopping cart conversion. This included optimizing the latency of the shopping cart, the checkout process, and all the navigation steps leading up to order creation. They specifically targeted four or five services for improvement.

Upon implementing these optimizations, they witnessed a considerable reduction in latency and started contemplating their next steps. The email on the right side of the slide highlights their communication at that time, discussing the progress they had made and the improvements they had achieved.

This experience prompted them to further evaluate and refine their strategies, considering how they could replicate these latency reductions across other aspects of their operating environment to enhance the overall customer experience.

When they analyze the environment before and after making changes, they can observe the impact of optimizing the environment, particularly in reducing cold starts and improving performance. They focused on optimizing Lambdas that powered core capabilities, specifically related to first transaction management and quarter-to-quarter performance. Over the course of nearly a year, they have implemented various autonomous actions to enhance the entire commerce environment for multiple retailers that they have launched.

As a result of these efforts, they have experienced significant latency reductions. Their ability to detect issues, respond quickly, recover efficiently, and move forward has improved substantially. They have been able to achieve proportional reductions in latency within a matter of days. These advancements have enabled them to provide a more seamless and efficient experience for their customers and retailers alike.

Unlocking Efficiency and Continuous Improvement through Autonomy in Production Scaling

In terms of cost, particularly with regard to the shopping cart, one of Prakash's team members summed it up this way: they had observed latency issues before implementing Sedai and noticed a remarkable improvement once it was activated. This resulted in a smoother customer experience and increased retailer satisfaction. Importantly, these positive changes were achieved without any negative impact, and they potentially even saved on cloud spending.

The implementation of Sedai not only addressed the latency issues but also had a positive impact on both customer satisfaction and retailer satisfaction. Furthermore, it demonstrated cost-saving potential by optimizing cloud spending without compromising performance.

As they began scaling autonomously in production, they recognized the challenges of manual processes in addressing issues. Typically, resolving problems would involve extensive communication with retailers, identifying the specific issues impacting their search or checkout experiences, and engaging various stakeholders from application domain teams. These manual processes required significant time, effort, and coordination, with latency, cost, and time being key considerations.

With the implementation of the solution, autonomy brought numerous benefits. It allowed them to detect and recover from issues in a 24/7 model, leveraging automation to enhance throughput and availability for all applications. A notable advantage was the proactive prevention of timeouts by identifying potential issues beforehand. Moreover, the solution facilitated continuous learning, providing insights during runtime and enabling analysis of the impact of new code deployments on the operating environment. This feedback loop allowed them to swiftly understand any degradation and share insights with their developers, leading to the prevention of issues and the improvement of key metrics over time. The solution also let them run regression analysis on crucial factors, so developers could understand how feature deployments influenced the operating environment. This iterative process played a vital role in their ongoing improvement and success.
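The kind of release feedback described here can be pictured as a before-and-after comparison of a latency metric around a deployment. The sketch below flags a regression when the post-deployment P90 exceeds the pre-deployment P90 by more than a tolerance. The function names, the 10% tolerance, and the sample data are hypothetical, and this is not a description of how Sedai performs its analysis.

```python
import math


def p90(samples):
    """Nearest-rank 90th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    return ordered[math.ceil(90 * len(ordered) / 100) - 1]


def latency_regressed(before_ms, after_ms, tolerance_pct=10):
    """True if the post-deploy P90 exceeds the pre-deploy P90 by more than
    tolerance_pct percent."""
    return p90(after_ms) * 100 > p90(before_ms) * (100 + tolerance_pct)


# Hypothetical latency samples in milliseconds (illustrative only).
before = [200, 210, 220, 230, 240, 250, 260, 270, 400, 900]
after_bad = [200, 210, 220, 230, 240, 250, 260, 270, 480, 900]
after_ok = [200, 210, 220, 230, 240, 250, 260, 270, 420, 900]

print(latency_regressed(before, after_bad))  # P90 rose from 400 ms to 480 ms
print(latency_regressed(before, after_ok))   # P90 rose from 400 ms to 420 ms
```

A check like this, run after each release, gives developers a quick signal about whether a new feature degraded the operating environment.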

Feedback Loop

As mentioned earlier, the feedback loop is absolutely crucial for fabric. It plays a vital role in establishing a strong level of confidence among their developers, as well as their DevOps and SRE teams. By continuously improving their release confidence, they are able to accelerate innovation, prioritize their core competencies, and concentrate on developing applications that address key challenges. With the autonomous environment taking care of the more routine tasks, they can focus their efforts on creating commerce services that provide functional value to their customers, rather than being consumed by the intricacies of the operating environment. This approach enables them to deliver greater customer satisfaction and drive meaningful progress in their domain.

In general, they have observed improved efficiency over time. Initially, their SRE team consisted of two to three members. They have since transitioned to a hybrid model in which a small number of individuals can effectively handle a larger customer base. Despite significant growth over the past 24 months, they have kept a compact team handling the same operating environment. The primary focus of their SRE team has shifted towards managing the key applications, interconnectivity, and orchestration, rather than focusing solely on compute and other infrastructure aspects. They rely on experts and autonomous systems to handle operational management.

Driving Success through Autonomy Implementation: A Year of Achievements

Let's briefly highlight what occurred last year. When they brought Sedai on board, they initiated a pilot program that took only two weeks to complete. They observed some autonomous actions in their sandbox and pre-production environments, which paved the way for their transition to the production phase. Within a timeframe of two to four months, they successfully enabled end-to-end functionality for one of their customers, starting from post-transaction activities to a fully functional eCommerce environment. This progress extended to achieving 100% coverage for other customers as well, including more advanced features. Currently, they have over 10 accounts in production.

Their experience with embracing autonomy has revolved around key themes. Firstly, they have seen a significant reduction in latency, which is crucial for their operating environment and their customers. Secondly, by transitioning from manual processes to leveraging autonomy in a hybrid environment, they have increased their ability to serve a broader customer base. Lastly, autonomy has fostered innovation by providing valuable insights into their operating environment, helping them improve it. It has also facilitated smoother release processes for their developers and DevOps engineers. This example showcases how they have implemented autonomy within their company and environment over the past year or so.

Q&A

Q: You've shown several different advantages or benefits that you've reaped from going autonomous. In terms of really hard ROI, what benefits have you seen, and what advice might you give other folks here on looking at the hard ROI?

A: Prakash Muppirala: I think the most important benefit for us is to keep our customers happy. When you take that back into what ROI means specifically for autonomy, I think it's about keeping our operating environment up for our customers, because they rely on fabric to power their entire business. eCommerce is primarily their livelihood, I think. If they are online-only, it's 100% of their livelihood. If it's a retailer who has both retail stores and online, it is definitely a significant portion of their livelihood. Keeping that operating environment available, alerting when issues happen, and more importantly sizing for the seasonality that could come in is a big ROI by itself. The next step is really about understanding and optimizing our operating environment and helping our SRE team focus more on our key applications rather than the compute. That is another ROI, which can be surfaced in terms of the number of people and the ratio of customers to SRE engineers: the efficiency that you can bring in, from customer satisfaction to the internal operating environment as well.

Q: What do you really see as the role of the SRE when there is also an autonomous system in place? You hinted a little bit at that, but how do you think the job of the SRE team changes?

A: Prakash Muppirala: It's important to look at the SRE. There are always going to be playbooks. Even in a 100% autonomous world, as some of the previous sessions discussed, you have to look at what happens when something goes wrong and at the ability of the SRE to really investigate, because eCommerce is complicated when it comes to orchestrating a lot of services. Our SREs are as strong in functional and domain knowledge as our engineers. It's important to really understand the business problem and how to react quickly to it, versus focusing more on memory or the other aspects of the operating environment. It's about understanding how to recover the business rather than recovering the operations themselves, because we leave that to autonomy and the SREs focus more on the business problems.
