Reducing Incidents with Autonomous Cloud Management: 7 Lessons from Autonomous Vehicles

Summary

There are parallels between autonomous vehicles and autonomous cloud management, as both use AI to reduce negative outcomes (accidents and incidents respectively).
Reducing human error is crucial in both domains, as human error is a significant factor in both vehicle accidents and cloud service disruptions.
AI plays a key role in enabling autonomous systems to detect, prevent, and avoid incidents in both autonomous vehicles and cloud management.
Incident reduction is a journey that involves incremental progress and the introduction of new capabilities to enhance system safety and reliability.
Phased capability rollouts are essential to minimize incidents, ensuring rigorous testing and validation before deployment.
Building user trust is a precursor for adoption, requiring transparency, communication, and a cautious rollout pattern to earn trust and acceptance.

Introduction

The rise of autonomous technologies is transforming multiple industries, from automotive to cloud computing. Autonomous vehicles (AVs) have been at the forefront of this revolution, showcasing how AI-based systems can enhance safety, efficiency, and reliability. Similarly, autonomous cloud management (ACM) is poised to transform the way cloud infrastructure is managed by minimizing human error and optimizing operations. This blog explores the parallels between AV development and ACM, highlighting key lessons that can guide the evolution of autonomous cloud management systems.

1. Reducing Human Error is a Key Goal

The Parallel: Human error is a significant factor in both vehicle accidents and cloud incidents, contributing to 67-95% of cases. Pure hardware or mechanical errors are not the main cause of incidents.

Autonomous Vehicles: Vehicle accident statistics tend to find that human error is a factor in around 90% of accidents (see IIHS-HLDI and IntechOpen). In the automotive industry, mechanical failures account for only about 5% of accidents (IIHS-HLDI). Major accident reductions are expected from tackling human error, with the AAA projecting that safety systems could prevent 37M crashes, assuming partial autonomy only.

Autonomous Cloud: Human error remains a significant factor in cloud service disruptions, with between 67% and 80% of all downtime incidents over the last 25 years attributed to human error (Uptime Institute), and 82% of cloud misconfigurations stemming from human error (SentinelOne). Hardware failures such as server failures account for a minority of incidents: upwards of 20% for on-premises datacenters (EMA), and 2% of cloud outages in an analysis by David Mytton.

Implications: Looking to reduce human error through the application of autonomous systems is crucial in both domains. Human error will always be a factor - even with full training and ideal conditions, human errors occur at a rate of around 1 in 10,000 (IntechOpen). The goal is to design systems and training approaches that minimize its impact.
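To make the 1 in 10,000 figure concrete, the short sketch below estimates how likely at least one error becomes as the number of manual actions grows. It assumes the rate applies per action and that actions are independent; both are simplifying assumptions for illustration and are not claims made by the cited source. The function name is hypothetical.

```python
def prob_at_least_one_error(per_action_rate: float, actions: int) -> float:
    """Chance of one or more errors across `actions` independent actions.

    Assumes every action carries the same error rate and that errors are
    independent of each other (a simplification for illustration).
    """
    return 1.0 - (1.0 - per_action_rate) ** actions


# Error rate quoted in the text: around 1 in 10,000.
RATE = 1 / 10_000

for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7} actions: {prob_at_least_one_error(RATE, n):.1%}")
```

Under these assumptions, a team that performs 10,000 manual changes has roughly a 63% chance of making at least one mistake, which is why reducing the number of manual actions matters even when each action is individually very reliable.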

2. AI is the Key Enabler

Parallel: AI is the key technological enabler allowing the progress towards true autonomous systems.

Autonomous Vehicles: AI is a key feature in many subsystems such as lane keeping and autonomous emergency braking. These AI systems primarily pursue the goal of accident avoidance, among other objectives such as efficient navigation, fuel efficiency, and passenger comfort. The underlying technologies include machine vision, deep learning, and neural networks. These technologies enable autonomous vehicles to process vast amounts of sensory data from cameras, lidar and radar to perceive their environment accurately. Additionally, reinforcement learning techniques are employed to improve decision-making and control systems, enabling the vehicle to navigate complex driving scenarios and adapt to new environments based on the vast amounts of data from millions of miles driven (IntechOpen).

Autonomous Cloud: AI is also a key enabler that powers systems that detect, prevent or avoid incidents including resource shortfall avoidance and service restarts. Autonomous cloud systems’ primary objective is incident avoidance, with other objectives being cloud cost and performance optimization. The underlying technologies include predictive analytics and time series models. To perceive the status of the systems being managed, AI systems ingest billions of observations of cloud resource metrics via an APM such as Prometheus, Datadog or CloudWatch. At Sedai for example, we analyze over 1B metrics per day. AI-driven algorithms then analyze that data to identify patterns, predict potential issues, and proactively optimize cloud resources to meet user demands. Autonomous cloud systems also use reinforcement learning to continually improve over time.
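As a rough illustration of resource shortfall avoidance, the sketch below fits a straight line to recent usage samples and estimates how many sampling intervals remain before usage reaches capacity. A least-squares trend line is used here only as the simplest possible stand-in for a time series model; it is not a description of how Sedai or any particular product forecasts demand. The function name, the sample data, and the evenly spaced sampling are assumptions made for the example.

```python
from typing import Optional, Sequence


def intervals_until_shortfall(usage: Sequence[float], capacity: float) -> Optional[float]:
    """Estimate sampling intervals left before usage reaches capacity.

    Fits a least-squares line to evenly spaced usage samples and extrapolates.
    Returns 0.0 if the latest sample is already at or over capacity, and None
    if the fitted trend is flat or falling (no shortfall predicted).
    """
    n = len(usage)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    if usage[-1] >= capacity:
        return 0.0

    # Ordinary least squares with x = 0, 1, ..., n-1 as the sample index.
    mean_x = (n - 1) / 2
    mean_y = sum(usage) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(usage))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    if slope <= 0:
        return None

    # Index at which the fitted line crosses capacity, relative to the last sample.
    crossing_index = (capacity - intercept) / slope
    return max(0.0, crossing_index - (n - 1))


# Hypothetical memory utilization samples (percent), one per interval.
samples = [50.0, 55.0, 60.0, 65.0, 70.0]
print(intervals_until_shortfall(samples, capacity=90.0))
```

In practice, seasonality, bursts, and noisy metrics call for richer models than a straight line, but the shape of the decision is the same: forecast, compare against a limit, and act before the limit is reached.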

3. Incident Reduction is a Journey

The Parallel: The benefits of autonomous systems are realized incrementally as new use cases are tackled.

Autonomous Vehicles: The measurement of progress in autonomous vehicles has been formalized into levels of driving automation running from Level 0 to Level 5 (SAE). Individual features at early stages like adaptive cruise control, lane-keeping assistance, and autonomous braking have each contributed to improved safety. For instance, rear-end collisions accounted for approximately 29% of all crashes in the United States (NHTSA) and autonomous emergency braking has been shown to reduce them by up to 50% (IntechOpen).

Autonomous Cloud: In cloud management, progress towards full autonomy is driven by the introduction of new capabilities as well as the coverage of a greater set of cloud services. From a capability perspective, for example, autonomous restarts can significantly reduce incidents. Another example is resource shortfalls, which were the cause of around 12% of incidents (source); Microsoft published results of an AI system that predicted capacity outages with nearly 100% accuracy (source). The second driver, increased coverage of relevant technologies, is an area Sedai is working on; after initially introducing autonomous management for modern compute applications (serverless and containers), Sedai is also adding virtual machine, storage and data streaming coverage.

Implications: These staged advancements highlight the importance of progressively integrating autonomous features to enhance overall system safety and reliability.

4. Phased Capability Rollouts Minimize Incidents

The Parallel: Gradual deployment of technology following rigorous testing prevents overreach and mitigates risks.

Autonomous Vehicles: Autonomous vehicles undergo extensive testing in diverse conditions to ensure their safety and performance. For example, Tesla collected over 30 billion miles of fleet data to model and predict collision frequencies and to calculate Safety Scores (source). However, the autonomous vehicle sector has sometimes moved too quickly, leading to setbacks such as the suspension of Cruise's autonomous vehicles by the California DMV due to high accident rates (Global Compliance News). In the data from Ark below, while Tesla and Waymo reported lower accident rates than the average vehicle, Cruise is estimated to have had higher accident rates.

Autonomous Cloud: At Sedai, we also undertake significant testing on each new release that affects incident management to validate the system's robustness in various scenarios. A significant portion of our cloud costs is driven by these testing processes. This testing helps ensure that autonomous cloud solutions are reliable and resilient before being rolled out in production environments.

5. User Trust is a Precursor for Adoption

Parallel: Building public / end user trust is crucial for the adoption of autonomous systems.

Autonomous Vehicles: In the AV industry, public skepticism about fully self-driving cars has been a barrier, necessitating efforts to build trust through transparency and communication. Trust among US drivers has dropped in recent years according to the AAA, which noted that “An incremental rollout of proven technologies will be key to getting drivers more comfortable with the idea of a fully self-driving car.” The AAA also notes there is strong demand for individual features, with six in ten US drivers wanting advanced safety features such as autonomous braking, even though just 9% completely trust self-driving vehicles (AAA). Companies like Tesla publish safety statistics in an effort to build trust.

Autonomous Cloud Management: Autonomous cloud providers must similarly work on building trust with users by being transparent about the capabilities and limitations of their systems. At Sedai, we contribute to this effort by hosting an annual conference, autocon, where autonomous cloud users share their progress and discuss the progression of autonomous cloud technologies. Trust is also earned based on direct experience, and we usually see a cautious rollout pattern that often involves progressing from less critical environments (development / testing) to more critical ones (production), and from lower to higher levels of autonomy (see an example from AWS user KnowBe4 here). At Sedai, we implement these levels of autonomy through our Datapilot, Copilot, and Autopilot modes, which correspond to collecting insights, manually approved recommendations, and full autonomy.
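The cautious rollout pattern described above can be sketched as a small policy table that maps each environment to a level of autonomy. The mode names come from the text; everything else here (the `Mode` enum, the `decide` function, the returned strings, and the example environment mapping) is hypothetical and is not Sedai's API or configuration format.

```python
from enum import Enum


class Mode(Enum):
    DATAPILOT = "datapilot"  # collect insights only
    COPILOT = "copilot"      # recommend; a human approves each action
    AUTOPILOT = "autopilot"  # act without waiting for approval


def decide(mode: Mode, approved: bool = False) -> str:
    """Return what should happen to a proposed action under the given mode."""
    if mode is Mode.DATAPILOT:
        return "record"  # never act, just log the insight
    if mode is Mode.COPILOT:
        return "execute" if approved else "await_approval"
    return "execute"  # AUTOPILOT


# Hypothetical rollout: more autonomy in less critical environments first.
ROLLOUT = {
    "dev": Mode.AUTOPILOT,
    "staging": Mode.COPILOT,
    "prod": Mode.DATAPILOT,
}

for env, mode in ROLLOUT.items():
    print(env, mode.value, decide(mode))
```

As trust builds, a team following this pattern would raise the mode for an environment one step at a time, for example moving production from Datapilot to Copilot before considering Autopilot.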

6. Treat Humans and Machines as a System

Parallel: Integrating human and machine efforts is crucial.

Autonomous Vehicles: In the automotive world, issues have arisen when drivers were unaware of the limitations of new safety systems, leading to misuse and accidents. For instance, drivers may over-rely on features like adaptive cruise control or lane-keeping assistance, assuming these systems can handle all driving scenarios without human intervention. This over-reliance can result in dangerous situations, especially when the system encounters conditions it cannot manage, such as unexpected obstacles or severe weather conditions. Research has shown that effective human-machine teaming in AVs requires drivers to maintain a certain level of situational awareness and readiness to take control when necessary.

Autonomous Cloud: It is vital that users are well-trained on the system and aware of its capabilities and limitations. Effective integration of human oversight with machine automation ensures that both can complement each other, reducing the likelihood of errors. For example, while automated management of purchase commitments can help optimize costs, human operators may know that the application is going to be replatformed and that such a commitment does not make sense.

7. Industry Collaboration Helps as Technologies Mature

Parallel: Clear frameworks, approaches and standards guide the safe deployment of autonomous technologies, and collaboration among industry players can support that objective.

Autonomous Vehicles: Collaboration among automakers, technology companies, and regulatory bodies is crucial in ensuring that autonomous vehicles are developed and deployed in a safe and responsible manner. In the automotive industry, regulations have also been pivotal in shaping the development and testing of AVs. This collaborative approach also helps build trust among consumers and regulators in the safety and reliability of autonomous vehicles.

Autonomous Cloud Management: For ACM, we expect individual players like Sedai to continue building the initial market momentum and that industry associations like the Cloud Native Computing Foundation (CNCF) and the FinOps Foundation will begin to play a crucial role in establishing best practices and standards that help the industry transition to higher levels of autonomy. We also see analysts such as Gartner identifying the best practices around autonomous management and communicating that via their research.

Conclusion

The development of autonomous vehicles offers valuable lessons for the autonomous cloud management industry. By focusing on reducing human error, adopting a phased rollout approach, integrating human and machine efforts, continuously collecting data, rigorously testing systems, establishing regulatory frameworks, building public trust, and fostering collaboration, autonomous cloud management can achieve significant advancements in safety, efficiency, and reliability.

John Jamie

Published on

May 27, 2024

Last updated on

November 28, 2024
