
Mint Explainer | Millions affected: Why cloud crashes cloud AI’s horizon


As companies move massive AI workloads to the cloud, reliance on a handful of large cloud providers called hyperscalers (AWS, Google Cloud, and Microsoft Azure) can make the cloud a single point of failure for critical AI infrastructure. Mint explains why software-as-a-service (SaaS) and AI-native companies can’t afford to ignore this risk.

What is the AWS outage all about?

On Monday, Amazon’s cloud computing arm, AWS, suffered a major global outage that hit thousands of online platforms, from social media and gaming to streaming and financial apps, not only in North America but also in the UK, Australia, and India.

AWS attributed the downtime to a domain name system (DNS) issue. DNS translates website names (like livemint.com) into the internet protocol (IP) addresses that computers use to find one another, so when it fails, devices like computers and smartphones cannot locate websites even though those sites are still running.
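To make the DNS step concrete, here is a minimal Python sketch of the lookup an app performs before it can connect to anything. It uses the standard library's `socket.getaddrinfo`; the helper names `resolve` and `broken_dns` are illustrative assumptions, and `broken_dns` only simulates a resolver failure rather than reproducing what happened at AWS.

```python
import socket


def resolve(hostname, lookup=socket.getaddrinfo):
    """Return the sorted IP addresses for hostname, or [] if the DNS lookup fails.

    `lookup` is injectable so the failure path can be shown without a real
    outage; it defaults to the standard resolver call.
    """
    try:
        results = lookup(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name could not be translated; the server itself may be fine.
        return []
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in results})


def broken_dns(hostname, port, **kwargs):
    # Hypothetical stand-in for a resolver during a DNS outage.
    raise socket.gaierror(socket.EAI_NONAME, "Name or service not known")


# Simulated outage: no address comes back, so the client cannot even try to connect.
print(resolve("livemint.com", lookup=broken_dns))  # prints []
```

Calling `resolve("livemint.com")` with the default resolver on a networked machine would return the site's current addresses. The point of the sketch is that the failure occurs before any request reaches the website.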

E-commerce delivery expert ParcelHero estimates that retailers across the UK, Europe and the US would have lost around $1 billion because of the global outage. While AWS says it has resolved the issue, such outages raise a bigger concern as AI companies shift more training (when an AI model learns) and inference (when it uses that learning to, say, identify a cat in a new image) workloads to the cloud.

Were there similar big outages in the past?

Major cloud outages have repeatedly disrupted the internet. In February 2017, an AWS outage caused by an internal human error disrupted Slack and Quora, while Google Cloud experienced big outages in June 2019 and November 2021 that also affected Gmail and YouTube.

Azure experienced similar outages in 2021, but the biggest disruption to Microsoft systems came in July 2024, when a faulty CrowdStrike Falcon Sensor update disrupted 8.5 million Windows devices worldwide, impacting aviation, banking, and government systems.

How dependent are companies and governments on the cloud today?

Just three cloud service providers (AWS, Microsoft and Google) together serve more than 60% of the world’s cloud infrastructure needs. Enterprise spending on cloud infrastructure services rose to almost $99 billion worldwide in April-June 2025, up by more than $20 billion from the second quarter of 2024, as per data from Synergy Research Group. The revenue includes infrastructure as a service (IaaS), platform as a service (PaaS) and hosted private cloud services.

With generative AI (GenAI) being the major driver of this growth, cloud providers have seen their quarterly revenues jump by $36 billion since the beginning of 2023. Amazon remained dominant in the April-June quarter with a 30% market share, followed by Microsoft (20%) and Google Cloud (13%), according to Synergy Research. Smaller cloud providers include CoreWeave, Oracle, Databricks and Huawei.

But how could AWS alone crash half the internet?

AWS may hold just 30% of the cloud infrastructure market, but many globally popular services rely on it, including social media platforms, gaming services, streaming sites, and financial apps. Alexa, Snapchat, Venmo, Reddit, Coinbase, WhatsApp, Signal, Zoom, and Perplexity are among them.

Hence, when a key region or service fails, millions of users are affected regardless of AWS’s overall market share. Some airlines, such as Delta Air Lines and United Airlines, also encountered disruptions, as per Downdetector. The impact is amplified by the network effect that Metcalfe’s law describes: services often depend on AWS application programming interfaces (APIs), databases, authentication, or DNS, meaning that even apps hosted elsewhere can break if they call AWS components.
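The way one failed component spreads through everything that calls it can be sketched as a walk over a dependency map. The service names and the map below are hypothetical, chosen only to illustrate the idea; they do not describe any real company's architecture.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it calls directly.
DEPENDS_ON = {
    "photo-app": ["auth-service", "object-storage"],
    "payments-app": ["auth-service", "managed-database"],
    "auth-service": ["managed-database", "dns"],
    "object-storage": ["dns"],
    "managed-database": ["dns"],
    "news-site": ["cdn"],
    "cdn": [],
    "dns": [],
}


def affected_by(failed, depends_on):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map: component -> services that call it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outward from the failed component.
    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


print(sorted(affected_by("dns", DEPENDS_ON)))
# ['auth-service', 'managed-database', 'object-storage', 'payments-app', 'photo-app']
```

In this toy map, the two apps never call DNS directly, yet both go down with it because the services they rely on do. Only the site behind a separate CDN is untouched.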

Additionally, companies tend to consolidate workloads in a few regions or providers for efficiency and cost savings, creating single points of failure and making it seem like “half the internet” is offline.

How reliant is AI on the cloud?

Cloud AI integrates AI with cloud computing, allowing organizations to seamlessly align their day-to-day operational activities with AI tools, algorithms, and cloud services. The global cloud AI market size, which was valued at $78.36 billion in 2024, is forecast to rise from $102.09 billion in 2025 to $589.22 billion by 2032, as per Fortune Business Insights (www.fortunebusinessinsights.com/cloud-ai-market-108878).
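For a sense of the pace that forecast implies, the compound annual growth rate can be worked out from the 2025 and 2032 figures cited above. This is a back-of-the-envelope calculation on those two numbers only, assuming steady compounding over the seven years in between.

```python
start_value = 102.09   # $ billion, 2025 forecast cited above
end_value = 589.22     # $ billion, 2032 forecast cited above
years = 2032 - 2025    # 7 compounding periods

# CAGR = (end / start) ** (1 / years) - 1
cagr = (end_value / start_value) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.1%}")
```

The result works out to a little over 28% a year, meaning the market would nearly sextuple between 2025 and 2032.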

The reason is that every major AI breakthrough relies on scaling cloud computing. AI foundation models, including OpenAI’s GPT-5, Meta’s Llama, Google’s Gemini and Anthropic’s Claude, require cloud infrastructure for their massive computational, storage, and networking needs. AI workloads are the primary driver of infrastructure demand growth, pushing cloud providers toward specialized AI chips, containers, and services.

This is enabling the next platform shift, in which ubiquitous cloud access combines with embedded AI capabilities to create entirely new software categories and business models.

How to address redundancy?

Cloud providers do have comprehensive disaster-recovery frameworks, but many businesses choose “availability within region” over a full multi-region or multi-cloud architecture because of the cost and complexity involved. To reduce downtime risks in the AI era, companies must adopt a multi-layered strategy.

Relying on a single cloud provider or region creates single points of failure, so spreading workloads across multiple regions or even multiple cloud vendors can ensure continuity if one service goes down.
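A rough calculation shows why spreading workloads helps. The sketch below assumes each region is available 99.9% of the time (a hypothetical figure, not a quoted service level) and that regions fail independently, which is an optimistic assumption because shared dependencies such as DNS can take down several regions at once.

```python
HOURS_PER_YEAR = 365 * 24


def combined_availability(single, copies):
    """Availability of `copies` independent deployments where any one suffices."""
    return 1 - (1 - single) ** copies


# 0.999 is an assumed per-region availability used purely for illustration.
for copies in (1, 2, 3):
    availability = combined_availability(0.999, copies)
    downtime_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"{copies} region(s): ~{downtime_hours:.5f} hours of expected downtime per year")
```

Under these assumptions, one region means nearly nine hours of expected downtime a year, while two independent regions cut that to about half a minute. Real-world gains are smaller whenever failures are correlated.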

Critical systems—databases, APIs, and authentication services—should have active failover and redundancy, with regular testing to confirm that backups work under real-world conditions. Applications should be decoupled from any one service to prevent cascading failures, and designed to offer partial functionality rather than complete shutdown during outages. Continuous monitoring and chaos testing help identify vulnerabilities before they become critical.
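Active failover and partial functionality can be sketched in a few lines. The function and provider names below are hypothetical; the pattern is to try each deployment in order and, if all of them fail, return a degraded answer such as cached results instead of an error page.

```python
def call_with_failover(providers, fallback):
    """Try each provider in order; return the fallback if all of them fail.

    `providers` is a list of (name, zero-argument callable) pairs.
    Returns a (source_name, result) tuple.
    """
    for name, call in providers:
        try:
            return name, call()
        except Exception as error:  # a sketch; real code would catch narrower errors
            print(f"{name} failed: {error}")
    # Every provider is down: degrade gracefully instead of crashing.
    return "fallback", fallback


# Hypothetical stand-ins for the same AI endpoint hosted in two places.
def primary_region():
    raise ConnectionError("primary region unreachable")


def secondary_cloud():
    return "fresh recommendations"


source, result = call_with_failover(
    [("primary", primary_region), ("secondary", secondary_cloud)],
    fallback="cached recommendations",
)
print(source, result)  # prints: secondary fresh recommendations
```

If the secondary were also down, the caller would receive the cached recommendations, so users would see slightly stale results rather than a complete shutdown.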

However, moving workloads across vendors or regions comes with significant costs, including higher cloud bills, integration complexity, and potential data transfer fees. Companies must weigh these expenses against the risk of downtime, especially as Gen Z users and enterprises alike demand fast, uninterrupted AI services.
