John Organek, Director of Program Planning and Operational Architecture
August 4, 2024
The recent CrowdStrike global software incident, the costs of which could top $1 billion, exposes several glaring gaps and shortcomings in how companies and institutions operate in our brave new cyber-physical world. While it did not cause death or injury, it nevertheless wreaked widespread havoc across other infrastructures, including airlines, hospitals, and 911 services. It caused Delta Air Lines alone to cancel more than 2,000 flights on July 19 and over 6,000 flights since then. Something as small as a few lines of bad code, deployed to a myriad of endpoints worldwide, caused the largest IT outage in history. For want of a nail... the kingdom was lost!
A fundamental error made across the board is the failure to fully understand the risk of apparently minor ‘disturbances’ creating major consequences, whether outbound to or inbound from other infrastructures. One wonders whether the boards of any of the affected companies had even considered the devastating impact that software could cause and, if so, whether they took appropriate action to mitigate the loss. Did CrowdStrike realize how a bit of bad code would be amplified globally and devastate its reputation as a cybersecurity company? Did Delta Air Lines plan for a scenario of almost existential risk? Did their business continuity plans address such an eventuality and, if so, what did they do about it? After all, software is now a part of virtually everything we touch and do.
Our modern societies face other sources of near-existential risk beyond software bugs, such as Black Sky electric grid events, widespread communications and data center failures, and cyber-attacks. In this highly connected world, very small failures can propagate quickly, leading to more incidents like CrowdStrike in the future.
Preliminary reports pinpoint several failures that led to the outage, spreading blame across multiple stakeholders. For example, the new software was insufficiently tested, and apparently there was no plan for reverting to the previous version. In addition, end users were not prepared to act when they lost processing capabilities at the edge. No one seemed to be prepared when the inevitable happened. None of these stakeholders could be rated as ‘resilient’.
Software issues will continue well into the future. Stakeholders need to recognize that accidents such as the recent one are normal occurrences. They should therefore be especially attentive to the risks that software poses to their business operations and reputation, ranging from cyber-attacks to poor quality or flawed deployment. But because these normal accidents will continue to happen, stakeholders must treat maintaining business continuity as a top priority, ahead of any belief that they can fully prevent such accidents. Besides, as Delta has discovered, its operations were gravely affected by bits of software developed by a company it probably had little corporate knowledge of.
The CrowdStrike incident has again reminded us of the risks posed by our highly interdependent cyber-physical critical infrastructures. But more importantly, it should remind us that we are still far from being resilient.