
CrowdStrike and the Need for Resilience

John Organek, Director of Program Planning and Operational Architecture

 August 4, 2024

The recent CrowdStrike global software incident, the costs of which could top $1 billion, points out several glaring gaps and shortcomings in how companies and institutions operate in our brave new cyber-physical world. And while it did not cause death or injury, it nevertheless wreaked widespread havoc across other infrastructures, including airlines, hospitals, and 911 services. It caused Delta Air Lines alone to cancel more than 2,000 flights on July 19 and over 6,000 flights since then. Something as small as a few lines of bad code, deployed to a myriad of endpoints globally, caused the largest IT outage in history. For want of a nail... the kingdom was lost!

A fundamental error made across the board is the failure to fully understand how apparently minor ‘disturbances’ can create major consequences, whether outbound to or inbound from other infrastructures. One wonders whether the board of any affected company had even considered the devastating impact that software could cause and, if so, whether it took appropriate action to mitigate the loss. Did CrowdStrike realize how a bit of bad code would be amplified globally and devastate its reputation as a cybersecurity company? Did Delta Air Lines plan for a scenario of almost existential risk? Did their business continuity plans address such an eventuality and, if so, what did they do about it? After all, software is now a part of virtually everything we touch and do.

Our modern societies face other sources of near-existential risk beyond software bugs, such as Black Sky electric grid events, widespread communications and data center failures, and cyber-attacks. In this highly connected world, very small failures can propagate quickly, leading to more incidents like the CrowdStrike one in the future.

Preliminary reports pinpoint several failures that led to the outage, casting blame across multiple stakeholders. For example, the new software was insufficiently tested, and apparently there was no plan for reverting to the original version. Also, end users were not prepared to act when they lost processing capabilities at the edge. No one seemed to be prepared when the inevitable happened. None of these stakeholders could be rated as ‘resilient’.

CrowdStrike “Falcon Sensor”

Software issues are going to continue well into the future. Stakeholders need to recognize that accidents such as the recent one are normal occurrences. They should therefore be especially attentive to the risks that software poses to their business operations and reputation, ranging from cyber-attacks to poor quality or poor deployment. But because these normal accidents will continue to happen, stakeholders must treat maintaining business continuity as a top priority, ahead of believing they can fully prevent such accidents. Besides, as Delta has discovered, its operations were gravely affected by bits of software developed by a company it probably had little corporate knowledge of.

The CrowdStrike incident has again reminded us of the risks posed by our highly interdependent cyber-physical critical infrastructures. But more importantly, it should remind us that we are still far from being resilient.

We are all connected. We are all vulnerable.

Collaboration is our strength.

By: John Organek
