The CrowdStrike Outage: A Crucial Business Continuity Exercise
The recent CrowdStrike outage that disrupted computer systems worldwide serves as a stark reminder of the importance of robust business continuity and incident response strategies. For any technology-dependent or information-centric industry, such as manufacturing, financial services, or retail, this event is a valuable case study highlighting the critical need for preparedness and resilience.
Understanding the Impact
CrowdStrike, a leading cybersecurity firm, caused a significant outage of Microsoft Windows computers that affected services for thousands of organizations globally. For organizations relying on CrowdStrike for their security needs, the outage, caused by a faulty software update, posed immediate operational and security risks. Systems running CrowdStrike’s protection products were abruptly taken offline, often in an unsafe way (they “crashed”). Those systems were left vulnerable, and businesses had to adapt quickly to mitigate potential security threats while also dealing with often catastrophic levels of system disruption.
The fallout from this incident underscores a fundamental truth: no organization is immune to disruptions, even those specializing in cybersecurity. Remediating the faulty update required manual intervention by systems administrators, largely on a system-by-system basis. The work was arduous, mundane, and extremely time-consuming, which prolonged recovery for most organizations affected by the incident.
The Importance of Business Continuity Planning
Business continuity planning (BCP) is the process of creating systems of prevention and recovery to deal with potential threats to a company. The objective is to keep operations running during an adverse event or disaster and to restore them quickly afterward. The CrowdStrike outage illustrates why BCP is essential for all organizations, especially those in information-centric industries where data integrity and availability are paramount.
A well-designed BCP should address the following:
- Risk Assessment: Identifying potential risks and their impact on business operations.
- Business Impact Analysis: Determining the effects of an interruption on critical business functions.
- Recovery Strategies: Developing strategies to recover critical business functions within a defined timeframe.
- A Documented Plan: Documenting recovery strategies and procedures.
- Testing and Exercises: Regularly testing the plan to ensure its effectiveness.
Keep in mind that not every contingency or planning scenario needs to be a natural disaster, fire, or other catastrophe. The CrowdStrike outage is a good example of vendor-introduced risk.
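To make the "defined timeframe" in a recovery strategy concrete, the following is a minimal sketch of how a planner might compare a manual, system-by-system remediation effort against a recovery time objective (RTO). The function names, the endpoint count, the minutes per endpoint, and the 24-hour RTO are all hypothetical values chosen for illustration; they are not figures from the CrowdStrike incident.

```python
import math


def estimated_recovery_hours(endpoints: int, minutes_per_endpoint: float,
                             technicians: int) -> float:
    """Estimate wall-clock hours to remediate every endpoint by hand.

    Assumes each technician fixes one endpoint at a time and the work is
    split as evenly as possible (a simplifying assumption).
    """
    if technicians < 1:
        raise ValueError("at least one technician is required")
    # The technician with the longest queue determines the total duration.
    endpoints_per_technician = math.ceil(endpoints / technicians)
    return endpoints_per_technician * minutes_per_endpoint / 60


def technicians_needed(endpoints: int, minutes_per_endpoint: float,
                       rto_hours: float) -> int:
    """Smallest number of technicians that can finish within the RTO."""
    # How many endpoints one technician can fix before the RTO expires.
    capacity = math.floor(rto_hours * 60 / minutes_per_endpoint)
    if capacity < 1:
        raise ValueError("a single endpoint takes longer than the RTO")
    return math.ceil(endpoints / capacity)


# Hypothetical scenario: 1,200 affected endpoints, 15 minutes each,
# 4 technicians on staff, and a 24-hour recovery time objective.
hours = estimated_recovery_hours(endpoints=1200, minutes_per_endpoint=15,
                                 technicians=4)
print(f"Estimated recovery time: {hours:.1f} hours")
print(f"Meets 24-hour RTO: {hours <= 24}")
print(f"Technicians needed: {technicians_needed(1200, 15, 24)}")
```

Even a rough calculation like this can show, before an incident, whether existing staffing can meet the recovery timeframe the plan promises or whether additional help would need to be arranged in advance.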
Incident Response: The First Line of Defense
Incident response (IR) is a structured approach to addressing and managing the aftermath of a security breach or cyberattack. The goal is to handle the situation in a way that limits damage and reduces recovery time and costs. Although the CrowdStrike outage was not itself an attack, it is a clear example of why having a robust IR plan is crucial. Organizations must be able to detect incidents quickly, respond effectively, and recover operations efficiently.
The CrowdStrike outage quickly swept across the globe, taking down systems in its wake. System administrators scrambled to determine whether the outages they were suddenly experiencing were the result of a cyberattack or something else entirely. Incorporating lessons like these into a documented Incident Response Plan (IRP) can ease “fog of war” decision-making in the middle of a fast-moving incident.
Lessons From the CrowdStrike Outage
The CrowdStrike outage provides several key takeaways to consider incorporating into documented Incident Response Plans and Business Continuity Plans.
- Vendor Dependency: While relying on vendors for specialized services is common, it’s crucial to understand the risks involved. Businesses must have contingency plans in place for when these vendors face outages. Asking key vendors about their deployment, testing, and incident notification procedures can uncover unexpected (latent) areas of risk.
- Communication: Effective communication during an incident is vital. Keeping stakeholders informed about the situation and the steps being taken to resolve it helps maintain trust and manage expectations. Communication both within the organization and with outside parties is critical. Many of the organizations impacted by the CrowdStrike outage found themselves without key communication systems, like Voice over IP and email, at the onset of the incident. Many also lost access to critical documentation and process management systems, leaving them unable to contact the appropriate resources. Many literally didn’t know who to call or had no way to do so. Using alternative systems that don’t share dependencies (or cloud platforms) with production-critical systems can minimize these types of unforeseen impacts. Often the simplest solution is the most effective, like a printed emergency contact list.
- Resilience and Redundancy: BCP best practices state that organizations should implement redundant systems and backup solutions to ensure business continuity in case of a vendor outage. In the case of the CrowdStrike outage, this would be difficult, but not impossible, to accomplish because of the widespread and critical nature of the cybersecurity services CrowdStrike provides. Rather than writing off resilience or redundancy as unachievable for critical (and “low-level”) vendors like CrowdStrike, consider applying more scrutiny to the practices of those vendors. Technical and administrative solutions exist to minimize exposure to single-vendor risk. It’s the organization’s choice to employ them.
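The point above about alternative systems that don’t share dependencies with production can be checked with a simple inventory exercise. The following is a minimal sketch, assuming an organization has already listed what each system depends on; every system and dependency name here is hypothetical and used for illustration only.

```python
from typing import Dict, Set


def risky_fallbacks(production_deps: Set[str],
                    fallbacks: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return each fallback system that shares a dependency with production.

    A fallback that overlaps with production is likely to fail at the
    same time production does, so it offers little real redundancy.
    """
    overlaps = {}
    for name, deps in fallbacks.items():
        shared = production_deps & deps  # set intersection
        if shared:
            overlaps[name] = shared
    return overlaps


# Hypothetical dependency inventory.
production = {"windows-endpoints", "endpoint-security-agent", "corporate-sso"}
fallbacks = {
    "voip-phones": {"windows-endpoints", "corporate-sso"},
    "email": {"corporate-sso"},
    "personal-mobile-call-tree": {"cellular-network"},
    "printed-contact-list": set(),
}

for system, shared in risky_fallbacks(production, fallbacks).items():
    print(f"{system} shares dependencies with production: {sorted(shared)}")
```

In this made-up inventory, the VoIP phones and email are flagged because they would likely go down alongside production, while the mobile call tree and the printed contact list are not.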
Examining Post-outage Preparedness
Now, four days after the CrowdStrike outage, any vendor or customer still in a crippled state should raise a red flag. Organizations in that position should urgently review their incident response and business continuity strategies, because a prolonged recovery period indicates weaknesses in their preparedness plans. Organizations should conduct a thorough post-incident analysis to identify gaps and implement improvements. This may include contractually secured emergency staff augmentation or better pre-incident tabletop exercises. Whatever the approach, use this as a wake-up call. It isn’t a question of “if” there will be another “CrowdStrike outage”; it’s a matter of “when.”
Moving Forward
For organizations sensitive to unexpected interruptions to their technology systems, the CrowdStrike outage is a call to action. It’s an opportunity to reassess and strengthen your organization’s business continuity and incident response plans. By learning from this incident, you can better prepare your organization to handle future disruptions, ensuring resilience and maintaining operational integrity. Remember, it is a matter of “when,” not “if.”
In conclusion, the CrowdStrike outage serves as an excellent business continuity exercise. It highlights the need for robust planning, effective incident response, and ongoing resilience in the face of disruptions. Organizations should take this opportunity to strengthen their strategies and make sure they are well prepared for any eventuality.
Getting Help
CorpInfoTech is a managed service provider (MSP) that offers IT and security solutions to SMBs seeking to protect their business assets from external and internal cyber threats. Our services include firewall management (xDEFENSE), vulnerability scanning (v360), patch management, security/risk assessments, and compliance aid. Our services exist to make sure your business is both operational and secure 24x7.
CorpInfoTech is here to help organizations conduct planning, analysis, preparation, and testing of Business Continuity strategies in the event of an outage or unforeseen security event.
Contact us today to learn how we can better prepare your business for the future!