Preventing Critical Errors from Reaching Production: Lessons from the CrowdStrike Outage
On July 19th, 2024, the world witnessed a large-scale computer outage caused by a faulty update from cybersecurity giant CrowdStrike. This incident, affecting millions of Windows devices globally, serves as a stark reminder of the domino effect that software errors can have. Since then, CrowdStrike has shared a preliminary incident report outlining what happened and the steps it will take to prevent similar issues. That said, this was no small bug: the errors resulted in an estimated $5.4 billion in losses for Fortune 500 companies alone. Even with the best forward-looking intentions, this event caused reputational harm that will surely cost the company in the future.
Bugs are always a risk, but you can protect your business by implementing the right development and testing processes. In this blog post, we will explore what happened, the impact it had, and how you can prevent similar disasters for your app.
What Caused the CrowdStrike Outage?
The culprit behind the outage was a seemingly innocuous update: a sensor configuration update for CrowdStrike’s Falcon security software on Windows machines. As CrowdStrike itself explained, the update triggered a logic error that crashed the system and produced the dreaded Blue Screen of Death (BSOD). Affected devices rebooted repeatedly and became unusable.
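To make the failure mode concrete, here is a minimal Python sketch of how a data-only update can break code that trusts it. It is a hypothetical illustration, not CrowdStrike’s code or file format: `EXPECTED_FIELDS`, `apply_rule_unchecked`, and `apply_rule_checked` are names invented for this example.

```python
from typing import List, Optional

EXPECTED_FIELDS = 3  # hypothetical: the consuming code expects three fields per rule


def apply_rule_unchecked(rule: List[str]) -> str:
    """Trusts the update blindly and reads a fixed position."""
    # Raises IndexError if the update supplies fewer fields than expected.
    return rule[EXPECTED_FIELDS - 1]


def apply_rule_checked(rule: List[str]) -> Optional[str]:
    """Validates the shape first and skips rules that do not match."""
    if len(rule) != EXPECTED_FIELDS:
        return None  # reject the malformed rule instead of crashing
    return rule[EXPECTED_FIELDS - 1]


good_rule = ["process", "path", "block"]
bad_rule = ["process", "path"]  # one field short of what the reader expects

print(apply_rule_checked(good_rule))
print(apply_rule_checked(bad_rule))
try:
    apply_rule_unchecked(bad_rule)
except IndexError as exc:
    print("unchecked reader failed:", exc)
```

In a script like this, the unchecked reader fails with an exception that can be caught. In software that runs with operating system privileges, a comparable unchecked read can take down the whole machine rather than a single process.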
The Global Impact
The impact of the CrowdStrike outage was widespread and disruptive. Businesses across various industries, including finance, healthcare, retail, and transportation, were left paralyzed. Critical operations were halted as employees couldn’t access their workstations. Individual users who rely on Windows machines for work or personal use were also inconvenienced and, in some cases, put at risk. The airline industry was hit hard and was unable to recover for days, with passengers left stranded across the globe. More concerning, there were reports of 911 dispatchers unable to answer calls and deploy emergency services for upwards of 7 hours in some cases. The outage caused significant productivity losses across the globe, highlighting the interconnectedness of today’s digital landscape. It also highlighted a potential overreliance on a single software provider: if organizations running Windows had used a more varied mix of security software, the outage might not have been so widespread.
CrowdStrike itself will face both financial and reputational battles moving forward. Businesses like Delta have already begun to pursue legal action to recover their losses. Even if CrowdStrike can handle the financial demands of the lawsuits, its reputation has been badly damaged. Additionally, CrowdStrike’s stock value has dropped by a whopping 41% over the course of two weeks. Recovering from this will require significant effort, and most companies lack the resources for that kind of recovery.
How Could This Have Been Avoided?
In short, this outage happened because a seemingly small update didn’t go through the normal testing channels before being released. Bugs are unavoidable, but there are steps you can take to prevent them from having a catastrophic effect on your business. Here are some tips that can minimize your risk of shipping bugs to production:
- Implement a rigorous testing process. Ensure that all updates go through a thorough testing process before being released to the public. This includes unit testing, integration testing, and system testing.
- Use automated testing tools. Automated tests can run on every change, so they catch bugs before they reach production without relying on someone remembering to check.
- Perform regular code reviews. A second set of eyes can catch logic errors that tests miss and can point out areas where your code can be improved.
- Implement a bug tracking system. A bug tracking system keeps a record of all bugs and issues that arise during development. Over time, that record reveals patterns and trends that help you prevent similar issues in the future.
- Invest in parametric insurance. Parametric insurance can help protect your business from the financial impact of software errors. With Riskwolf, you can turn real-time data into insurance. Using unique real-time data and dynamic risk modelling, we enable insurers to build and operate parametric insurance at scale. Simple. Reliable. Fast.
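The testing tips above can be combined into an automated gate that every update has to pass before release. The sketch below is only an illustration under stated assumptions: the update schema (`version`, `rules`, `action`) and the `validate_update` helper are hypothetical, and Python’s built-in `unittest` module stands in for whatever test framework your team already uses.

```python
import unittest

REQUIRED_KEYS = {"version", "rules"}  # hypothetical schema for an update file


def validate_update(update: dict) -> list:
    """Return a list of problems; an empty list means the update may ship."""
    problems = []
    missing = REQUIRED_KEYS - set(update)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    rules = update.get("rules", [])
    if not isinstance(rules, list):
        problems.append("rules must be a list")
        return problems
    for i, rule in enumerate(rules):
        if not isinstance(rule, dict) or "action" not in rule:
            problems.append(f"rule {i} has no action")
    return problems


class ValidateUpdateTests(unittest.TestCase):
    def test_well_formed_update_passes(self):
        update = {"version": 2, "rules": [{"action": "block"}]}
        self.assertEqual(validate_update(update), [])

    def test_missing_key_is_reported(self):
        self.assertEqual(
            validate_update({"rules": []}), ["missing keys: ['version']"]
        )

    def test_malformed_rule_is_reported(self):
        update = {"version": 2, "rules": [{"target": "x"}]}
        self.assertEqual(validate_update(update), ["rule 0 has no action"])


# Run the suite programmatically so a release script can act on the result.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ValidateUpdateTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
release_allowed = result.wasSuccessful()
print("release allowed:", release_allowed)
```

The point of the design is that small, data-only updates go through the same gate as code changes: if any test fails, `release_allowed` is false and the update stays out of production.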
In conclusion, the CrowdStrike outage serves as a stark reminder of the importance of implementing the right development and testing processes. By following the tips outlined in this post, you can minimize your risk of shipping bugs to production and protect your business from the financial impact of software errors. Don’t take chances with your business: get in touch with Riskwolf today to develop parametric insurance for your app.
Read part two of this series here.