A Catastrophic System Update and a Huge Failure of QA

In a shocking display of incompetence, millions of computers around the world simultaneously became unusable, all thanks to a bug that led to the dreaded “Blue Screen of Death.”

CrowdStrike, a US cybersecurity company based in Texas, sells ransomware protection, malware protection, and internet security products, primarily to businesses and large organizations. On Friday, July 19, 2024, it released an automatic sensor configuration update for its Falcon program on Windows systems. This reckless update wreaked havoc globally.

Falcon sensor, a cybersecurity program providing automated malware protection, antivirus support, incident response, and other security features, is cloud-based. This means it operates alongside CrowdStrike’s servers, without requiring customers to manage extra equipment or software. Yet, the company’s gross negligence in quality assurance and testing allowed a disastrous bug to slip through.

CrowdStrike stated these types of updates happen multiple times a day. However, this routine update triggered a catastrophic “logic error” that caused Windows systems to crash. The update was meant to target malicious system communication tools but instead plunged millions into chaos.

Millions of Windows PC users reported seeing a “Blue Screen of Death” on their devices, with many systems trapped in a relentless reboot loop. Thousands of flights were grounded, causing chaos for travelers, while banks reported disruptions to critical online transactions. TV broadcasters and telecom operators also faced significant issues, adding to the widespread confusion. To make matters worse, several 911 operators across the US were unable to respond to emergencies for several hours on Friday morning, putting countless lives at risk. This is an outrageous failure of responsibility and competence.

While it may be possible to escape the reboot loop by manually entering Safe Mode, most users have no clue how to do so, and almost all enterprise users lack the admin rights it requires. Millions of kiosks and POS terminals have no traditional mouse or keyboard with which to reach that mode, rendering them dead until an IT professional can be called in to fix them. One by one.

This entire incident highlights a glaring lack of proper testing and quality assurance within the company, raising serious concerns about their operational practices and commitment to their customers’ security.

Root Cause

The cause of this catastrophe is clear. The company moved to a DevOps execution mode some years ago in order to push out updates multiple times a day. As the updates became more frequent, the amount of testing continued to fall. And therein lies the trap. Testing less is NEVER acceptable, even when software tools tell you that a patch or update needs limited testing. Because all software today is immensely complex and has many interdependencies, it is almost impossible to be absolutely sure that even a small patch will not cause problems somewhere, at some time, on some systems (or browsers or phones). Testing less is a symptom of a broken system: one that recognizes it cannot test everything in 2 hours and so abandons that safety net instead of fixing the process. And this is the result.

Did CrowdStrike test this release at all on any Windows systems? While I don’t have inside knowledge, the answer is clear. It was untested. They had become so complacent and sure of their processes, after having thousands of updates go off without a hitch, that testing effectively ceased. Since this update essentially rendered useless any Windows system running Falcon sensor version 7.11 or later, there simply is no explanation other than complacency leading to blue screens of death.

Lessons

This is a huge mess that could have been avoided.

The worldwide cost of IT intervention and lost productivity? Many Billions.

Cost to CrowdStrike’s market cap? Billions.

Money saved by not testing that update? $1000. Max.

Nice work.

As AI becomes able to create massive end-to-end scripts and tests in minutes, there simply is no reason not to test your releases to the fullest extent every time. Testing less leads to a belief that we are getting away with testing less, and less, and less. Until it blows up in your face. While many here will argue the point, the only reason to test less was ever time and cost. As time and cost head to zero with AI, we must leave that reasoning behind and test everything fully.

Don’t have egg on your face. Test more. And leverage AI to generate, update, maintain and run those tests on your fully integrated systems before release.
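To make "test before release" concrete, here is a minimal sketch of a release gate in Python. It is an illustration only, not CrowdStrike's pipeline or any vendor's actual process: the `release_gate` function, the `GateResult` type, and the two sample checks are all hypothetical names invented for this example. The idea is simply that an update ships only if every check passes, and a check that crashes counts as a failure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GateResult:
    approved: bool
    failures: List[str]


def release_gate(update: bytes,
                 checks: Dict[str, Callable[[bytes], bool]]) -> GateResult:
    """Run every check against the update; approve only if all of them pass."""
    failures: List[str] = []
    for name, check in checks.items():
        try:
            passed = check(update)
        except Exception:
            # A check that blows up is treated as a failed check, never a pass.
            passed = False
        if not passed:
            failures.append(name)
    return GateResult(approved=not failures, failures=failures)


def decodes_as_utf8(update: bytes) -> bool:
    # Hypothetical sanity check: raises UnicodeDecodeError on malformed input.
    update.decode("utf-8")
    return True


# Hypothetical checks; a real gate would also boot the update on test machines.
CHECKS: Dict[str, Callable[[bytes], bool]] = {
    "non_empty": lambda update: len(update) > 0,
    "decodes_as_utf8": decodes_as_utf8,
}

if __name__ == "__main__":
    print(release_gate(b"rule: block", CHECKS))
    print(release_gate(b"\xff\xfe", CHECKS))
```

The design choice that matters is the default: nothing is released unless the gate says so, and no update is ever "too small" to skip it.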

Appvance IQ (AIQ) covers all your software quality needs with the most comprehensive autonomous software testing platform available today.
