CrowdStrike Says Code-Testing Bugs Failed to Prevent Outage

Cybersecurity Vendor's Preliminary Review Details Problems, Promises Improvements

CrowdStrike has blamed internal testing failures, including buggy testing software, for failing to prevent the faulty "rapid content update" Friday that caused worldwide disruption.

The company on Tuesday published its preliminary review into the incident, involving the faulty Channel File 291 for its Falcon endpoint detection and response software.

After receiving the threat update data, 8.5 million online Windows hosts that used Falcon crashed to a "blue screen of death," rebooted and then became stuck in an endless crash-and-reboot loop. Reflecting the types of organizations that use Falcon, the disruption led to serious outages across numerous critical sectors, including for major healthcare, banking, stock market and media organizations, as well as railways and airlines.

The report from CrowdStrike details what happened and when, as well as the steps the company will take to try to prevent a repeat occurrence. The company has also pledged to release a full "root cause analysis" into the incident once it completes its investigation.

Security experts have saluted the timeliness and detail contained in CrowdStrike's initial review. "It's good and really honest," said British cybersecurity expert Kevin Beaumont.

One "key takeaway," he said, is that CrowdStrike has committed to a "smart" change - it will no longer deploy threat updates simultaneously to every Falcon endpoint, but rather will deploy them in a more careful, gradual and well-monitored process.

Many other security software vendors, including Microsoft, already avoid pushing endpoint protection platform updates simultaneously to every client. Staggering updates lets the initial deployments serve as a canary in the coal mine, in case something unexpected occurs.

CrowdStrike pushed the faulty Falcon configuration update Friday at 04:09 UTC, leading to crashes. Seventy-eight minutes later, the company "reverted" the file. Some systems successfully rebooted, received the new file and recovered. Many more systems have required manual intervention.

Multiple airlines were temporarily grounded Friday due to the incident, stranding travelers. U.S. carrier Delta has been especially hard-hit, although it has been recovering. On Tuesday, the airline canceled just 14% of its flights, compared to 36% on Sunday, according to flight tracking service FlightAware.

As of Monday, IT asset tracking provider Sevco Security reported seeing 93% recovery rates of CrowdStrike Falcon software among its client base.

Both CrowdStrike and Microsoft have released tools to help automate the recovery process, but many of them must be run from bootable USB drives and therefore require remote workers to come on-site to get a fix.

On Tuesday, CrowdStrike delivered a previewed update - it added the faulty file to the CrowdStrike Cloud's list of known bad files, since the faulty file likely still resided on numerous systems, even if it was no longer being accessed. The update was effective immediately for customers who use the company's US-1, US-2 and EU clouds, and it is available on demand for government customers.

One immediate upside from the move is that "for impacted systems with strong network connectivity, this action could also result in the automatic recovery of systems in a boot loop," since affected systems may attempt to contact the CrowdStrike Cloud for updates and receive instructions to excise the bad file, the company said.

For organizations that use full-disk encryption, which is considered a best practice and is required by some regulations, recovering systems often requires entering a unique 48-digit key to unlock BitLocker full-disk encryption, which adds time and complexity to the recovery process (see: CrowdStrike Disruption Restoration Is Taking Time).

Preliminary Report

In its preliminary review into the incident, CrowdStrike said that on Feb. 28, it released an update to its Falcon sensor - version 7.11 - that gives the sensor new functionality for detecting threats, via an InterProcessCommunication, or IPC, template type. These templates are designed "to detect novel attack techniques that abuse named pipes," which are an operating system mechanism for communication between processes.

The IPC templates are distributed "in a proprietary binary file that contains configuration data," which CrowdStrike said "is not code or a kernel driver." The configuration data "maps to specific behaviors for the sensor to observe, detect or prevent."

The company said it successfully stress-tested the new IPC template type on March 5, using "a variety of operating systems and workloads." In April, the company pushed three new, separate IPC templates to users, which "performed as expected in production."

The opposite happened Friday, when CrowdStrike pushed two new IPC templates to Falcon endpoints, and one of the templates "passed validation despite containing problematic content data," it said. "When received by the sensor and loaded into the Content Interpreter, problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD)."
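To illustrate the class of failure described above, here is a minimal Python sketch of content validation plus bounds-checked reads. It is not CrowdStrike's code: the preliminary review does not describe the Channel File format or what made the content problematic, so `TemplateInstance`, `validate`, `read_field`, `load_templates` and the idea of a fixed expected field count are all hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass


class TemplateError(Exception):
    """Raised when a template instance cannot be interpreted safely."""


@dataclass(frozen=True)
class TemplateInstance:
    # Hypothetical shape: a name plus a list of matching parameters.
    name: str
    fields: list[str]


def validate(instance: TemplateInstance, expected_fields: int) -> None:
    """Reject content before deployment if its field count is wrong."""
    if len(instance.fields) != expected_fields:
        raise TemplateError(
            f"{instance.name}: expected {expected_fields} fields, "
            f"got {len(instance.fields)}"
        )


def read_field(instance: TemplateInstance, index: int) -> str:
    """Bounds-checked read: a bad index becomes a handled error."""
    if not 0 <= index < len(instance.fields):
        raise TemplateError(f"{instance.name}: field {index} out of range")
    return instance.fields[index]


def load_templates(instances, expected_fields: int):
    """Load valid instances and skip invalid ones instead of failing hard."""
    loaded, rejected = [], []
    for instance in instances:
        try:
            validate(instance, expected_fields)
            loaded.append(instance)
        except TemplateError as err:
            rejected.append(str(err))
    return loaded, rejected
```

In this sketch, a malformed instance is rejected and reported while valid ones still load. A kernel-mode component written in C or C++ has no such safety net by default, which is why an unchecked out-of-bounds read there can take down the whole operating system.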

Upcoming Testing and Deployment Changes

The company has promised to introduce a number of software resiliency and testing improvements, ranging from more thorough and varied types of testing to updating the Content Interpreter in its software to better handle unexpected errors.

For rolling out future rapid response content, CrowdStrike said it will "implement a staggered deployment strategy," gradually rolling out updates globally after "starting with a canary deployment." The company said it will also give customers "greater control" over updates, including "when and where" they are deployed, and will more closely monitor collective "sensor and system performance" to guide future content rollouts.
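A staggered deployment that starts with a canary can be sketched roughly as follows. This is an assumption-laden illustration, not CrowdStrike's planned implementation: the function name `staged_rollout`, the stage fractions, the failure threshold and the `deploy` health-check callback are all hypothetical.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
    max_failure_rate: float = 0.01,
) -> dict:
    """Deploy to growing slices of hosts, halting if a stage looks unhealthy.

    deploy(host) is assumed to return True if the host is healthy
    after receiving the update.
    """
    done = 0
    total_failures = 0
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated by this stage.
        target = max(1, int(len(hosts) * fraction))
        batch = hosts[done:target]
        if not batch:
            continue
        failures = sum(1 for host in batch if not deploy(host))
        total_failures += failures
        done = target
        # Stop before widening the rollout if too many hosts failed.
        if failures / len(batch) > max_failure_rate:
            return {"halted": True, "deployed": done, "failures": total_failures}
    return {"halted": False, "deployed": done, "failures": total_failures}
```

With 100 hosts and the default fractions, the first stage touches a single canary host. If that host fails its health check, the rollout halts and the remaining 99 hosts never receive the update.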

Security experts said the impact of a single faulty CrowdStrike software update reveals bigger-picture industry problems tied not just to technology but also interconnectivity (see: CrowdStrike, Microsoft Outage Uncovers Big Resiliency Issues).

"We have a small number of cyber companies effectively operating as God Mode on the world's economy now," Beaumont said in a blog post. A more ideal scenario, he said, would involve customers being able to "have zero trust in cybersecurity vendors."

Given how interconnected software and the safe functioning of so many different parts of society are, "there needs to be some way to enforce less risky behavior across all vendors," Beaumont said. "This should also include Microsoft's security solutions."


About the Author

Mathew J. Schwartz

Executive Editor, DataBreachToday & Europe, ISMG

Schwartz is an award-winning journalist with two decades of experience in magazines, newspapers and electronic media. He has covered the information security and privacy sector throughout his career. Before joining Information Security Media Group in 2014, where he now serves as the executive editor, DataBreachToday and for European news coverage, Schwartz was the information security beat reporter for InformationWeek and a frequent contributor to DarkReading, among other publications. He lives in Scotland.



