CrowdStrike released a Preliminary Post Incident Review (PIR) on the faulty Falcon update explaining that a bug allowed bad data to pass its Content Validator and cause millions of Windows systems to crash on July 19, 2024.
The cybersecurity company explained that the issue was caused by a problematic content configuration update meant to gather telemetry on new threat techniques.
After passing the Content Validator, the update didn't go through additional verifications due to trust in previous successful deployments of the underlying Inter-Process Communication (IPC) Template Type. Therefore, it wasn't caught before it reached online hosts running Falcon version 7.11 and later.
The company realized the error and reverted the update within an hour.
However, by then, it was too late. Approximately 8.5 million Windows systems, if not more, suffered an out-of-bounds memory read and crashed when the Content Interpreter processed the new configuration update.
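The PIR does not include the interpreter's code, but the failure mode it describes is a classic parsing bug: the interpreter reads more input fields than the content actually supplies. Below is a minimal, purely illustrative C sketch of that pattern; the struct, field names, and field counts are hypothetical, not CrowdStrike's code.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical content record: the update supplies some number of
 * input fields for the detection logic to match against.            */
struct content {
    size_t       field_count;  /* fields the update actually provides */
    const char **fields;       /* array with field_count entries      */
};

#define EXPECTED_FIELDS 21     /* what the interpreter was built for  */

static void interpret(const struct content *c)
{
    /* BUG: loops over the expected count instead of c->field_count.
     * With only 20 fields supplied, fields[20] is an out-of-bounds
     * read; in a kernel-mode component an invalid access like this
     * crashes the entire machine rather than a single process.      */
    for (size_t i = 0; i < EXPECTED_FIELDS; i++)
        printf("field %zu: %s\n", i, c->fields[i]);
}

int main(void)
{
    const char *twenty[20];
    for (size_t i = 0; i < 20; i++)
        twenty[i] = "provided-field";

    struct content bad = { .field_count = 20, .fields = twenty };
    interpret(&bad);           /* undefined behavior at i == 20 */
    return 0;
}
```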
Inadequate testing
CrowdStrike uses configuration data called IPC Template Types that allow the Falcon sensor to detect suspicious behavior on devices where the software is installed.
IPC Templates are delivered through regular content updates that CrowdStrike calls 'Rapid Response Content.' This content is similar to an antivirus definition update, allowing CrowdStrike to adjust a sensor's detection capabilities to find new threats by simply changing its configuration data, without requiring a full sensor update.
In this case, CrowdStrike attempted to push a new configuration to detect malicious abuse of Named Pipes in common C2 frameworks.
While CrowdStrike has not specifically named the C2 frameworks it targeted, some researchers believe the update attempted to detect new named pipe features in Cobalt Strike. BleepingComputer contacted CrowdStrike on Monday about whether Cobalt Strike detections caused the issues but did not receive a response.
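CrowdStrike has not published the detection logic itself, but named-pipe detections in endpoint products often come down to matching newly observed pipe names against patterns associated with C2 tooling. The hypothetical C sketch below illustrates that general idea; the pipe prefixes are made-up placeholders, not CrowdStrike's rules.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Placeholder patterns; real C2 frameworks let operators rename
 * their pipes, so production rules are far more nuanced than this. */
static const char *suspicious_prefixes[] = {
    "\\\\.\\pipe\\evil_agent_",
    "\\\\.\\pipe\\demo_c2_",
};

static bool pipe_name_is_suspicious(const char *pipe_name)
{
    for (size_t i = 0; i < sizeof(suspicious_prefixes) /
                           sizeof(suspicious_prefixes[0]); i++) {
        size_t len = strlen(suspicious_prefixes[i]);
        if (strncmp(pipe_name, suspicious_prefixes[i], len) == 0)
            return true;   /* raise a detection / telemetry event */
    }
    return false;
}

int main(void)
{
    const char *observed = "\\\\.\\pipe\\demo_c2_1234";
    printf("%s -> %s\n", observed,
           pipe_name_is_suspicious(observed) ? "suspicious" : "ok");
    return 0;
}
```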
According to the company, the new IPC Template Type and the corresponding Template Instances tasked with implementing the new configuration were thoroughly tested using automated stress testing techniques.
These tests include resource utilization, system performance impact, event volume, and adverse system interactions.
The Content Validator, the component responsible for validating Template Instances, checked and approved three individual instances, which were pushed on March 5, April 8, and April 24, 2024, without issue.
On July 19, two additional IPC Template Instances were deployed, one of which contained the faulty configuration that the Content Validator missed due to a bug.
CrowdStrike says that due to baseline trust from the previous tests and successful deployments, no additional testing, such as dynamic checks, was performed, so the bad update reached clients and caused the massive global IT outage.
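The PIR describes the Content Validator bug only at a high level. Purely as an illustration of how a validator can approve content that later breaks an interpreter, the hypothetical sketch below checks every field it is handed but never confirms that the expected number of fields is actually present.

```c
#include <stdbool.h>
#include <stddef.h>

#define EXPECTED_FIELDS 21     /* illustrative, as in the earlier sketch */

struct content {
    size_t       field_count;
    const char **fields;
};

/* Hypothetical validator: it inspects each field it was given, but
 * (the bug) never verifies that all EXPECTED_FIELDS are present,
 * so a 20-field instance sails through validation.                 */
static bool validate(const struct content *c)
{
    for (size_t i = 0; i < c->field_count; i++) {
        if (c->fields[i] == NULL)
            return false;
    }
    /* MISSING: if (c->field_count != EXPECTED_FIELDS) return false; */
    return true;
}

int main(void)
{
    const char *twenty[20];
    for (size_t i = 0; i < 20; i++)
        twenty[i] = "ok";

    struct content malformed = { .field_count = 20, .fields = twenty };
    return validate(&malformed) ? 0 : 1;   /* returns 0: approved */
}
```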
However, based on the PIR, Rapid Response Content relies on automated testing rather than local testing on internal devices, which would likely have caught the issue.
CrowdStrike says they will introduce local developer testing for future Rapid Response Content, as explained below.
New measures
CrowdStrike is implementing several additional measures to prevent similar incidents in the future.
Specifically, the firm listed the following additional steps when testing Rapid Response Content:
- Local developer testing
- Content update and rollback testing
- Stress testing, fuzzing, and fault injection
- Stability testing
- Content interface testing
Moreover, additional validation checks will be added to the Content Validator, and error handling in the Content Interpreter will be improved so that similar mistakes no longer leave Windows machines inoperable.
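The PIR does not spell out what the improved error handling will look like. A hypothetical hardened version of the earlier interpreter sketch would bound every read by the data actually present and fail closed, dropping the bad content instead of crashing the host.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define EXPECTED_FIELDS 21

struct content {
    size_t       field_count;
    const char **fields;
};

/* Hypothetical hardened interpreter: reject malformed content up
 * front and never index past the data that is actually present,
 * so a bad update is discarded instead of taking down the host.   */
static bool interpret_safely(const struct content *c)
{
    if (c == NULL || c->fields == NULL ||
        c->field_count != EXPECTED_FIELDS)
        return false;                      /* fail closed, log, move on */

    for (size_t i = 0; i < c->field_count; i++) {
        if (c->fields[i] == NULL)
            return false;
        printf("field %zu: %s\n", i, c->fields[i]);
    }
    return true;
}

int main(void)
{
    struct content bad = { .field_count = 20, .fields = NULL };
    if (!interpret_safely(&bad))
        fprintf(stderr, "content rejected, sensor keeps running\n");
    return 0;
}
```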
Regarding Rapid Response Content deployment, the following changes are planned:
- Implement a staggered deployment strategy, starting with a small canary deployment before gradually expanding (a rollout-gating sketch follows this list).
- Improve monitoring of sensor and system performance during deployments, using feedback to guide a phased rollout.
- Provide customers with more control over the delivery of Rapid Response Content updates, allowing them to choose when and where updates are deployed.
- Offer content update details via release notes, which customers can subscribe to for timely information.
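CrowdStrike has not said how the staggered rollout will be implemented. A common approach, sketched hypothetically below, is to hash a stable host identifier into a bucket and deliver the update only to hosts whose bucket falls under the current rollout percentage, raising that percentage as monitoring stays healthy.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* FNV-1a hash: maps a stable host ID to a deterministic bucket. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    for (; *s; s++) {
        h ^= (uint8_t)*s;
        h *= 16777619u;
    }
    return h;
}

/* A host receives the update only if its bucket (0-99) is below the
 * current rollout percentage; the canary ring is simply a small
 * percentage that is raised as monitoring stays healthy.           */
static bool host_in_rollout(const char *host_id, unsigned rollout_pct)
{
    return (fnv1a(host_id) % 100u) < rollout_pct;
}

int main(void)
{
    const char *host = "host-3f2a";          /* hypothetical host ID */
    unsigned stages[] = { 1, 10, 50, 100 };  /* canary -> full fleet */

    for (size_t i = 0; i < sizeof(stages) / sizeof(stages[0]); i++)
        printf("stage %u%%: %s\n", stages[i],
               host_in_rollout(host, stages[i]) ? "deploy" : "hold");
    return 0;
}
```

Deterministic bucketing means a given host keeps its place in the rollout order across stages, which keeps canary results comparable as the percentage grows.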
CrowdStrike has promised to publish a more detailed root cause analysis in the future, with more details to come once the internal investigation is completed.
Comments
electrolite - 3 months ago
Specifically, the firm listed the following additional steps when testing Rapid Response Content:
Local developer testing
Well duh! Go BSOD your own machine first before you go around causing BSODs at large.
powerspork - 3 months ago
Does this response indicate that they were not testing or staggering updates before this?
They relied on one automated system to verify the patch was "good"?
Yikes.
Speaking of BSODing your own machine first, what OS are they running that they did not notice this immediately?
NoneRain - 3 months ago
They're applying automated testing including "resource utilization, system performance impact, event volume, and adverse system interactions", but allegedly these failed due to a bug.
Yes, usually these updates are tested by an "automated system" that orchestrates different tests. That doesn't mean it always works as a single system, nor that the results are handled by the same system. The results may be checked and approved by staff, so either they were not really checking test results, or the results are a bunch of useless check marks, which would itself be atrocious.
NoneRain - 3 months ago
Check this out: https://community.fortinet.com/t5/Blogs/FortiEDR-s-software-and-content-update-release-process/ba-p/327255
FortiEDR has a really robust QA process compared to even the new measures CrowdStrike is implementing (too late).