The CrowdStrike Outage is an Opportunity
Let's think about why we allow streaming updates to production, and what some other options might be
CrowdStrike is in the news today because they released a faulty update to their Falcon Sensor, which caused millions of Windows systems around the world to go into a boot loop. For details on the issue, you can read their official statement on the topic. My condolences to the various teams that are dealing with all of that right now. Stay strong!
I'm not going to jump on the bandwagon and bash CrowdStrike in this article, though. There is plenty of that out there already. Instead, I thought I would reflect a bit on what I see as a big miss with Endpoint Security Software in general: unfettered access to production.
Let me pose a leading question
Would you ever allow a third party to install software on your production systems without first testing it in some kind of non-production environment? How about software that explicitly has full system or kernel-level access across your entire infrastructure? The answer in 99% of instances would be "Absolutely not", yet that 1% is allowed in the case of AV software. In fact, it has become something that many people assume should be allowed, and most don't question it unless they work in highly sensitive areas. Why is that?
I believe the main reason is simple: we want timely updates in our security software. It just makes sense, right? Zero-day vulnerabilities are a real concern, and applying updates as soon as they are available is just about the best thing we can do to address the problem. With that approach, though, comes the requirement to push those changes to prod as quickly as possible. That risk is just assumed to be worth it, but is it really? Or, rather, should we assume that it is always worth the risk?
According to IBM’s X-Force threat intelligence team, zero-day vulnerabilities account for about 3% of all vulnerabilities. That's not insignificant, though it is small. Considering that nation-states like China are some of the top abusers of these vulnerabilities, though, the stakes can be surprisingly high. However, when you look at some key findings from the Verizon Data Breach Investigations Report (DBIR), exploits and vulnerabilities are not the leading way in to a system - that's still credential leakage and phishing. Further, a full 68% of breaches involved the human element... not directly tied to exploits. So my next leading question is simple: Is the risk of system breach due to out-of-date security software and virus definitions always so severe that we need to make an exception to our production controls?
Going back to basics
Servers should be locked down from general user access, and even admin access should require quite a bit of proof of identity and need. These are the most basic of basics. In fact, there should be many other controls in place for reaching servers, including network isolation and segmentation. If we do these things well, we limit vulnerability exploitation to direct physical or network access only. This buys us time to do things like test our Security Software updates before rolling them out.
Commonly accepted practice with our own application deployments is that we roll out changes incrementally, sometimes by region, availability zone, or some other logical grouping. Using a Canary or similar pattern allows us to watch for the impact of the change set that is going live. Yet we do not apply that requirement to Security Software. In fact, such software usually has an entirely separate update channel that is minimally configurable. At the very least, we should be rolling out updates to security software incrementally to ensure they don't break our systems or the services running on them.
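To make the incremental idea concrete, here is a minimal sketch of a ring-based rollout in Python. Everything in it is hypothetical and for illustration only: the Ring type, the staged_rollout function, and the apply_update and is_healthy callbacks are my own names, not part of any vendor's product, and whether a given endpoint security tool exposes enough control to do this is something you would have to check for your environment.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Ring:
    """A hypothetical rollout group, e.g. a canary set, a region, or an AZ."""
    name: str
    hosts: Sequence[str]


def staged_rollout(
    rings: Sequence[Ring],
    apply_update: Callable[[str], None],   # assumed hook: pushes the update to one host
    is_healthy: Callable[[str], bool],     # assumed hook: post-update health check
    max_failure_rate: float = 0.0,
) -> List[str]:
    """Update one ring at a time and stop as soon as a ring looks unhealthy.

    Returns the names of the rings that were updated and passed the check.
    Later rings are never touched once a ring fails.
    """
    completed: List[str] = []
    for ring in rings:
        failures = 0
        for host in ring.hosts:
            apply_update(host)
            if not is_healthy(host):
                failures += 1
        # Halt the rollout if too many hosts in this ring failed the check.
        if ring.hosts and failures / len(ring.hosts) > max_failure_rate:
            break
        completed.append(ring.name)
    return completed
```

The point of the design is the early exit: a bad update stops at the smallest ring, so the blast radius is a handful of canary hosts rather than the whole fleet. A real implementation would also want a soak period between rings and a rollback step, which this sketch leaves out.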
Kiosk systems and end-user devices are certainly different, as they are exposed more directly to the human element. As such, they likely do require more timely updates. However, I believe there are other compensating controls that can be implemented here as well - chief among them being zero-trust type systems where access to other services is controlled more by user identity than by system-level security. Proper segmentation goes a long way toward buying us time that we can use to test a rollout of a security update.
Conclusion
I am not advocating that we stop updating our security software quickly, nor am I suggesting that we keep AV software off of our servers (though I think there is a case for that in some instances; heresy, I know). What I am suggesting, though, is that we have a lot of assumptions in the security update space that we should reevaluate, chief among them being that we need to allow endpoint security software to update itself immediately without looking at the risks involved.
I believe that we have gotten lucky with updates in the past on the endpoint security side of things. It's clear that vendors in this space spend a lot of time making sure their tools don't break things, but that isn't something we can rely on. Let's look at the opportunity in the CrowdStrike outage. We can guide our teams into taking a step back and examining our assumptions around what we are currently doing for endpoint security updates. Then we can think about ways to improve our operational practices to reduce our risk and prevent things like this outage from happening again.