ISMG Editors: What the CrowdStrike Outage Taught Us So Far
Panelists Discuss Immediate and Long-Term Impact of Global Outage
Anna Delaney (annamadeline) • July 19, 2024

In this special edition of the ISMG Editors' Panel, CyberEdBoard member Ian Thornton-Trump joined editors to discuss the fallout from the massive CrowdStrike IT outage, the reaction from the tech industry and how we can learn from the incident and create more resilient operations.
"We learned a valuable lesson about the fragility of our infrastructure," Thornton-Trump said. "We also learned how dangerous - we got a taste of what a global cyberattack would look like … I feel like we are at one of those inflection moments."
A faulty update to CrowdStrike Falcon software caused Windows PCs to crash and repeatedly reboot, leading to significant global disruptions for banking customers, hospital patients, airports, government agencies and retail shops. The first step for all affected organizations should be to get "on the phone to your customers, being open and transparent and giving them updates about when your services are going to come back online. This is really the key to handling this right now," he said.
The panelists - Thornton-Trump, CISO at Cyjax and a CyberEdBoard member; Anna Delaney, director of productions, ISMG; and Mathew Schwartz, executive editor, DataBreachToday and Europe, ISMG - discussed:
- The immediate and long-term effects of the outage on various sectors;
- The response from CrowdStrike and the broader cybersecurity community;
- Strategies for organizations to improve their resilience in the face of similar incidents.
The ISMG Editors' Panel runs weekly. Don't miss our previous installments, including the July 12 edition on how we should handle ransomware code flaws and the July 19 edition on how AT&T allegedly paid a ransom in the Snowflake breach.
Transcript
This transcript has been edited and refined for clarity.
Anna Delaney: Hello and welcome to this emergency edition of the ISMG Editors' Panel. I'm Anna Delaney, and today we're discussing a story that's dominating the headlines and being called the largest IT outage in history. This unprecedented event has grounded planes, canceled hospital appointments and closed shops worldwide. The cause - a faulty software update from CrowdStrike's Falcon sensor, which has led to a mass outage affecting Windows PCs. With me to discuss are CyberEdBoard member Ian Thornton-Trump, CISO at Cyjax, and ISMG's Mathew Schwartz, executive editor of DataBreachToday and Europe. Thank you, gentlemen, for joining me on such short notice.
Ian Thornton-Trump: My pleasure.
Mathew Schwartz: Happy to be here, Anna. It's a fascinating story.
Delaney: Exactly! It has been fascinating. Both of my parents have been affected by this today. One was on a train; the other is at a hospital appointment. So, how funny is that - both parents messaging me about my job. But, more seriously, Mat, you've been working on this since the early hours of this morning. Start off with a breakdown of what happened, why it happened and where we are now, considering the story is changing by the minute. It's Friday afternoon, 19th July, U.K. time.
Schwartz: Friday afternoon - every IT administrator's hope and dream is to get away for the weekend. And here we are, seeing those hopes and dreams get quashed, Anna. It's a horrible story on that front, and horrible as well, honestly, in terms of the disruption that we've been seeing. We've seen patient procedures canceled in the south of England and beyond. We've seen disruption to rail and plane travel throwing a lot of people's travel plans into chaos. Online banking unavailable. Self-service checkout kiosks, or payment cards in some stores, not able to be used. So, there are widespread IT disruptions, and we can get into what that means, and whether this should or shouldn't be a surprise to anyone - present company excepted, obviously.

But, brief history - what seems to have happened? It looks like the problems began to occur around 6 p.m. Eastern Time in the U.S. on Thursday. That would be round about midnight for Europe, and Australia was also among the first of the regions to report widespread outages. What we know now is that CrowdStrike pushed an update for its Falcon cybersecurity software, which runs on endpoints - PCs, servers, as well as virtual servers - and helps its software watch for bad stuff. Unfortunately, in this case, the bad stuff was this software update, which caused the software to crash. Not only that - it caused Windows to reboot and then crash and then reboot and then crash. So when I woke up this morning, I was seeing IT administrators in Britain take to social media saying, "Help! My screens are stuck in this nonstop repeating blue screen of death."

What happened after that? Pretty quickly, CrowdStrike confirmed it was investigating and that it seemed to be the culprit here. CrowdStrike fixed the software and started to push that via its automatic update channel, which is how it got to customers in the first place. With some systems, we've heard the update has taken care of the problem. With some other systems, it has eventually taken care of the problem. For example, Microsoft said that with virtual servers, up to 15 reboots have reportedly been required before they pick up the clean software. But we have also seen, unfortunately, that some of these systems stuck in the endless reboot are just rebooting, and there's no way to get the clean software on there unless administrators get hands on keyboard with the actual physical system and install a workaround. I'm sure Ian knows a lot more detail about this than I do. But, get a workaround on that system. Unfortunately, in an organization with hundreds or thousands of systems, the complexity and the time required to make that happen is a big question mark. We're going to see teams working through the weekend, and probably well beyond at some larger organizations, to get these fixes in place - to break these machines off of this nonstop reboot cycle.

Aside from that, very briefly, we've seen George Kurtz, the CEO of CrowdStrike, do the rounds. He's been on the news shows. CrowdStrike's put out a statement saying, "We're very sorry. We're getting to the bottom of this. Customers, please go to our support portal." "We're closely helping them," is what they've said. "Reach out to your liaison or a company representative, and we will help get you through this crisis." So that's the long and the short of it from a who, what, where, why and when standpoint, Anna.
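For context, the hands-on workaround Schwartz describes essentially amounted to booting each affected machine into Safe Mode or the Windows Recovery Environment and deleting the faulty Falcon channel file. The sketch below is illustrative only: the directory path and the "C-00000291*.sys" filename pattern are taken from public reporting at the time and should be treated as assumptions, and anyone remediating should follow CrowdStrike's official guidance rather than this example.

```python
# A minimal sketch of the hands-on workaround, assuming the machine has already
# been booted into Safe Mode or the Windows Recovery Environment and that the
# publicly reported file pattern applies. Defer to CrowdStrike's official
# remediation guidance; this is illustrative only.
import glob
import os

# Assumed default location of Falcon channel files on Windows.
CROWDSTRIKE_DIR = r"C:\Windows\System32\drivers\CrowdStrike"
# Filename pattern of the reportedly faulty channel file.
FAULTY_PATTERN = "C-00000291*.sys"


def remove_faulty_channel_file(directory=CROWDSTRIKE_DIR, pattern=FAULTY_PATTERN):
    """Delete channel files matching the reported faulty pattern.

    Returns the list of removed paths so the operator can log what was touched.
    """
    removed = []
    for path in glob.glob(os.path.join(directory, pattern)):
        os.remove(path)  # requires Administrator privileges
        removed.append(path)
    return removed


if __name__ == "__main__":
    deleted = remove_faulty_channel_file()
    if deleted:
        print("Removed:")
        for path in deleted:
            print("  " + path)
    else:
        print("No matching channel file found - system may already be remediated.")
```

Machines protected with BitLocker full-disk encryption reportedly also required their recovery keys before Safe Mode could be reached, which is a large part of why this manual effort scaled so poorly across big fleets.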
Delaney: Very well done. Excellent work. Bugs happen, Mat. Does the fault lie with CrowdStrike, or is this the result of poor resiliency on Microsoft's side?
Schwartz: I'd like to put this question to Ian, because he's seen software bugs throughout his career. What I'm hearing, briefly though, is that bugs do happen. And some people have come out and said, "Look, why is it that CrowdStrike was able to cause Windows systems to go into this perpetual cycle of reboot?" Yes, there's a quality assurance thing they should probably be doing better. But isn't Microsoft also somewhat responsible here for not seeing this sort of behavior and proactively shutting it down for end users?
Thornton-Trump: I'm in the belly of the beast at an information security conference here in Stoke. The talk has all been around CrowdStrike and the global impact. We know it gets serious when the London Stock Exchange has to shut down, because that's millions of dollars per minute coursing through the very lifeblood of the economy. So, it's a big wake-up call. Now, to be fair to CrowdStrike, they are owning this problem. They are certainly being transparent about it. It's going to raise questions - and it will get a bit technical here - about the CI/CD pipeline, quality assurance of the software and the overall philosophy of how to safeguard things when something breaks. The biggest takeaway from this is that we're going to have to look very closely at how we manage the resiliency of our software suppliers and make sure that we have a plan - like I said in Mathew's article, a plan to deal, at an incident response level, with a bad vendor update.
Delaney: Ian, Mat said there that the first responders will be working throughout the weekend and into the next few weeks. How much of a headache is this? How long will this take?
Thornton-Trump: It's a big headache. The update has failed in such a way that the operating system can't be brought up in a stable mode. For many machines, this will mean either a complete reinstallation or applying the fix that has been sent out - deleting the corrupted file. And again, how that corrupted file got consumed and got into the product raises some questions around the update process, because if this is a product deficiency from a cybersecurity perspective, then I'm glad CrowdStrike did it to themselves rather than a threat actor doing it to them. So, there's a lot of lessons and takeaways here. But, for anyone out there that is on the hyperbole train, I don't think this is an existential threat to CrowdStrike. When I last checked the stock, it was down around 11%. It's a robust company, and they're doing everything right. They're facing an issue, and they're not shying away from it or trying to point in a different direction.
Delaney: Yeah, you're not the first to have said that. Mat, you said something similar earlier. I reckon by next week they'll be back to where they were.
Schwartz: It's easy to get cynical, especially when you've been reporting on data breaches for as long as I have and expecting to see changes and fixes. But to be fair, this is a bug, as you said, Anna, and bugs happen, and so the question is, how does an organization deal with it? CrowdStrike is a big player. They've got some very smart people inside the organization. I firmly expect them to deal with this incredibly quickly and thoroughly. As Ian said, there's some great stuff you can do for fail states. We've seen this sort of problem before, and companies have built their software in such a way as to prevent this sort of thing from happening ever again, and we'll see that extremely quickly. Wall Street chatter is that there's going to be no long-term impact here. As we see with data breaches, there's almost never a long-term impact unless you're a cryptocurrency exchange that loses all of its cryptocurrency, which CrowdStrike most assuredly is not.
Delaney: But, the big question is, wasn't this inevitable when companies are so reliant on so few vendors - dominant cloud vendors? What do you think, Ian?
Thornton-Trump: It raises a big question about certain market capitalizations. CrowdStrike has roughly $75 billion in market capitalization. They are everywhere. They're the darling of a lot of Fortune 500 and FTSE companies. This is a big issue, and they have now felt the pain - no question in my mind. They are in a financial position where they can be very generous to organizations that have suffered impacts. Inevitably, as we see in the wake of these types of things, you're going to see class action lawsuits and attempts at restitution and damages and all those types of things. That's an inevitability with a failure of this magnitude. But, they are in a good cash position as an organization, and they have the opportunity to come out of this looking quite good. That's what we need to focus on. I will add another thing - I've seen some tremendously disappointing commentary from other vendors, along the lines of "Having problems with CrowdStrike? Switch to our brand." That is just inappropriate when we're dealing with a global crisis, and I feel like that's attracting some bad karma. My warning - and I've been doing threat intelligence for probably as long as Mathew's been writing about data breaches - is that your time will come, and you will wish you had better karma. So let's take this opportunity to unite around our brothers and sisters who are struggling right now with massive technological challenges in front of them. You're right - it's going to take a long time to clean up. There are some folks who are looking at just restoring their infrastructure from their backups, if they had them, rather than messing around with this - bringing systems back to that pre-update state. So some organizations will have been ready for this. Mathew, maybe you have some thoughts here too - this does feel a little bit like the data destruction attacks that have been unleashed on many companies, from Saudi Aramco to, more recently, Ukraine. So organizations get it, and it was on their risk register - maybe just not in the form of a bad update.
Schwartz: Definitely. That's a great point, and one of the saving graces here is that IT can go get hands on keyboards and nothing is going to be missing. We're not looking at data exfiltration prior to ransomware being unleashed, and hopefully organizations have learned a lot of lessons from those attacks. And like Ian said, they can just restore to the last known good state. One of the kickers, though, is that these sorts of restorations often take a lot more time than organizations are expecting. If you're a small doctor's office with IT expertise, maybe you can restore your 5 or 10 systems in an afternoon. But a lot of larger organizations can't restore everything all at once, and then it becomes a question of triage - and of making sure the restorations did happen successfully. Meanwhile, you've got productivity impacted along the way. So the long-term prognosis is good, but the likelihood of a short-term headache also seems pretty high.
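As a rough illustration of the "making sure the restorations did happen successfully" point, the sketch below checks whether a list of supposedly restored hosts is responding again. The hostnames and the port are hypothetical; a real verification pass would also confirm that the endpoint security sensor and the business applications on each machine are healthy.

```python
# A minimal sketch of post-restore verification, assuming a plain list of
# restored hostnames and that a successful TCP connection is a good-enough
# first signal that a machine came back. The hostnames and port are
# hypothetical; real checks would go deeper than reachability.
import socket

RESTORED_HOSTS = ["ws-0142", "ws-0143", "srv-billing-01"]  # hypothetical names
CHECK_PORT = 445  # SMB; pick whichever service matters for each host's role
TIMEOUT_SECONDS = 3


def is_reachable(host, port=CHECK_PORT):
    """Return True if a TCP connection to the host succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=TIMEOUT_SECONDS):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    still_down = [h for h in RESTORED_HOSTS if not is_reachable(h)]
    up_count = len(RESTORED_HOSTS) - len(still_down)
    print(f"{up_count}/{len(RESTORED_HOSTS)} restored hosts responding")
    for host in still_down:
        print("Needs follow-up:", host)
```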
Delaney: The irony is, of course, it's Friday afternoon.
Schwartz: Always.
Delaney: While it's Thursday in London. Always the case. So Ian, as organizations are rushing to address the issue, what vulnerabilities are they potentially exposed to, and how might cybercriminals exploit those weaknesses?
Thornton-Trump: Yeah, this is a good question, because if you're backing out your EDR solution, bringing systems down in an insecure state and then bringing them back up, you'd better hope that your organization can reapply the EDR solution, with the updates as required, before it gets hammered by cybercriminals. So it's definitely an opportunity. Now, at Cyjax, we've been carefully monitoring the criminal community on the various forums, and so far, there hasn't been a "Woohoo! It's a free buffet at the restaurant" reaction. But, they don't necessarily share all their evil, sinister plans openly. So, we are in a precarious state. We learned a valuable lesson about the fragility of our infrastructure. We got a taste of what a global cyberattack would look like and how dangerous it can be. I was making the joke that this is a "One-off Strike," but maybe that's a bit off-color, I don't know. But, I feel like we are at one of those inflection moments, just like a couple of weeks ago, when Mathew again entertained my thoughts on making MFA mandatory. We're at this point now where we need to think about that resiliency and how we're going to get back on our feet no matter what. And in many organizations, this could have been a single click that resulted in this type of damage or disruption. So this is a great live-fire exercise, as they like to say. And let's hope the sweat being put into restoring operations - without a loss of data, like Mathew said - pays off. You're going to have to back out transactions on databases and reapply them, stuff like that. It's not going to be a case of flipping it all back on and everything is good. These are disruptive events for a reason, but it's a great exercise - just maybe not one everyone wanted to have on what is shaping up to be one of the more beautiful days in the U.K.
Delaney: Exactly. So what's your advice to organizations right now? What would you be telling them?
Thornton-Trump: First, work the problem, and the problem is we've had this bad update; some systems are broken, some systems are damaged and some systems are likely running. Now is the time to declare a major crisis and emergency response, bring in the necessary resources that you need - that may include pizza and coffee - and start bringing back the systems that are most important. What are your crown jewels? What is the thing that makes you money? What is the thing that allows you to take money from people? And what is the service you provide to get that money? Move as quickly as you possibly can to get the crown jewels back up and running. Workstations - they're commodities. No one cares at this moment that the workstations in call centers aren't necessarily working. What you need to do is make sure you're looking after customers. From an executive perspective, you should be on the phone attending to your customers, being open and transparent and giving them updates about when your services are going to come back online. This is the key to handling this right now. It's not a technical issue - we know what to do. The question now is to be open, transparent and honest, tell folks when you're going to be back in business, and ask them to give you a bit of patience, because there are many businesses out there in this predicament, and what we need right now is a little breathing room to do the best that we can.
Delaney: Well said. And our thoughts are with everybody working tirelessly on this over the weekend - blood, sweat and tears, weekend plans canceled on, as you say, one of the more glorious weekends of the year. Well team, thank you so much for your analysis. It's been great to discuss, and we know that this is a developing story, so we'll continue to monitor the situation closely. Stay tuned for further updates. But, in the meantime, thank you so much, Mat and Ian, for your excellent insights.
Schwartz: Thanks for having us, Anna.
Delaney: Thank you so much for watching. Until next time.