This is an unpopular opinion, and I get why – people crave a scapegoat. CrowdStrike undeniably pushed a faulty update demanding a low-level fix (booting into recovery). However, this incident lays bare the fragility of corporate IT, particularly for companies entrusted with vast amounts of sensitive personal information.
Robust disaster recovery plans, including automated processes to remotely reboot and remediate thousands of machines, aren’t revolutionary. They’re basic hygiene, especially when considering the potential consequences of a breach. Yet, this incident highlights a systemic failure across many organizations. While CrowdStrike erred, the real culprit is a culture of shortcuts and misplaced priorities within corporate IT.
Too often, companies throw millions at vendor contracts, lured by flashy promises and neglecting the due diligence necessary to ensure those solutions truly fit their needs. This is exacerbated by a corporate culture where CEOs, vice presidents, and managers are often more easily swayed by vendor kickbacks, gifts, and lavish trips than by investing in innovative ideas with measurable outcomes.
This misguided approach not only results in bloated IT budgets but also leaves companies vulnerable to precisely the kind of disruptions caused by the CrowdStrike incident. When decision-makers prioritize personal gain over the long-term health and security of their IT infrastructure, it’s ultimately the customers and their data that suffer.
I don’t think it’s that uncommon an opinion. An even simpler version is the constant repetition, year after year, of data breaches, often because of inferior protection. As an amateur website creator decades ago I learned that plain-text passwords were a big no-no, so how are corporate IT departments still doing it? Even the non-tech person on the street rolls their eyes at such news, and yet it continues. CrowdStrike is just a more complicated version of the same thing.
The real problem is the monopolization of IT and the Cloud.
For sure there is a problem, but this issue left computers unable to boot at all, so how are you going to remotely reboot them if you can’t connect to them in the first place? Sure, there can be a way, like one other comment explained, but it’s so complicated and expensive that not even all of the biggest corporations do it.
Contrary to what a lot of people seem to think, CrowdStrike is pretty effective at what it does, that’s why they are big in the corporate IT world. I’ve worked with companies where the security team had a minority influence on choosing vendors, with the finance team being the major decision maker. So cheapest vendor wins, and CrowdStrike is not exactly cheap. If you ask most IT people, their experience is the opposite of bloated budgets. A lot of IT teams are understaffed and do not have the necessary tools to do their work. Teams have to beg every budget season.
The failure here is hygiene, yes, but in development and testing processes. Something that wasn’t thoroughly tested got pushed into production and released. And that applies to both CrowdStrike and their customers. That is not uncommon (hence the programmer memes), it just happened to be one of the most prevalent endpoint security solutions in the world that needed kernel-level access to do its job. I agree with you that IT departments should be testing software updates before they deploy, so it’s also on them to make sure they at least ran it in a staging environment first. But again, this is a time-critical tool (anti-malware) and companies need to have the capability to deploy updates fast. So you have to weigh speed vs reliability.
Booting a system or recovery image remotely over an IPMI or similar interface is not complicated or expensive. It is one of the most basic server management tasks. You acting like the concept is challenging seriously concerns me, and I wonder how anyone who thinks like that gets hired.
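For the record, the whole thing is a couple of ipmitool calls. A minimal sketch, assuming a reachable BMC; the address and credentials are placeholders, and a real fleet would loop this over an inventory:

```python
#!/usr/bin/env python3
# Minimal sketch: ask a machine's BMC to boot a recovery image from the
# network, then power-cycle it. Address and credentials are placeholders.
import subprocess

BMC_HOST = "10.0.0.50"   # hypothetical BMC address
BMC_USER = "admin"       # hypothetical credentials
BMC_PASS = "changeme"

def ipmi(*args: str) -> None:
    # Standard ipmitool over the lanplus interface, which most BMCs speak.
    subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", BMC_HOST,
         "-U", BMC_USER, "-P", BMC_PASS, *args],
        check=True,
    )

ipmi("chassis", "bootdev", "pxe")   # next boot comes from the network
ipmi("chassis", "power", "cycle")   # hard reboot into the recovery image
```

Loop that over an asset inventory and the “complicated and expensive” part mostly disappears for anything that has a BMC.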
There are exceptions, granted. However, the IT budget at most mid to large-size corporations is extremely bloated. I don’t think you can in good faith argue otherwise, unless you want to show me a budget that isn’t. Do you have a real one that you can provide?
These companies don’t even attract smart talent. They attract people that are complacent with doing nothing & collecting a paycheck. Smart people do not continue to work at these companies. The bureaucracy and management is soul-sucking. It took me a while to accept it too. I used to be optimistic thinking there is a logical explanation that can be fixed. Turns out they don’t want to be fixed. They like to be broken. Like I said, it starts from the top down. A lot of the staff wouldn’t even have a job if people actually tried to make things better.
It is one of the most basic server management tasks.
Except these were endpoint machines, not servers. Things ground to a halt not because servers went down, but because the computers end users interacted with crashed and wouldn’t boot, kiosk and POS systems included.
You acting like the concept is challenging seriously concerns me, and I wonder how anyone who thinks like that gets hired.
Damn, I guess all the IT people running the systems that were affected aren’t fit for the job.
unless you want to show me a budget that isn’t. Do you have a real one that you can provide?
Can you show me the bloated budgets and where they are allocated at those mid-to-large-size corporations? You are the one who insinuated that. All I said is that my experience at all the companies I’ve worked with is that we always had to fight hard for budget, because the sales and marketing departments bring in the $$$ and that’s all the executives like to see, therefore they get the budget. If your entire working experience is that your IT team had too much budget, then consider yourself privileged.
It’s weird how you’re all defensive and devolve to insults when people are just responding to your post.
Except these were endpoint machines, not servers. Things ground to a halt not because servers went down, but because the computers end users interacted with crashed and wouldn’t boot, kiosk and POS systems included.
Endpoint machines still have IPMI type of interfaces and PXE. When you manage thousands of machines, if you treat them all like a pet then you’re doing it wrong.
Damn, I guess all the IT people running the systems that were affected aren’t fit for the job.
Is it going to take them several days to weeks to recover? Then they aren’t fit for the job, or should consider another profession.
Can you show me the bloated budgets and where they are allocated at those mid-to-large-size corporations?
All of them. The Form 10-K filings are available for public corporations. The ones claiming that they will be impacted for a while are the ones I’m most concerned about.
It’s weird how you’re all defensive and devolve to insults when people are just responding to your post.
I spent a career arguing with sales reps who had one goal in mind, and that was to make the biggest commission possible. I sound argumentative because those sales reps had every tool imaginable to show up out of nowhere.
The underlying problem is that legal security is security theatre. Legal security provides legal cover for entities without much actual security.
The point of legal security is not to protect privacy, users, etc., but to protect the liability of legal entities when the inevitable happens.
neglecting the due diligence necessary to ensure those solutions truly fit their needs.
CrowdStrike perfectly met their needs by providing someone else to blame. I don’t think anybody is facing any consequences for hiring CrowdStrike. These bad incentives are the whole point of the system.
The crazy thing is CrowdStrike basically shut down a ton of really important things and their stock only went down 17%. Like, it was a huge blow to the economy for a couple of days and somehow investors were like “meh, not that bad”.
That’s because they had a lot of people “buying the dip”. CS is in a very similar position to SolarWinds during their 2020 security slipup. The extent of managerial issues there should’ve been unforgivable but unfortunately they got away with it and are doing just fine nowadays.
I don’t think anybody is facing any consequences for contracting with CrowdStrike.
This is the myth! As we all know there were very serious consequences as a result of this event. End users, customers, downstream companies, entire governments, etc were all severely impacted and they don’t give a shit that it was Crowdstrike’s mistake that caused the outages.
From their perspective it was the companies that had the upstream outages that caused the problem. The vendor behind the underlying problem is irrelevant. When your plan is to point the proverbial finger at some 3rd party you chose, that finger still points, 100% of the time, back at you.
When the CEO of Baxter International testified before Congress to try to explain why people died from using tainted Heparin he tried to hand wave it away, “it was the Chinese supplier that caused this!” Did everyone just say, “oh, then that’s understandable!” Fuck no.
Baxter chose that Chinese supplier and didn’t test their goods. They didn’t do due diligence. Baxter International fucked up royally, not the Chinese vendor! The Chinese vendor scammed them for sure but it was Baxter International’s responsibility to ensure the drug was, well, the actual drug and not something else or contaminated.
Reference: https://en.wikipedia.org/wiki/2008_Chinese_heparin_adulteration
I would think that an FDA ban on Chinese pharmaceuticals and an international arrest warrant for the Chinese supplier’s C-suite should have been effected.
The more likely outcome is that the US company’s CEO was found liable and probably didn’t spend a single day in a real prison cell.
bloated IT budgets
Can you point me to one of these companies?
In general IT is run as a “cost center”, which means they have to scratch and save everywhere they can. Every IT department I have seen is understaffed and spread too thin. Also, since it is viewed as a cost, getting all teams to sit down and make DR plans (since these involve the entire company, not just IT) is near impossible, because “we may spend a lot of time and money on a plan we never need”.
Audit the budgets at most corporations, especially Fortune 500s. The problem doesn’t start with IT but with bad management from the top down. This “cost center” you speak of is mostly what I’d expect to hear do-nothing middle managers tell their in-house employees when they ask for a raise.
It feels like you have an agenda that you are trying to apply to the CrowdStrike event, and IT just so happens to be the innocent bystander getting slandered in service of the agenda you are putting forward.
If you had to summarize the goal of your initial post in fewer than 10 words, what would it be?
Worked many high-level corp IT. Problem is them, not CrowdStrike.
Thanks for responding in good faith!
I agree that while CS did screw up in pushing out a bad update, only having a single vendor for a critical process that can take the whole business down is equally a screw up. Ideally companies should have had CS installed on half the systems and a secondary malware prevention system on every DR and “redundant” system. Having all of a company’s eggs in a single basket is very bad.
All the above being said: to properly implement a system that is fully redundant down to the vendor level would require either double the support team or a massive development effort to tie the management of the systems together. Either way, that is going to be very expensive. The point being: reducing the budget of IT departments will further drive the consolidation of vendors and increase the number of vendor-caused complete-outage events.
Well said!
C++ is the problem. C++ is an unsafe language that should definitely not be used for kernel space code in 2024.
Let’s rewrite everything in Rust. That’ll surely solve the world’s problems.
Thank you. Finally someone understands. Jokes aside though, I think we can acknowledge that C/C++ have caused decades of problems due to their lack of memory safety.
The virus definitions are not written in C++. And even then, the problem was that the file was full of zeros.
Maybe I heard some bad information, but I thought the issue was caused by a null pointer exception in C/C++ code. If you have a link to a technical analysis of the issue I would be interested to read it.
No one does, it’s not public yet, if ever. This is close enough.
The real problem was, among others, a lack of testing, regardless of the programming language used. Blaming C++ is dumb af. Put a chimpanzee behind the wheel of a Ferrari and you’ll still run into… problems.
I’ll reiterate, if it was a null pointer exception (I honestly don’t know that it was, but every comment I’ve made is based on that assumption, so let’s go with it for now) then I absolutely can blame C++, and the code author, and the code reviewer, and QA. Many links in the chain failed here.
C++ is not a memory safe language, and while it’s had massive improvements in that area in the last two decades, there are languages that make better guarantees about memory safety.
They said it was a “logic error”, so I think it was more likely some divide-by-zero or something like that.
I think it’s most likely a little of both. The fact that most systems failed at around the same time suggests this was the default automatic upgrade/deployment option.
So, for sure the default option should have had upgrades staggered within an organisation. But at the same time organisations should have been ensuring they aren’t upgrading everything at once.
As it is, the way the upgrade was deployed made the software a single point of failure that completely negated redundancies and in many cases hobbled disaster recovery plans.
Speaking as someone who manages CrowdStrike in my company, we do stagger updates and turn off all the automatic things we can.
This channel file update wasn’t something we can turn off or control. It’s handled by CrowdStrike themselves, and we confirmed that in discussions with our TAM and account manager at CrowdStrike while we were working on remediation.
That’s interesting. We use CrowdStrike, but I’m not in IT so I don’t know about the configuration. Is a channel file somehow similar to AV definitions? That would make sense, and I guess it means this was a bug in the CrowdStrike code that parses the file?
Yes to all of that.
Yes, CrowdStrike says they don’t need to do conventional AV definitions updates, but the channel file updates sure seem similar to me.
The file they pushed out consisted of all zeroes, which somehow corrupted their agent and caused the BSOD. I wasn’t on the meeting where they explained how this happened to my company; I was one of the people woken up to deal with the initial issue, and they explained this later to the rest of my team and our leadership while I was catching up on missed sleep.
I would have expected their agent to ignore invalid updates, which would have prevented this whole thing, but this isn’t the first time I’ve seen examples of bad QA and/or their engineering making assumptions about how things will work. For the amount of money they charge, their product is frustratingly incomplete. And asking them to fix things results in them asking you to submit your request to their Ideas Portal, so the entire world can vote on whether it’s a good idea, and if enough people vote for it they will “consider” doing it. My company spends a fortune on their tool every year, and we haven’t even been able to get them to allow case-insensitive searching, or searching for a list of hosts instead of individual hosts.
Thanks. That explains a lot of what I didn’t think was right regarding the almost simultaneous failures.
I don’t write kernel code at all for a living. But I do understand the rationale behind it, and it seems to me this doesn’t fit that expectation. It’s a lot of hypotheticals, but if I were writing this software, any processing of these files would happen in userspace. That would mean rejection of bad or badly formatted data happens there, and even if the data managed to crash the parser, it would just be an app crash.
The general rule I’ve always heard is that you want to keep the minimum required work in the kernel code. So I think processing/rejection should have been happening in userspace (and perhaps even using code written in a higher level language with better memory protections etc) and then a parsed and validated set of data would be passed to the kernel code for actioning.
But, I admit I’m observing from the outside, and it could be nothing like this. But, on the face of it, it does seem to me like they were processing too much in the kernel code.
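To make the shape of that concrete, here’s a rough sketch of the kind of userspace pre-validation I have in mind; the header bytes and size limit are invented for illustration, not CrowdStrike’s actual format:

```python
# Hypothetical userspace validator for a definitions/channel file.
# The magic bytes and size limit are made up for illustration only.
from pathlib import Path

EXPECTED_MAGIC = b"CHNL"         # invented header
MAX_SIZE = 16 * 1024 * 1024      # arbitrary sanity limit

def validate_channel_file(path: Path) -> bytes:
    data = path.read_bytes()
    if not data or len(data) > MAX_SIZE:
        raise ValueError("file is empty or implausibly large")
    if data.count(0) == len(data):
        raise ValueError("file is all zeroes, refusing to load")
    if not data.startswith(EXPECTED_MAGIC):
        raise ValueError("bad header, refusing to load")
    return data  # only now hand the validated payload to the kernel driver
```

Worst case if a check like that fails: a rejected update and a log line, not a machine that can’t boot.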
There was a “hack” mentioned in another thread - you can block it via firewall and then selectively open it.
I’ve worked in various and sundry IT jobs for over 35 years. In every job, they paid a lot of lip service and performed a lot of box-checking around cybersecurity, disaster recovery, and business continuity.
But, as important as those things are, they are not profitable in the minds of a board of directors. Nor are they sexy to a sales and marketing team. They get taken for granted as “just getting done behind the scenes”.
Meanwhile, everyone’s real time, budget, energy, and attention is almost always focused on the next big release, or bug fixes in app code, and/or routine desktop support issues.
It’s a huge problem. Unfortunately it’s how the modern management “style” and late-stage capitalism operate. Make a fuss over these things, and you’re flagged as a problem, a human obstacle to be run over.
Yep - it’s a CIO/CTO/HR issue.
Those of us designing and managing systems yell till we’re blue in the face, and CIO just doesn’t listen.
HR is why they use crap like CrowdStrike. The funny thing is, by recording all this stuff, they become legally liable for it. So if an employee intimates they’re going to do something illegal, and the company misses it, but it’s in the database, they can be held liable in a civil case for not doing something to prevent it.
The huge companies I’ve worked at were smart enough to not back up any comms besides email. All messaging-system data was ephemeral.
by recording all this stuff, they become legally liable for it
That is a damned good point and kind of hilarious. Thanks for the meaningful input (and not just being another Internet Reply Guy like some others on here).
I’m currently working for a place that has had recent entanglements with the govt for serious misconduct that hurt consumers. They have multiple policies with language in them meant to reduce documentation that could get them in trouble again, but minimal attention is paid to the actual issues that got them in trouble.
They are more worried about having documented evidence of bad behavior than they are of it occurring.
I’m certain this is not unique to this company.
everyone’s real time, budget, energy, and attention is almost always focused on
the next big release, or bug fixes in app code, and/or routine desktop support issues
pointless meetings, unnecessary approval steps that could’ve been automated, and bureaucratic tasks that have nothing to do with your actual job.
FTFY.
Where you spend more time talking about what you’re going to do, than ever actually doing it.
Where when you ask for a mirror of production to test in, you’re told that Bob was working on that (Bob left 5 years ago).
particularly for companies entrusted with vast amounts of sensitive personal information.
I nodded along to most of your comment but this cast a discordant and jarring tone over it. Why particularly those companies? The CrowdStrike failure didn’t actually result in sensitive information being deleted or revealed, it just caused computers to shut down entirely. Throwing that in there as an area of particular concern seems clickbaity.
It was to elaborate that there is a bigger issue here with corporate IT culture that is broken. The CrowdStrike incident merely exposes it, but CrowdStrike isn’t the real problem. Remediation for an event like this, especially once the fix is known, should be 30 minutes… not weeks or months.
The OS should be mature enough by now that it could automatically recover from crashing on the load of a bad 3rd party driver. But it was not, wtf.
It can, sort of. Safe mode will still boot just fine. But then what should it do? Just blacklist the driver and reboot? That’s not going to work too well if it’s the storage driver.
Microsoft has been too busy building a new Outlook PWA with ads in your email, and AI laptops that capture screenshots of your desktop in unencrypted folders.
Is there a way to remotely boot into network activated recovery mode? Genuine question, I never looked into it.
For physical servers there are out-of-band management systems like Dell DRAC that allow you to manage the server even when the OS is broken or nonexistent.
For client machines there are systems like Intel vPro (AMT) and AMD’s DASH-based equivalent. I have not used either of them but they apparently work similarly to the systems used on servers.
An expensive KVM card, or a PiKVM for the home server.
At least for virtual servers, there has to be a cheaper software equivalent, as my cheap VPS allows this (via VNC) with no issues.
Virtual servers (as opposed to hardware workstations or servers) will usually have their “KVM” (Keyboard Video Mouse) built in to the hypervisor control plane. ESXi, Proxmox (KVM - Kernel Virtual Machine), XCP-ng/Citrix XenServer (Xen), Nutanix (KVM-like), and many others all provide access to this. It all comes down to what’s configured on the hypervisor OS.
VMs are easy because the video and control feeds are software constructs so you can just hook into what’s already there. Hardware (especially workstations) are harder because you don’t always have a chip on the motherboard that can tap that data. Servers usually have a dedicated co-computer soldered onto the motherboard to do this, but if there’s nothing nailed down to do it, your remote access is limited to what you can plug in. PiKVM is one such plug-in option.
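To illustrate the VM case (a sketch assuming a libvirt-based hypervisor; the connection URI and VM name are placeholders), the control plane can reset or power on a guest whose OS can’t boot at all:

```python
# Sketch: manage a guest from the hypervisor control plane even when the
# guest OS itself is unbootable. URI and domain name are placeholders.
import libvirt

conn = libvirt.open("qemu+ssh://hypervisor.example.com/system")
dom = conn.lookupByName("finance-ws-042")

state, _reason = dom.state()      # works no matter what the guest OS is doing
if state == libvirt.VIR_DOMAIN_RUNNING:
    dom.reset(0)                  # hard reset, no guest cooperation needed
else:
    dom.create()                  # power the VM back on

conn.close()
```

The same connection exposes the graphical console and snapshot rollback, which is why the virtual case is the easy one.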
Thank you for the explanation, I really appreciate it. Bystanders will probably too :)
Getting production servers back online with a low-level fix is pretty straightforward if you have your backup system taking regular snapshots of pet VMs. Just roll back a few hours. Properly managed cattle: just redeploy the OS and reconnect to data. For physical servers of either type you can restore a backup (potentially with IPMI integration so it happens automatically), but you might end up taking hours to restore all the data, limited by the bandwidth of your giant spinning-rust NAS that was cost-cut to sustain only a few parallel recoveries. Or you could spend a few hours with your server techs IPMI-booting into safe mode, or write a script that sends reboot commands to the IPMI until the host OS pings back.
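That last script is short enough to sketch. Placeholder addresses and credentials; a real version would want retry limits and logging:

```python
# Sketch: keep power-cycling hosts over IPMI until the OS answers a ping.
# OS/BMC addresses and credentials are placeholders.
import subprocess
import time

HOSTS = [("10.0.1.10", "10.0.0.10")]   # (OS address, BMC address) pairs

def pings(host: str) -> bool:
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        capture_output=True,
    ).returncode == 0

def power_cycle(bmc: str) -> None:
    subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc,
         "-U", "admin", "-P", "changeme",
         "chassis", "power", "cycle"],
        check=True,
    )

for os_addr, bmc_addr in HOSTS:
    while not pings(os_addr):
        power_cycle(bmc_addr)
        time.sleep(300)   # give the host a few minutes before checking again
```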
All that stuff can be added to your DR plan, and many companies now are probably planning for such an event. It’s like how the US CDC posted a plan about preparing for the zombie apocalypse to help people think about it; this was a fire drill for a widespread ransomware attack. And we as a world weren’t ready. There are options, but they often require humans to be helping things along when the problem is this widespread.
The stinger of this event is how many workstations were affected in parallel. First, good tools don’t really exist for firmware-level remote access that can execute power controls over the internet. You have options in an office building for workstations on site, and there are a handful of systems that can do this over existing networks, but most are highly hardware-vendor dependent.
But do you really want to leave PXE enabled on a workstation that will be brought home and rebooted outside of your physical/electronic perimeter? The last few years have shown us that WFH isn’t going away, and those endpoints that exist to roam the world need to be configured in a way that does not leave them easily vulnerable to a low-level OS replacement the other 99.99% of the time you aren’t getting crypto’d or receiving a bad kernel update.
Even if you place trust in your users and don’t use a firmware password, do you want an untrained user to be walked blindly over the phone to open the firmware settings, plug into their router’s Ethernet port, and add
https://winfix.companyname.com
as a custom network boot option without accidentally deleting the Windows bootloader? Plus, any system that does that type of check automatically at startup makes itself potentially vulnerable to a network-based attack by a threat actor on a low-security network (such as the network of an untrusted employee or a device that falls into the wrong hands). I’m not saying such a system is impossible, but it’s a super huge target for a threat actor to go after and it needs to be ironclad.
Given all of that, a lot of companies may instead opt to treat their workstations as cattle, and would simply re-image them if they were crypto’d. If all of your data is on the SMB server/OneDrive/Google/Nextcloud/Dropbox/SaaS whatever, and your users are following the rules, you can fix the problem by swapping a user’s laptop, just like the data problem from paragraph one. You just have a team-scale issue: your IT team doesn’t have enough members to handle every user having issues at once.
The reality is there are still going to be applications and use cases that may be critical that don’t support that methodology (as we collectively as IT slowly try to deprecate their use), and that is going to throw a Windows-sized monkey wrench into your DR plan. Do you force your users to use a VDI solution? Those are pretty dang powerful, but as a Parsec user who has operated their computer from several hundred miles away, you can feel when a responsive application isn’t responding quite fast enough. That VDI system could be recovered via paragraph 1 and just use Chromebooks (or equivalent) that can self-reimage if needed as the thin clients. But would you rather have annoyed users with a slightly less performant system 99.99% of the time, or plan for a widespread issue affecting all systems the other 0.01%? You’re probably already spending your energy upgrading from legacy apps to make your workstations more like cattle.
All I’m trying to get at here with this long-winded counterpoint is that this isn’t an easy problem to solve. I’d love to see the day that IT shops are valued enough to get the budget they need, informed by the local experts, and I won’t deny that “C-suite went to x and came back with a bad idea” exists. In the meantime, I think we’re all going to instead be working on ensuring our update policies have better controls on them.
As a closing thought - if you audited a vendor that has a product that could get a system back online into low level recovery after this, would you make a budget request for that product? Or does that create the next CrowdStruckOut event? Do you dual-OS your laptops? How far do you go down the rabbit hole of preparing for the low probability? This is what you have to think about - you have to solve enough problems to get your job done, and not everyone is in an industry regulated to have every problem required to be solved. So you solve what you can by order of probability.
I upvoted because you actually posted technical discussion and details that are accurate. PXE and remote power management is the way. Most workstation BIOSes will have IPMI-like functionality already included. I agree, though, that these being remote endpoints makes it more challenging. Still, having a script to reboot endpoints into a recovery environment would be a basic step in any DR scenario. Mounting the OS partition to delete a file and reboot wouldn’t be a significant endeavor, although it is one they’d need to make sure they got right. Even so, it would be hard to mess up for anyone with intermediate computer skills… and you’d hope these companies at least have someone trained to do that rather quickly. They’d have to spend more time writing up a CR explaining all the steps, and then joining a conference call with like 100 people with babies crying in the background… and managers insisting they remain on the call while they write the script.
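For what it’s worth, the published workaround really did come down to deleting one file pattern from the OS volume. A rough sketch of scripting that step from a recovery environment; the mount point is a placeholder and you’d verify the pattern against CrowdStrike’s official guidance first:

```python
# Sketch: remove the bad channel file from a Windows volume mounted in a
# recovery environment. The drive letter is a placeholder.
from pathlib import Path

CROWDSTRIKE_DIR = Path("D:/Windows/System32/drivers/CrowdStrike")

# CrowdStrike's documented workaround was deleting files matching this pattern.
for bad_file in CROWDSTRIKE_DIR.glob("C-00000291*.sys"):
    print(f"removing {bad_file}")
    bad_file.unlink()
```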
This doesn’t seem to be a problem with disaster recovery plans. It is perfectly reasonable for disaster recovery to take several hours, or even days. As far as DR goes, this was easy. It did not generally require rebuilding systems from backups.
In a sane world, no single party would even have the technical capability of causing a global disaster like this. But executives have been tripping over themselves for the past decade to outsource all their shit to centralized third parties so they can lay off expensive IT staff. They have no control over their infrastructure, their data, or, by extension, their business.
Issue is definitely corporate greed outsourcing issues to a mega monolith IT company.
Most IT departments are idiots now. Even 15 years ago, those were the smartest nerds in most buildings. They had to know how to do it all. Now it’s just installing the corporate overlord software and the bullshit spyware. When something goes wrong, you call the vendor’s support line. That’s not IT, you’ve just outsourced all your brains to a monolith that can go at any time.
None of my servers running windows went down. None of my infrastructure. None of the infrastructure I manage as side hustles.
Man, as someone who was cross-discipline at my former companies, the way people treat IT, and the way the company considers IT an afterthought, is just insane. The technical debt is piled high.
And you probably paid less to not have that happen as well!
I’ve seen the same thing. IT departments are less and less interested in building and maintaining in-house solutions.
I get why; it requires more time, effort, money, and experienced staff on the payroll.
But you gain more robust systems when it’s done well. Companies want to cut costs everywhere they can, and it’s cheaper to just pay an outside company to do XY&Z for you and hire an MSP to manage your web portals for it, or maybe keep 2-3 internal sysadmins who are expected to do all that plus level 1 help desk support.
Same thing has happened with end users. We spent so much time trying to make computers “friendly” to people, that we actually just made people computer illiterate.
I find myself in a strange place where I am having to help Boomers, older Gen-X, and Gen-Z with incredibly basic computer functions.
Things like:
- Changing their passwords when the policy requires it.
- Showing people where the Start menu is and how to search for programs there.
- How to pin a shortcut to their task bar.
- How to snap windows to half the screen.
- How to un-mute their volume.
- How to change their audio device in Teams or Zoom from their speakers to their headphones.
- How to log out of their account and log back in.
- How to move files between folders.
- How to download attachments from emails.
- How to attach files in an email.
- How to create and organize Browser shortcuts.
- How to open a hyperlink in a document.
- How to play an audio or video file in an email.
- How to expand a basic folder structure in a file tree.
- How to press buttons on their desk phone to hear voicemails.
It’s like only older Millennials and younger Gen-X seem to have a general understanding of basic computer usage.
Much of this stuff has been the same for literally 30+ years. The Start menu, folders, voicemail, email, hyperlinks, browser bookmarks, etc. The coat of paint changes every 5-7 years, but almost all the same principles are identical.
Can you imagine people not knowing how to put a car in drive, turn on the windshield wipers, or fill it with petrol, just because every 5-7 years the body style changes a little?