- cross-posted to:
- world@lemmy.world
Fault in CrowdStrike caused airports, businesses and healthcare services to languish in ‘largest outage in history’
Services began to come back online on Friday evening after an IT failure that wreaked havoc worldwide. But full recovery could take weeks, experts have said, after airports, healthcare services and businesses were hit by the “largest outage in history”.
Flights and hospital appointments were cancelled, payroll systems seized up and TV channels went off air after a botched software upgrade hit Microsoft’s Windows operating system.
It came from the US cybersecurity company CrowdStrike, and left workers facing a “blue screen of death” as their computers failed to start. Experts said every affected PC may have to be fixed manually, but as of Friday night some services started to recover.
As recovery continues, experts say the outage underscored concerns that many organizations are not well prepared to implement contingency plans when a single point of failure such as an IT system, or a piece of software within it, goes down. But these outages will happen again, experts say, until more contingencies are built into networks and organizations introduce better back-ups.
The issue, in this case, is more about how widely CrowdStrike is used than about Microsoft. The update that crippled everything was to the CrowdStrike Falcon Sensor software, not to the OS.
Funnily enough, they had a similar issue with an update to the Linux version of the software a few months ago, which didn’t have these broad-reaching consequences, largely because of the smaller Linux user base. That means this is starting to look like a pattern, and CrowdStrike is going to need some serious process changes to prevent things like this in the future.
Anybody’s guess if those changes happen or not.
The real surprise to me is not the software, company or OS issues, but rather that so many companies just blindly push untested updates to their prod environments. This was, and will continue to be, a risk with anything they trust so implicitly. Feels like the security folks just totally failed Dev 101.
I know that in at least some of the BSOD cases it was an automatic update that couldn’t be delayed. An acquaintance of mine told me that they have previously complained to their IT support about auto-updates causing disruption at inopportune times, but IT said it’s out of their hands for security updates because of regulatory requirements.
Or the security folks are doing the best they can with a shoestring budget.