21-07-2024, 09:05 AM
(21-07-2024, 08:37 AM)Radio Fixer Wrote: Sorry, my text wasn't clear.
Please replace "fix" with "upgrade".
Why put this out worldwide without trying it small scale first?
It would have been tested prior to being made public. Screwups like this one mostly happen when humans are involved in the process. The Facebook BGP router configuration debacle is another example - there are many others, some not widely known.
Most modern development cycles use a continuous integration (CI) methodology: every time any change is made to a development environment (a code change, a configuration file change, etc.), a build is started automatically. If the build succeeds, a QA/regression test suite is run, and if THAT passes, the build may be accepted for alpha testing with a tame group of users, then beta testing, and finally general release. Whenever a new failure mode is identified, a check for it is added to the test suite. What you're aiming for here is "coverage" - the automated and other tests should test EVERYTHING, i.e. 100% coverage. That's an unreachable ideal, though - all you can do is get as close to 100% as possible and treat the test suite as a live project in itself. Many organisations have a completely separate QA team create the test suites, as developers are often considered "too close" to the code and may be subject to unconscious bias (assumptions) when writing tests. If anything fails at any point, the cycle restarts.
It may be that this issue snuck through because it exposed a flaw not seen before, so the test suite didn't catch it. Over the years, teams I've run have mostly used the Atlassian tool suite to manage most of this - particularly Jira, Bamboo, Crucible & Confluence - plus other tools, e.g. SmartBear TestComplete, which integrate with the Atlassian tools.
Many vendors with large user bases have release "channels". The "stable" channel is what 99.9% of users and production systems are on, and they only get updates once everybody upstream (CI, alpha, beta, etc. - see below) is happy. There are sometimes "feature" channels which, if you are on them, give you early access (at your own risk) to new functionality. Then there are "alpha" and "beta" channels, to which you generally have to apply to the vendor to be accepted (they only want tame, qualified folk with the right infrastructure on those channels, and they will have to sign NDAs etc.). Note that there are no set names for these channels - each vendor might use different ones - but you get the idea.
This whole process is called "release management" and is where CrowdStrike failed so spectacularly.
Many end-user organisations will have all their production infrastructure on the "stable" channel but will have test environments on "feature", "alpha" or "beta" etc. - or even on "stable" - to do their own testing before "stable" is rolled out to the production systems. The more testing you do as an end-user, the more people and infrastructure you need, and the more expensive your IT costs become. I've been lucky in working in the financial world in various countries - money for this stuff is generally not an issue, as the consequences of downtime are generally more costly. There is a law of diminishing returns, though, so there have to be constraints on testing: time, money, resources, business pressures etc.
There's a fundamental tenet for most organisations: "If it ain't broke, don't fix it", so many software updates may be skipped if the release notes don't indicate something in the release that is of benefit to the organisation, e.g. a fix to a bug they've seen, or introducing a new feature that is needed.
The problem with cybersecurity is that the environment changes very, very quickly - hourly or faster - so to keep on top of new threats, updates to production environments have to be issued VERY frequently, often daily or even intra-day. This makes pre-stable-release testing an interesting challenge, as the time window to get stuff 100% right is small, and with that sort of pressure, mistakes can happen.
EDIT: What I suspect happened is that a fairly tame change to a configuration file (which is all that was in the update that caused the issue) exposed a coding flaw in the kernel-level driver that CrowdStrike use, which is loaded early in the Windows boot process (in order to catch nasty viruses etc. - generally, the earlier you can load, the better).
The flaw itself may have been there a long time and is probably something like a buffer overflow causing a stack corruption, a memory pointer being corrupted, etc. I admit this is supposition, but a configuration file change that introduced a longer-than-previously-used string (or similar), and which in itself may seem harmless, could easily expose this type of flaw. It's a coding error - a bug - which was not caught in testing, and which was not, or could not be, caught and handled safely by the driver in question; hence the BSOD and reboot.
ars longa, vita brevis
nick