Let's blame the dev who pressed "Deploy" - by Dmitry Kudryavtsev

Mac@programming.dev · edited 4 months ago

over_clox@lemmy.world · edited 4 months ago

If you were a developer that knew you were responsible for developing ring zero code, massively deployed across corporate systems across the world, then you should goddamned properly test the update before deploying it.

This isn’t a simple glitch like a calculation rounding error or some shit. The programmers of any ring zero code should be held fully responsible for not properly reviewing and testing the code before deploying an update.

Edit: Why not just ask Dave Plummer, former Windows developer…

https://youtube.com/watch?v=wAzEJxOo1ts

Aceticon@lemmy.world · edited 4 months ago

If your system depends on a human never making a mistake, your system is shit.

It’s not by chance that accountants, for example, have always had something they call reconciliation, where the transaction data entered from invoices and the like gets cross-checked against something produced differently, for example bank account transactions. Their system is designed with the expectation that humans make mistakes, hence there’s a cross-check process to catch those.

Clearly CrowdStrike did not have a secondary part of the process designed to validate what’s produced by the primary (in software development that would usually be Integration Testing), so their process was shit.

Blaming the human that made a mistake for essentially being human and hence making mistakes, rather than the process around him or her not having been designed to catch human failure and stop it from having nasty consequences, is the kind of simplistic ignorant “logic” that only somebody who has never worked in making anything that has to be reliable could have.

My bet, from decades of working in the industry, is that some higher up in CrowdStrike didn’t want to pay for the manpower needed for the secondary process checking the primary one before pushing stuff out to production because “it’s never needed” and then the one time it was needed, it wasn’t there, things really blew up massively, and here we are today.
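
To make the idea of a secondary check concrete, here is a minimal sketch of a release gate that inspects a candidate definition file independently of whatever produced it. Everything in it is an assumption for illustration: the `MAGIC` header, the function names, and the specific checks are hypothetical, and nothing here reflects CrowdStrike's actual file format or pipeline.

```python
# Hypothetical magic prefix; the real file format is not described in this thread.
MAGIC = b"DEFS"


def find_problems(data: bytes) -> list[str]:
    """Return a list of reasons this candidate file should not ship."""
    problems = []
    if not data:
        problems.append("file is empty")
    elif data.count(0) == len(data):
        # Every byte is zero, i.e. a blank file full of nulls.
        problems.append("file contains only null bytes")
    elif not data.startswith(MAGIC):
        problems.append("missing expected header")
    return problems


def release_gate(data: bytes) -> bool:
    """Allow deployment only if the independent check finds nothing wrong."""
    return not find_problems(data)
```

The point of the sketch is only that the gate runs separately from the step that builds the file, so a mistake in the primary step has a second chance of being caught before it reaches production.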

over_clox@lemmy.world · 4 months ago

Indeed, I fully agree. They obviously neglected testing before deployment. So you can split the blame between the developer that goofed on the null pointer dereferencing and the blank null file, and the higher ups that apparently decided that proper testing before deployment wasn’t necessary.

Ultimately, it still boils down to human error.

Eager Eagle@lemmy.world · 4 months ago

Finding people to blame is, more often than not, useless.

Systematic changes to the process might prevent it from happening again.

Replacing “guilty” people with other fallible humans won’t do it.

over_clox@lemmy.world · 4 months ago

Still, with billions of dollars in losses across the globe and all the various impacts it’s having on people’s lives, is nobody gonna be held accountable? Will they just end up charging CrowdStrike as a whole a measly little fine compared to the massive losses the event caused?

One of their developers goofed up pretty bad, but in a fairly simple and forgivable way. The real blame should go on the higher ups that decided that full proper testing wasn’t necessary before deployment.

So yes, they really need to review their policies and procedures before pressing that deploy button.

Eager Eagle@lemmy.world · edited 4 months ago

> is nobody gonna be held accountable?

Likely someone will, but legal battles between companies are more about who has more money and leverage than actual accountability, so I don’t see them as particularly useful for preventing incidents or for society.

The only good thing that might come out of this that is external to CrowdStrike is regulation.

lad@programming.dev · 4 months ago

> with billions of dollars in losses

But the real question we should be asking ourselves is “how much did the higher ups save over the years by going without proper testing?”

That is probably what they are concerned about, and I really wish I knew the answer to this question.

I think this is absolutely not the way to do business, but maybe that’s because I don’t have one ¯\_(ツ)_/¯

v9CYKjLeia10dZpz88iU@programming.dev · edited 4 months ago

deleted by creator

Aceticon@lemmy.world · edited 4 months ago

Making a mistake once in a while on something one does all the time is to be expected - even somebody with a 0.1% rate of mistakes will fuck up once in a while if they do something with high enough frequency, especially if they’re too time constrained to validate.

Making a mistake on something you do just once, such as setting up the process for pushing virus definition files to millions of computers in such a way that they’re not checked in-house before they go into Production, is a 100% rate of mistakes.

A rate of mistakes of 0.1% is generally not incompetence (depends on how simple the process is and how much you’re paying for that person’s work), whilst a rate of 100% definitely is.

The point being that those designing processes, who have lots of time to do it, check it and cross check it, and who generally only do it once per place they work (maybe twice), really have no excuse to fail the one thing they had to do with all the time in the World, whilst those who do the same thing again and again under strict time constraints definitely have a valid excuse to once in a blue moon make a mistake.
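
The arithmetic behind the 0.1% argument can be sketched in a few lines. This assumes each repetition is independent and has the same mistake rate, which is a simplification; the function name is made up for illustration.

```python
def chance_of_at_least_one_mistake(rate: float, repetitions: int) -> float:
    """Probability of one or more mistakes across independent repetitions."""
    # The chance of getting every repetition right is (1 - rate) ** repetitions,
    # so the chance of at least one mistake is the complement of that.
    return 1.0 - (1.0 - rate) ** repetitions


if __name__ == "__main__":
    for n in (1, 100, 1000, 5000):
        p = chance_of_at_least_one_mistake(0.001, n)
        print(f"{n:>5} repetitions at 0.1%: {p:.1%} chance of at least one mistake")
```

Under those assumptions, someone with a 0.1% rate who repeats a task 1,000 times has roughly a 63% chance of slipping at least once, which is why the surrounding process has to expect it.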

sping@lemmy.sdf.org · edited 4 months ago

deleted by creator

over_clox@lemmy.world · 4 months ago

Watch the video from Dave Plummer that I linked as an edit; he explains it rather well. The driver was signed. It was the rolling update definition files from CrowdStrike that were unsigned.