By now many people will have heard about the GitHub service outage triggered by an accidental drop of their database.
(read their blog post on the issue here)
A few people on Twitter started attacking the GitHub team for making such a silly mistake, which is actually quite sad – from my perspective, the incident was handled professionally and openly.
It’s impossible to work in IT without making some mistakes from time to time – I’ve done stupid things myself, like deleting the wrong partition or logging into the wrong host. I once even rebooted the bastion server late one evening thinking it was my local workstation shell.
The point is, even professionals make mistakes. The difference between a professional and an amateur, however, is how the mistake is handled, resolved and communicated.
An amateur will try to hide the issue, or panic and run around in chaos trying to recover. What we’ve seen from GitHub, however, has been:
- An honest mistake happens to an engineer, who probably almost died from horror upon realising what they’d done.
- Prompt determination of issue and restoration from a *working* backup system.
- Clearly organised and prepared staff with some form of disaster recovery plan.
- Open and honest communication with users about the issue.
- And most importantly – they detailed how they are going to prevent this from ever occurring again.
Nobody can fault them for this – things happen – SANs can die, a bug can cause incorrect drops, an admin can run the wrong command. Hardware, software and people all suffer faults and mistakes from time to time.
We should be congratulating them on such a well-handled disaster recovery; if anything, seeing their handling of the issue would make me want to use GitHub more.
There’s also a few ideas floating around I want to clarify:
- “Clearly they don’t have database replication, this would have stopped it” – No, it wouldn’t. If you replicate a DROP statement, it’s going to drop from the slaves as well. Even if replication is asynchronous, a statement can still propagate bloody fast – which is exactly what you want, since it’s best to keep the slaves as up to date as possible.
- “A three hour outage is totally unacceptable for a site like GitHub” – A database the size of theirs isn’t a small task. Restoring off backup media and importing back into the DB may take only seconds for your 5MB blog, but it could take days for a huge multi-gigabyte site. Three hours is pretty bloody good.
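The replication point above can be illustrated with a toy statement-based “replica”: the primary logs every statement it applies, and the replica replays that log verbatim – including the destructive one. (This is just an illustrative sketch using files and `rm` in place of tables and DROP; all the names here are made up.)

```shell
# Toy statement-based replication: the primary logs every statement,
# and the replica replays the log verbatim -- destructive ones included.
mkdir -p primary replica
: > statement.log

apply() {                         # run a statement on the primary, then log it
  ( cd primary && eval "$1" )
  echo "$1" >> statement.log
}

apply 'echo "rev1" > code.txt'    # a normal write replicates fine...
apply 'rm code.txt'               # ...but so does the accidental delete

# "Asynchronous" replication: replay the log on the replica moments later
( cd replica && while IFS= read -r stmt; do eval "$stmt"; done < ../statement.log )

ls primary replica                # both are now empty: replication is not a backup
```

The replica faithfully reproduces the delete, which is why only a point-in-time backup – not a slave – lets you recover from a bad DROP.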
The other thing to note is that if you rely solely on another business for your business (whether that be actual ‘business’ or just a hobby), then you’re just as guilty as they are. Always have a backup.
The neat thing about git is that your local workstation is a backup itself… so why are people complaining?
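That point can be sketched in shell: because every git clone carries full history, a local checkout is enough to rebuild a lost remote from scratch. Here a bare repository stands in for the hosted service; the paths and names are made up for illustration.

```shell
# A git clone carries full history, so any local checkout doubles as a
# backup of the hosted repository. Simulate losing the remote and
# restoring it entirely from the local copy.
git init -q --bare hosted.git                 # stands in for the hosted service
git init -q work
cd work
git -c user.name=dev -c user.email=dev@example.com \
    commit -q --allow-empty -m "first commit"
git remote add origin ../hosted.git
git push -q origin HEAD                       # publish to the "service"
cd ..

rm -rf hosted.git                             # the service loses its copy...
git init -q --bare hosted.git
git -C work push -q origin HEAD               # ...and the local clone restores it
git -C work ls-remote origin                  # the history is back
```

This is the distributed model working as designed – though, as the comments below note, it doesn’t cover the metadata (issues, wikis) that live only on the service.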
Agreed. When I saw their blog post I almost couldn’t believe what I was reading. It was so transparent and to the point, no waffle trying to hide their mistake or pass the blame. Was very impressed. Just feel sorry for the guy who caused it… Regardless, still full confidence in those guys and respect++
Rolls-Royce handled the exploding engine in much the same way. They investigated, discovered the source of the problem (a minor flaw in a rather small part of the engine that allowed oil to leak) and fessed up. The engine still exploded, and no one is just waving it off saying mistakes happen or calling the critics amateurs.
Apparently only having an ambulance at the bottom of the cliff is fine for IT and it’s amateurish to demand better.
Sure, I could’ve been clearer that it was your reply that pissed me off, rather than the hilarity of some site that advertises 100% uptime dropping its live database.
There’s a bit of a difference between a database dropping for a couple of hours vs an engine exploding on an aircraft and killing hundreds of people.
And there is a big difference with the Rolls-Royce situation: there are already claims that they knew about the fault but hadn’t fixed their existing engines, so I don’t believe it’s a fair comparison.
Oh BTW, I wasn’t calling you an amateur, I’m not sure where you’re getting that from Simon.
However, your rant last night about how such a failure was completely unacceptable is just silly. I’d like to talk to you after you’ve had 10 years of professional experience in IT and take a look at your flawless record.
Mistakes will happen; it’s not ideal, but it’s a fact of life. The most important part is how you respond to those mistakes.
It’s pretty amazing the free pass that GitHub appears to be getting from the development community.
I say “appears” because, while there are a lot of sympathetic blog posts like this floating around right now, a lot of development managers are taking a hard look at their GitHub subscriptions and wondering if outsourcing their single most critical service to a third party was such a brilliant idea.
@Jack’s comment that “..your local workstation is a backup itself… so why are people complaining?” is classic. Don’t complain because we benefit from the distributed nature of git itself!? I’ll give Linus that particular credit, not the company that has built and sells a whole lot of services around git. Get it right: people are complaining because a critical service that they paid for went offline due to easily avoidable mistakes.
More than a business, GitHub are custodians of the primary intellectual property output of many thousands of developers. Our code is us, and that is an enormous amount of trust for a developer to place in a company. In this industry, trust is everything.
GitHub of all people should have known better – which in my book is the very definition of amateurism. They’ve been open and honest about the amateur mistakes that were made, but amateur mistakes they remain. And it’s far too soon to tell if it will be handled professionally – that test only comes when they are challenged to do better, rather than being gifted a get out of jail free card.
Consider that a less professional company might have lied to try and save face, e.g. claiming there was a SAN or hardware fault, a server fire or some other excuse to avoid responsibility.
If the fault happened a second time, I’d be right up there with you saying it’s unacceptable, since a company should learn from its mistakes, not repeat them.
Mistakes like that remind me of two packages that I now install on all of my servers:
molly-guard: prevents accidental reboots of the wrong server by wrapping the reboot and shutdown commands and asking you to confirm (i.e. type) the name of the host you want to reboot.
safe-rm: prevents accidental deletion of important files by wrapping the rm command and checking against a blacklist of files/directories that should never be deleted.
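The idea behind molly-guard can be sketched as a small shell wrapper – an illustrative sketch only, not the real package: before running a destructive command, demand that the operator type the name of the host they are actually logged into.

```shell
# Minimal molly-guard-style wrapper (illustrative sketch, not the real
# package): refuse to run a destructive command unless the operator types
# the name of the host they are logged into.
confirm_host() {
  cmd="$1"
  host="$(hostname)"
  printf 'About to run "%s" on %s. Type the hostname to continue: ' "$cmd" "$host"
  read -r answer
  if [ "$answer" != "$host" ]; then
    echo "Aborted: hostname mismatch." >&2
    return 1
  fi
  echo "Confirmed -- would now run: $cmd"   # a real guard would exec "$cmd"
}

echo "some-other-box" | confirm_host reboot || true   # refused: wrong name typed
hostname | confirm_host reboot                        # accepted: correct name typed
```

Typing the hostname forces a moment of conscious verification – exactly the check that would have saved my late-night bastion reboot.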