By now many people will have heard about the github service outage triggered by an accidental drop of their database.
A few people on Twitter started attacking the github people for making such a silly mistake, which is actually quite sad – from my perspective, it was handled professionally and openly.
It’s impossible to work in IT without making mistakes from time to time. I’ve done my share of stupid things, like deleting the wrong partition or accessing the wrong host, and I once rebooted the bastion server late one evening thinking it was my local workstation shell.
The point is, even professionals make mistakes. However, the difference between a professional and an amateur is how the mistake is handled, resolved and communicated.
An amateur will try to hide the issue, or panic and run around in chaos trying to recover. However, what we’ve seen from github has been:
- An honest mistake by an engineer who probably almost died of horror on realising what they’d done.
- Prompt diagnosis of the issue and restoration from a *working* backup system.
- Clearly organised and prepared staff with some form of disaster recovery plan.
- Open and honest communication with users about the issue.
- And most importantly – they detailed how they are going to prevent this from ever occurring again.
Nobody can fault them for this – things happen – SANs can die, a bug can cause incorrect drops, an admin can run the wrong command – hardware, software and people all suffer faults and mistakes from time to time.
We should be congratulating them on such a well-handled disaster recovery. If anything, seeing how they dealt with the issue makes me want to use github more.
There are also a few ideas floating around that I want to clarify:
- “Clearly they don’t have database replication, this would have stopped it” – No, it wouldn’t. If you replicate a drop query, it’s going to drop from the slaves as well. Even if replication is asynchronous, a statement can still propagate pretty bloody fast, which is really what you want – it’s best to have the slaves as up to date as possible (see the sketch after this list).
- “A three hour outage is totally unacceptable for a site like github” – Restoring a database that size is no small task. Pulling a backup off media and importing it back into the DB may take only seconds for your 5MB blog, but it can take days for a huge multi-GB site once you factor in transfer times and index rebuilds. Three hours is pretty bloody good.
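To make the replication point concrete, here’s a toy sketch of statement-based replication in Python. It’s purely illustrative – the classes, table names and statements are made up for this post and have nothing to do with github’s actual stack – but it shows why slaves that replay the primary’s statements offer no protection against a destructive one:

```python
# Toy model of statement-based replication. Everything here is made up
# for illustration; it is not github's real setup.

class Node:
    def __init__(self):
        self.tables = {"repositories": ["rails", "git"]}

    def execute(self, statement):
        op, table = statement
        if op == "DROP":
            # A destructive statement applies just like any other.
            self.tables.pop(table, None)

primary = Node()
slaves = [Node(), Node()]
binlog = []  # statements shipped to the slaves, in order

def run(statement):
    primary.execute(statement)
    binlog.append(statement)      # every statement is logged...
    for slave in slaves:
        slave.execute(statement)  # ...and replayed verbatim on each slave

# One accidental DROP, and the data is gone everywhere, async or not:
run(("DROP", "repositories"))
print(primary.tables)              # {}
print([s.tables for s in slaves])  # [{}, {}]
```

Replication protects you from hardware failure, not from a bad statement – that’s what backups (or a deliberately delayed slave) are for, and a working backup is exactly what github restored from.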