What Happened and What We Did About It
On January 1st, 2016, our hosting provider, Linode, was hit with a massive distributed denial-of-service (DDoS) attack across all their datacenters. Halfway through the day we determined that Linode’s network could not provide ShowMojo with sufficient uptime for the duration of the attack, especially when that duration was (and potentially still is) an unknown.
We initiated our full disaster recovery process. Through the night hours of January 1st we rebuilt the ShowMojo infrastructure on a new hosting provider. We transferred a complete copy of all customer data. The ShowMojo service was fully operational early in the day on January 2nd.
As for Linode, their Atlanta datacenter was under siege well into the day on January 3rd. Linode’s Atlanta datacenter is where our infrastructure had been primarily located, and it was “ground zero” for much of the attack. Anyone interested in the gory details can visit Linode’s status page.
What Took So Long?
This was an extraordinary event. If Linode’s Atlanta datacenter had been wormholed to an alien planet we would have resolved everything faster. The loss would have been obvious and the required next steps clear.
Until this weekend, no one had contemplated the real possibility of a sustained and successful DDoS attack against a network of Linode’s scale. The initial right call was to let Linode sort it out. But that meant a delay in the next steps to mitigate the issue.
We did have ShowMojo-specific DDoS defenses in place, but they were of little use in this incident. No one was ever shooting at ShowMojo. It was the data onslaught between the attackers and Linode that cut us off from our customers. The one value our DDoS defenses did provide was a clean and instantaneous switch to our new platform (instead of a slow, error-riddled DNS record migration that would have dragged out for an additional two days).
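To make that concrete for the technically inclined: a plain DNS cutover is slow because resolvers across the Internet keep serving the old record until its TTL expires (and some hold on even longer), while a defense layer sitting in front of us keeps one stable public endpoint and simply repoints its backend. The sketch below is a hypothetical illustration in Python; the TTL value and hostname are made up and are not our actual configuration.

    # Hypothetical numbers only -- illustrating why a raw DNS cutover drags on.
    old_ttl_seconds = 24 * 60 * 60          # a common default TTL of 24 hours
    stragglers_ignoring_ttl_days = 1.0      # some resolvers cache past the TTL anyway

    worst_case_dns_cutover_days = old_ttl_seconds / 86400 + stragglers_ignoring_ttl_days
    print(f"DNS-only cutover: up to ~{worst_case_dns_cutover_days:.0f} days of stale traffic")

    # Behind a mitigation/proxy layer the public endpoint never changes; only the
    # proxy's origin (backend) address is updated, so the switch takes effect at once.
    new_origin = "origin.new-provider.example"   # hypothetical backend hostname
    print(f"Proxy cutover: repoint origin to {new_origin} -> effective immediately")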
What About Customer Data?
Customer data was never in danger. This was no breach, intrusion, or attempt at theft. This was an attack and an attempt to do damage by blockading Linode’s datacenters from the rest of the Internet.
We have three layers of data backups in place. We leveraged our primary backup layer – a “hot” real-time copy of the database – in our transition to the new hosting provider.
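For the technically inclined, a “hot” copy is a replica that continuously applies the primary database’s changes and whose lag is monitored around the clock. We’re not detailing our actual stack here; the sketch below assumes a PostgreSQL streaming replica purely for illustration, and the hostname and credentials are hypothetical.

    # Hypothetical lag check against a hot standby; assumes PostgreSQL streaming
    # replication for illustration only.
    import psycopg2

    conn = psycopg2.connect(host="replica.internal.example", dbname="app", user="monitor")
    with conn, conn.cursor() as cur:
        # On a standby, this reports how far the replica trails the primary.
        cur.execute("SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;")
        lag = cur.fetchone()[0]
        print(f"hot copy is {lag} behind the primary")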
Were We Really Prepared?
Yes, though one can never be prepared enough.
The attack was deliberately perpetrated on a holiday when everyone’s staffing was low. We had an emergency staffing plan in place, so this was no issue on our end.
We have a three-tier disaster recovery strategy. The simplest scenario contemplated a major failure of our primary database server (the only component without a real-time failover). That simplest scenario could be resolved within one hour. The moderate scenario contemplated the loss of the Atlanta datacenter (via wormhole, comet, hurricane, flood, etc.). That moderate scenario would have been resolved within two to four hours. The severe scenario contemplated the loss of Linode’s entire network. That’s what happened here, and the projected resolution time was one to two days.
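Boiled down to its essentials, the plan looks like this (the sketch is purely illustrative; the scenario labels are informal and the recovery targets are the ones just quoted):

    # The three disaster recovery tiers described above, restated as data.
    RECOVERY_TIERS = {
        "simple":   ("failure of the primary database server", "within 1 hour"),
        "moderate": ("loss of the Atlanta datacenter",          "2 to 4 hours"),
        "severe":   ("loss of Linode's entire network",         "1 to 2 days"),
    }

    for name, (trigger, target) in RECOVERY_TIERS.items():
        print(f"{name:>8}: {trigger} -> target recovery {target}")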
Perhaps we weren’t creative enough in considering all the ways Linode’s entire network might be lost to us. Linode has a dozen datacenters in the US and across the globe. We deemed loss of access to all those locations as very low likelihood (a 500-year flood, so to speak). In such a scenario (ultimately, the scenario that just occurred) we were concerned with the certainty of recovery, not the lowest possible downtime.
What About Next Time?
Let’s be clear. This incident arrived from a trajectory that almost no one would have seriously contemplated. The next incident of significance is hopefully a long way off, and it will likely ride in on some other unknown and unknowable riddle to be unravelled at that time.
One significant difference between this time and next time is that we now have a second complete and functional infrastructure at hand. A redundant infrastructure was always on our roadmap. We just got there faster than planned. Please understand, though, that a redundant infrastructure is not a silver bullet. A redundant infrastructure is expensive to maintain and susceptible to a number of foibles. Nonetheless, it will certainly reduce the likelihood and duration of any future incident of this scale.
And What About Linode?
Our intent is to continue to use Linode for one of our two infrastructures. Based on everything we know, they were the victim of a highly orchestrated and vicious act. They deserve support from their customers. Assuming Linode continues to make the right moves now and in the future, we will stand with them.