Yesterday the internet broke - here's what happened.

As a number of our customers will have noticed, Timetastic was "down" for an hour or more on Monday, 24th June 2019.

Without going into the full technical detail, the essence of it is that an internet service provider in Pennsylvania used a piece of software which incorrectly advertised to the entire internet that a stack of efficient routes was available to deliver traffic. That was untrue.

What happened was akin to routing the entire M6's traffic through a tiny village. It didn't work out too well.

What happened?

Cloudflare, one of the world's largest cloud network platforms, was a victim of what's known as a BGP Route Leak. Timetastic was not alone in suffering: thousands of other websites and applications were affected too, including Amazon, Linode and Discord.

It's not Cloudflare's fault in any way. They, like us, were victims of bad configuration elsewhere on the internet. They did exceptional work to get the issue fixed in a timely manner and we hope their advice is heeded by the parties responsible.

Our monitoring identified the issue within seconds of it starting and we were following Cloudflare's status updates closely.  We estimate up to 50% of Timetastic's users may have been affected at some point.

Cloudflare themselves have written an in-depth technical explanation of what happened: https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/

A timeline of events is available here: https://www.cloudflarestatus.com/incidents/46z55mdhg0t5
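
For the curious, here's a rough sketch of why a leak like this pulls traffic the wrong way. It's a simplified Python illustration, not real BGP: it assumes the leak took the form of more specific routes (which is how Cloudflare's write-up describes it), and the addresses and next-hop names are made up for the example. Routers prefer the most specific route that matches a destination, so a bogus, narrower route beats the legitimate one.

```python
import ipaddress


def best_route(routes, destination):
    """Pick the route a router would use for `destination`.

    `routes` is a list of (prefix, next_hop) pairs. The most specific
    matching prefix wins (longest prefix match), regardless of how good
    the path behind it really is.
    """
    addr = ipaddress.ip_address(destination)
    parsed = [(ipaddress.ip_network(prefix), hop) for prefix, hop in routes]
    matches = [(net, hop) for net, hop in parsed if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)


# Hypothetical routing tables: a documentation-only address range and
# invented next-hop names, purely for illustration.
legit = [("198.51.100.0/24", "cdn-direct")]
leaked = legit + [
    ("198.51.100.0/25", "small-isp"),
    ("198.51.100.128/25", "small-isp"),
]

print(best_route(legit, "198.51.100.10")[1])
print(best_route(leaked, "198.51.100.10")[1])
```

With only the legitimate route, traffic for 198.51.100.10 goes to cdn-direct. Once the two narrower routes appear, the same lookup picks small-isp, even though that path can't cope with the load.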

What could we have done?

We use Cloudflare as part of our security setup, to protect your data and our infrastructure - so the Cloudflare problem became OUR problem.

Some "advice" circulating on Twitter suggested turning Cloudflare's protection off during the outage. We're fairly certain this would have brought Timetastic back online as soon as we'd done so, but slackening off security is not something to be taken lightly.

So we didn't turn Cloudflare off. We chose to keep the setup as it was until the underlying issues were resolved upstream. Here's why:

  • Turning off Cloudflare exposes our Azure servers, making us more vulnerable to hackers attacking us. This was our main reason for choosing Cloudflare in the first place.
  • Cloudflare's Web Application Firewall blocks thousands of direct threats a month to Timetastic. We have other mitigations in place and Cloudflare is by no means our only security measure, but it's our first line of defence.
  • We rely on Cloudflare's load balancing to keep Timetastic up and running for everyone - if we disable the load balancer we are much more vulnerable to network issues like this one.  
  • Caching: Cloudflare caches a significant amount of Timetastic data (over 300GB a month) and although we can cope without that extra layer, it'd make the app significantly slower for everyone who wasn't affected.

We hold our own security standards and our users' trust in very high regard, and there's absolutely no way we were going to expose our origin servers unless we had a very good reason. A partial outage is not a good enough reason, as it'd effectively be holding an "attack us" flag up to the whole internet. 🙌

What are you doing to prevent this?

We're exploring options here and it's too early to say if we'll definitely make any big architectural changes. The team are discussing whether we can implement a redundant set of technologies under Azure to take over from Cloudflare if the need ever arises. Our initial assessment is this might be a lot of extra complexity - and potential points of failure - for a rare occurrence but we are discussing it.  If we do make any significant changes, we'll share them here or on Twitter.

Our monitoring and logging responded within seconds to tell us there was potentially an outage, but it could do a better job of showing where in the chain of services the outage sits. We'll be improving this so we can pinpoint the cause of an issue faster.
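
To give a flavour of what "pinpointing" could look like, here's a minimal sketch in Python. It is not our actual monitoring: the idea of probing the public address (through Cloudflare) and the origin separately is an assumption for illustration, and any URLs you'd pass in are hypothetical. Comparing the two results hints at which link in the chain is failing.

```python
from urllib.request import urlopen


def probe(url, timeout=5.0):
    """Return True if `url` answers successfully within `timeout` seconds."""
    try:
        # urlopen raises on connection errors, timeouts and 4xx/5xx replies.
        with urlopen(url, timeout=timeout):
            return True
    except Exception:
        # Any failure counts as "down" in this simplified sketch.
        return False


def classify(edge_ok, origin_ok):
    """Guess where a fault sits from two probe results.

    edge_ok:   the public URL, reached through the CDN, responded
    origin_ok: the origin servers, reached directly, responded
    """
    if edge_ok and origin_ok:
        return "healthy"
    if origin_ok:
        return "edge or upstream network"
    if edge_ok:
        return "origin degraded, edge still answering"
    return "origin or hosting"


# Hypothetical usage (these URLs are placeholders, not real endpoints):
# status = classify(probe("https://app.example.com/health"),
#                   probe("https://origin.example.internal/health"))
```

A single vantage point wouldn't have been enough for an incident like this one, where only some users were affected, so in practice the same probes would need to run from several locations.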

We don't like Timetastic being unavailable and we've taken many steps to ensure availability and speed wherever you are and whenever you need to book time off. In this instance we were unfortunate victims of a wider internet outage and we're sorry if this caused you issues.

Photo by Joey Kyber on Unsplash