Timetastic downtime - 20th Feb

On the 20th February 2018, at about 20:10, Timetastic had a problem and was out of action until just after 22:00. We're very sorry about this. It's the last thing we want for our customers.

This blog post goes into detail about what happened, and what we are going to do to help reduce the likelihood of it happening again.

What happened?

We host Timetastic in Microsoft's Azure cloud. At 20:12, they started experiencing problems in their "UK South" region, which is what Timetastic uses. This took our database out of action. The following is taken from the Azure status page:

Summary of impact: Between 20:12 and 21:50 UTC on 20 Feb 2018, a subset of customers in UK South may have experienced difficulties connecting to resources hosted in this region. Impacted services included Storage, Virtual Machines, Azure Search and Backup. Some Virtual Machines may have experienced unexpected reboots.

Preliminary root cause: Engineers continue to investigate a potential power event that occurred in the region, impacting a single storage scale unit.

Mitigation: The impacted storage scale unit automatically recovered.

Next steps: Engineers will continue to investigate to establish the detailed root cause, and the full root cause analysis report will be posted on this Status History page and in the Azure Service Health blade of customers' management portals once completed.

Once it was clear that the problem wasn't going to resolve quickly, we switched the Timetastic database over to our secondary (failover) database. As soon as this operation was complete, we were able to bring everything back online again.

Mitigation strategy / lessons learned

We were able to successfully switch (fail over) to our database backup hosted in another region. However, it took a while a) to identify that the failover was going to be required, and b) for the failover to complete.

We're working with Microsoft to determine a more efficient solution to this - so that in the event that Azure has a similar issue, we can fail over in a much smaller time window - ideally with no downtime at all!

Once again, we're sorry to anyone who was affected.

Rate limiting the Timetastic API

Usage of the API has increased recently, so to keep things running nice and smooth for everyone we need to introduce a rate limit.

What's the limit?

Quite simple - 5 requests per second per customer API key. Our logs show that most consumers should be fine with this, but if you've written code that fires multiple API calls at the same time, you may want to change it to work sequentially, so the requests go out one at a time.
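If you want to go one step further than sending requests one at a time, here is a minimal Python sketch that also spaces them out. It assumes your own code provides a `send` function that makes a single API call; `run_sequentially`, `send`, `clock` and `sleep` are illustrative names, not part of the Timetastic API. Spacing calls 0.2 seconds apart keeps the average at or below 5 per second, but we haven't described exactly how the server counts its window here, so you should still be ready to handle a 429.

```python
import time

MAX_PER_SECOND = 5  # the limit described above


def run_sequentially(items, send, max_per_second=MAX_PER_SECOND,
                     clock=time.monotonic, sleep=time.sleep):
    """Call send(item) for each item, one at a time, spaced evenly.

    `send` is a hypothetical function supplied by your own code that
    performs one API call and returns its result.
    """
    min_interval = 1.0 / max_per_second
    results = []
    last_sent = None
    for item in items:
        if last_sent is not None:
            # Wait out whatever is left of the minimum gap between calls.
            wait = min_interval - (clock() - last_sent)
            if wait > 0:
                sleep(wait)
        last_sent = clock()
        results.append(send(item))
    return results
```

The `clock` and `sleep` parameters are only there so the pacing can be tested without real waiting; in normal use you can leave them at their defaults.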

What happens if I hit the limit?

The request will fail with a 429 (Too Many Requests) status code. You'll also get some information in the body of the response telling you what the current rate limits are.
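One way to cope with a 429 is to wait a little and try again. The sketch below is only an illustration: it assumes a `send` function of your own that makes one API call and returns a `(status, body)` pair, and the retry count and delays are arbitrary choices rather than anything we recommend officially.

```python
import time


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry a call that was rejected with 429, backing off between tries.

    `send` is a hypothetical zero-argument function from your own code
    that returns a (status_code, body) tuple.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Wait 1s, 2s, 4s, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    # Still rate limited after the final attempt: hand the 429 back.
    return status, body
```

If the last attempt is still rejected, the caller gets the 429 and its body back, so the rate-limit details in the response are not lost.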

Might this change in the future?

This limit may be reviewed and changed in the future. You can find the latest rate-limit information by logging into Timetastic and heading to the API page. You can also review the response headers we send back when you call the API - in particular, we pass back "X-Rate-Limit-Limit" and "X-Rate-Limit-Remaining", which detail the current rate limit period (5s for 5 seconds) and how many calls you have remaining for that period.
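As a rough sketch of how you might read those two headers in Python, the helpers below take the response headers as a plain dictionary. The function names are made up for this example, and the parsing assumes the period looks like a whole number followed by "s" (as in "5s") and that the remaining count is a plain integer; any other format is returned unparsed or as `None`.

```python
def read_rate_limit(headers):
    """Return (period, remaining) from a dict of response headers.

    Header names are matched case-insensitively. `period` is the raw
    header value (for example "5s"); `remaining` is an int, or None
    if the header is missing or not a plain number.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    period = lowered.get("x-rate-limit-limit")
    raw = lowered.get("x-rate-limit-remaining")
    remaining = int(raw) if raw is not None and raw.strip().isdigit() else None
    return period, remaining


def period_seconds(period):
    """Convert a period such as "5s" to seconds, or None if unrecognised."""
    if period and period.endswith("s") and period[:-1].isdigit():
        return int(period[:-1])
    return None
```

For example, if `remaining` comes back as 0 you could pause for `period_seconds(period)` seconds before sending the next request.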

When does it come into force?

We'll be activating the API limit on the 20th February 2018.