Timetastic... faster and more reliable than ever.

Good news - our growth shows no sign of abating, and we now have over 120K people using Timetastic. To keep Timetastic running smoothly for all these users, the engineers have spent the past few months focused on improving speed and reliability.

Some of you may be interested in the two changes that have given us the greatest improvements:

Cloudflare.

We've started using Cloudflare with Timetastic. Without getting too technical, Cloudflare is like a Swiss Army knife of features designed to help websites and web apps run faster and more safely. It helps deliver content quicker, and provides additional protection if hackers try to run malicious code against the app.

Load balancing.

We are now running Timetastic from two data centres - one near London, and one near Cardiff (both operated by Microsoft's Azure cloud). If something bad were to happen at one data centre, you would automatically be switched to the other one, keeping everything running at all times.

We've also done the same with our database - all data is replicated (or copied) to another "backup" database in a different data centre. If something bad were to happen to the main database, we switch to the backup, again keeping everything up and running.

What can be condensed into a short blog post was no mean feat, though. A lot of work goes into ensuring Timetastic continues to run fast and stable, and we hope you feel the benefit.


Timetastic downtime - 20th Feb

On the 20th February 2018, at about 20:10, Timetastic had a problem and was out of action until just after 22:00. We're very sorry about this; it's the last thing we want for our customers.

This blog post goes into detail about what happened, and what we are going to do to help reduce the likelihood of it happening again.

What happened?

We host Timetastic in Microsoft's Azure cloud. At 20:12, they started experiencing problems in their "UK South" region, which is what Timetastic uses. This took our database out of action. The following is taken from the Azure status page:

Summary of impact: Between 20:12 and 21:50 UTC on 20 Feb 2018, a subset of customers in UK South may have experienced difficulties connecting to resources hosted in this region. Impacted services included Storage, Virtual Machines, Azure Search and Backup. Some Virtual Machines may have experienced unexpected reboots.

Preliminary root cause: Engineers continue to investigate a potential power event that occurred in the region, impacting a single storage scale unit.

Mitigation: The impacted storage scale unit automatically recovered.

Next steps: Engineers will continue to investigate to establish the detailed root cause, and the full root cause analysis report will be posted on this Status History page and in the Azure Service Health blade of customers' management portals once completed.

After it was clear that the problem wasn't going to resolve very swiftly, we switched the Timetastic database to use our secondary failover. As soon as this operation was complete, we were able to bring everything back online again.

Mitigation Strategy / Lessons learned

We were able to successfully switch (fail over) to our database backup hosted in another region. However, it took a while a) to identify that a failover was going to be required, and b) for the failover to complete.

We're working with Microsoft to determine a more efficient solution to this - so that if Azure has a similar issue, we can fail over in a much smaller time window - ideally with no downtime at all!

Once again, we're sorry to anyone who was affected.


Rate limiting the Timetastic API

Usage of the API has increased recently, so to keep things running nice and smooth for everyone we need to introduce a rate limit.

What's the limit?

Quite simple - 5 calls per second per customer API key. Our logs show that most consumers should be fine with this, but if you've written code that fires multiple API calls at the same time, then you may want to change that to work sequentially, so the requests go out one at a time.
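
If you'd like a starting point for sending calls one at a time, here is a minimal sketch in Python. It is not official Timetastic client code: the RequestPacer class and run_sequentially function are names made up for this example, and each item in "calls" stands in for whatever function you use to make a single API request. It simply spaces calls at least 0.2 seconds apart, which works out to no more than 5 starting per second. How the server counts its one-second window is an assumption on our part, so you may want to leave a little extra headroom.

```python
import time


class RequestPacer:
    """Hypothetical helper: keeps consecutive calls at least 1/max_per_second apart."""

    def __init__(self, max_per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock          # injectable so the pacing can be tested
        self._sleep = sleep
        self._next_allowed = 0.0     # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


def run_sequentially(calls, pacer):
    """Run zero-argument callables one at a time, pacing each one."""
    results = []
    for call in calls:
        pacer.wait()
        results.append(call())
    return results
```

To use it, create one RequestPacer per API key and pass your list of request functions to run_sequentially instead of firing them all at once.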

What happens if I hit the limit?

You'll get a 429 (Too Many Requests) status code and the request will fail. You'll also get some information in the body of the response telling you what the current rate limits are.
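
One way to cope with a 429 is to wait a moment and try again. The sketch below is only an illustration of that idea, not something the API requires: call_with_retry is a made-up name, "send" is a placeholder for your own function that makes one request and returns a (status_code, body) pair, and the doubling delay (1s, 2s, 4s) is our assumption rather than a documented recommendation.

```python
import time


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns something other than 429, or attempts run out.

    `send` is a hypothetical zero-argument callable that performs one API
    request and returns a (status_code, body) tuple.
    """
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        # Rate limited: wait longer each time before trying again,
        # but don't bother sleeping after the final attempt.
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return status, body
```

If every attempt is rate limited, the last 429 response is handed back so your code can decide what to do next.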

Might this change in the future?

This limit may be reviewed and changed in the future. You can find the latest rate-limit information by logging in to Timetastic and heading to the API page. You can also review the response headers we send back when you call the API - in particular, we pass back "X-Rate-Limit-Limit" and "X-Rate-Limit-Remaining" which detail the current rate limit period (5s for 5 seconds), and how many calls you have remaining for that period.
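
As a rough sketch of how you might read those two headers in Python: read_rate_limit is a made-up helper name, "headers" is assumed to be a plain dictionary of response headers from whichever HTTP library you use, and we're assuming the remaining count arrives as a whole number. The period is passed through as-is rather than parsed, since only the "5s" style is described here.

```python
def read_rate_limit(headers):
    """Pull the rate-limit period and remaining call count out of response headers."""
    # Header names are case-insensitive, so normalise them before looking up.
    lowered = {name.lower(): value for name, value in headers.items()}
    period = lowered.get("x-rate-limit-limit")
    remaining = lowered.get("x-rate-limit-remaining")
    return {
        "period": period,  # left as text, e.g. "5s"
        "remaining": int(remaining) if remaining is not None else None,
    }
```

If "remaining" comes back as 0, that's a good moment to pause before sending your next call.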

When does it come into force?

We'll be activating the API limit on the 20th February 2018.