Timetastic downtime - 20th Feb
by Matt Roberts
On the 20th February, 2018, at about 20:10, Timetastic had a problem and was out of action until just after 22:00. We're very sorry about this. It's the last thing we want for our customers.
This blog post goes into detail about what happened, and what we are going to do to reduce the likelihood of it happening again.
We host Timetastic in Microsoft's Azure cloud. At 20:12, Azure started experiencing problems in its "UK South" region, which is the region Timetastic runs in. This took our database out of action. The following is taken from the Azure status page:
Summary of impact: Between 20:12 and 21:50 UTC on 20 Feb 2018, a subset of customers in UK South may have experienced difficulties connecting to resources hosted in this region. Impacted services included Storage, Virtual Machines, Azure Search and Backup. Some Virtual Machines may have experienced unexpected reboots.
Preliminary root cause: Engineers continue to investigate a potential power event that occurred in the region, impacting a single storage scale unit.
Mitigation: The impacted storage scale unit automatically recovered.
Next steps: Engineers will continue to investigate to establish the detailed root cause, and the full root cause analysis report will be posted on this Status History page and in the Azure Service Health blade of customers' management portals once completed.
Once it was clear that the problem wasn't going to resolve quickly, we switched the Timetastic database over to our secondary failover. As soon as this operation was complete, we were able to bring everything back online again.
Mitigation Strategy / Lessons learned
We were able to successfully switch (fail over) to our database backup hosted in another region. However, it took a while a) for us to identify that the failover was going to be required, and b) for the failover to complete.
We're working with Microsoft to determine a more efficient solution to this, so that in the event that Azure has a similar issue, we can fail over in a much smaller time window - ideally with no downtime at all!
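To make point a) above a little more concrete, here is a rough sketch of one common way to shorten the "do we need to fail over?" decision: probe the primary database on a schedule and switch once a run of consecutive probes has failed. This is purely illustrative and is not a description of how Timetastic is built. The names `FailoverMonitor`, `check_primary` and `trigger_failover`, and the threshold of 3, are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverMonitor:
    """Counts consecutive failed health checks against a primary database."""

    check_primary: Callable[[], bool]     # hypothetical probe: True if the primary responds
    trigger_failover: Callable[[], None]  # hypothetical action: switch to the secondary
    max_failures: int = 3                 # assumed threshold, not a recommendation
    failures: int = 0
    failed_over: bool = False

    def tick(self) -> bool:
        """Run one probe and fail over once the threshold is reached.

        Returns True only on the call that triggers the failover.
        """
        if self.failed_over:
            return False  # already switched; nothing more to do
        try:
            healthy = self.check_primary()
        except Exception:
            healthy = False  # a probe that raises counts as a failed check
        if healthy:
            self.failures = 0  # any success resets the streak
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.trigger_failover()
            self.failed_over = True
            return True
        return False
```

In practice `tick()` would be called on a timer. The threshold is a trade-off: a low value reacts faster but risks failing over on a brief blip, while a high value avoids false alarms at the cost of longer downtime.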
Once again, we're sorry to anyone who was affected by this.