March 13, 2013
A couple of weeks ago Microsoft had one of their worst days and watched as a massive outage took out Azure services globally. The outage lasted about a day and took down a huge number of sites and applications. After a few hours it became apparent that the outage was caused by an expired SSL certificate – a seemingly simple cause with a massive impact. I wrote a very strongly worded post at the time, which many people mistook for a comment on outages in general. People rightly pointed out that Microsoft is hardly the first cloud provider to suffer an outage, and will undoubtedly not be the last.
What people didn’t understand, however, was that my comment didn’t relate to the outage in particular, but rather to its root cause. Cloud systems are highly complex and made up of a myriad of component parts – sometimes failures in those parts, or in the connections between them, occur. The key thing here is to adopt an approach that moves a provider towards anti-fragility, the concept where IT organizations continually learn from their issues to improve the ultimate reliability of the services they provide. So outages aren’t the problem per se, but when a company has multiple outages that stem from the same root cause, alarm bells should start ringing.
Microsoft has now finished its analysis of the outage and published a post detailing what occurred, why it occurred and the steps they’re taking to ensure the same thing won’t happen again. Some key points from the post are quoted below:
Windows Azure uses an internal service called the Secret Store to securely manage certificates needed to run the service. This internal management service automates the storage, distribution and updating of platform and customer certificates in the system. This internal management service automates the handling of certificates in the system so that personnel do not have direct access to the secrets for compliance and security purposes.
The certificates in question originate from this Secret Store and are stored locally on each of the Azure storage nodes. They all expired within three minutes of each other – when that time was reached, the certificates became invalid and all connections to storage over HTTPS were rejected (plain HTTP was, of course, still operational). The interesting thing here, and of real concern to Azure customers, is the root cause behind these certificates not being updated. The fact that the same certificates were used across all regions is a further concern.
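To make the "three minutes of each other" concern concrete, here is a minimal sketch of the kind of check that would surface certificates clustered around the same expiry time. It is purely illustrative: the certificate names, the dates and the one-hour window are my own assumptions and have nothing to do with how Microsoft actually inventories its certificates.

```python
from datetime import datetime, timedelta


def find_expiry_clusters(expiries, window=timedelta(hours=1)):
    """Return groups of certificate names whose expiry times sit close together.

    `expiries` maps a (hypothetical) certificate name to its expiry datetime.
    A certificate joins the current group when it expires within `window`
    of the previous certificate in sorted order. Groups of one are dropped.
    """
    ordered = sorted(expiries.items(), key=lambda item: item[1])
    clusters, current = [], []
    for name, when in ordered:
        # Start a new group when the gap to the previous expiry is too large.
        if current and when - current[-1][1] > window:
            clusters.append(current)
            current = []
        current.append((name, when))
    if current:
        clusters.append(current)
    return [[name for name, _ in group] for group in clusters if len(group) > 1]


# Invented example data: three certificates expiring minutes apart, one far later.
example = {
    "blob": datetime(2030, 1, 1, 12, 0),
    "queue": datetime(2030, 1, 1, 12, 1),
    "table": datetime(2030, 1, 1, 12, 2),
    "other": datetime(2030, 6, 1, 12, 0),
}
clustered = find_expiry_clusters(example)
```

Any group this returns is a candidate for staggering, since a single missed renewal deadline would take out every certificate in it at once.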
For context, as a part of the normal operation of the Secret Store, scanning occurs on a weekly basis for the certificates being managed. Alerts of pending expirations are sent to the teams managing the service starting 180 days in advance. From that point on, the Secret Store sends notifications to the team that owns the certificate. The team then refreshes a certificate when notified, includes the updated certificate in a new build of the service that is scheduled for deployment, and updates the certificate in the Secret Store’s database. This process regularly happens hundreds of times per month across the many services on Windows Azure.
In this case, the Secret Store service notified the Windows Azure Storage service team that the SSL certificates mentioned above would expire on the given dates. On January 7th, 2013 the storage team updated the three certificates in the Secret Store and included them in a future release of the service. However, the team failed to flag the storage service release as a release that included certificate updates. Subsequently, the release of the storage service containing the time critical certificate updates was delayed behind updates flagged as higher priority, and was not deployed in time to meet the certificate expiration deadline. Additionally, because the certificate had already been updated in the Secret Store, no additional alerts were presented to the team, which was a gap in our alerting system.
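The gap described in that last sentence can be sketched in a few lines. This is my own simplified model rather than anything from Microsoft's post: the `CertStatus` record and its field names are hypothetical, and the only figure borrowed from the post is the 180-day warning window. The point is that an alert driven solely by the store's copy goes quiet as soon as the store is updated, whereas an alert driven by whichever copy expires first keeps firing until the new certificate is actually deployed.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class CertStatus(NamedTuple):
    """Hypothetical record tracking both copies of one certificate."""
    name: str
    store_expires: datetime     # expiry of the copy held in the central store
    deployed_expires: datetime  # expiry of the copy actually serving traffic


WARN_WINDOW = timedelta(days=180)  # warning window quoted in the post


def store_only_alert(cert, now):
    """Alert based on the store's copy alone (the behaviour with the gap)."""
    return cert.store_expires - now <= WARN_WINDOW


def effective_alert(cert, now):
    """Alert based on whichever copy expires first."""
    effective = min(cert.store_expires, cert.deployed_expires)
    return effective - now <= WARN_WINDOW


# Invented example: the store was refreshed, but the release has not shipped.
now = datetime(2030, 1, 7)
cert = CertStatus(
    name="storage-ssl",
    store_expires=datetime(2032, 1, 7),
    deployed_expires=datetime(2030, 2, 22),
)
quiet = store_only_alert(cert, now)   # False: the store looks healthy
loud = effective_alert(cert, now)     # True: production is still at risk
```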
And so the certificates expired, and the Azure storage services went down. Microsoft went into disaster recovery mode and worked through some scenarios to have service restored – to their credit they took the time to test and validate fixes to ensure that no downstream problems were caused. In situations like this it’s always tempting to take knee-jerk actions in an attempt to restore service and, in doing so, cause unintended secondary outages.
Anyway – the team found a fix and deployed it – services came back up and customers could breathe easy. But very quickly the questions came – I’ve had more engagement from readers on that blog post than on almost any other post I’ve written over my career. People wanted to know how the outage happened and, more importantly, what steps Microsoft was taking to ensure something like this wouldn’t happen again. The key things people want clarity on are the prevention of outages, and detection and recovery when outages do occur. From the blog post:
We will be expanding our monitoring of certificates expiration to include not only the Secret Store, but the production endpoints as well, in order to ensure that certificates do not expire in production.
Our processes for recovery worked correctly, but we continue to work to improve the performance and reliability of deployment mechanisms.
We will put in place specific mechanisms to do critical certificate updates and exercise these mechanisms regularly to provide a quicker response should an incident like this happen again.
We will improve the detection of future expiring certificates deployed in production. Any production certificate that has less than 3 months until the expiration date will create an operational incident and will be treated and tracked as if it were a Service Impacting Event.
We will also automate any associated manual processes so that builds of services that contain certificate updates are tracked and prioritized correctly. In the interim, all manual processes involving certificates have been reviewed with the teams.
We will examine our certificates and look for opportunities to partition the certificates across a service, across regions and across time so an uncaught expiration does not create a widespread, simultaneous event. And, we will continue to review the system and address any single points of failure.
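For readers wondering what monitoring "the production endpoints as well" might look like in practice, here is a minimal sketch using only the Python standard library. It is an illustration under my own assumptions, not Microsoft's tooling: the function names are invented, and I have read "less than 3 months" as 90 days. Note also that with default verification the TLS handshake fails outright once a certificate has already expired, so a real monitor would need to treat that failure as an incident too.

```python
import socket
import ssl
from datetime import datetime, timezone

THRESHOLD_DAYS = 90  # assumption: "less than 3 months" interpreted as 90 days


def days_remaining(not_after, now):
    """Whole days between `now` and a certificate's notAfter string.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jun 15 12:00:00 2030 GMT". `now` must be timezone-aware.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def should_open_incident(not_after, now, threshold_days=THRESHOLD_DAYS):
    """True when the certificate is inside the incident threshold."""
    return days_remaining(not_after, now) < threshold_days


def fetch_not_after(host, port=443, timeout=5.0):
    """Read notAfter from the certificate a live endpoint actually serves."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `fetch_not_after` for each endpoint and pass the result, together with `datetime.now(timezone.utc)`, to `should_open_incident`. The design choice that matters is that the check asks the endpoint itself rather than a database that records what the endpoint is supposed to be serving.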
This is a fairly robust mea culpa. While one can look back and say that the outage was inexcusable (which it was), the most important thing going forward is that measures are put in place to continually improve the reliability of the service. In the case of SSL certificates, the steps that Microsoft has undertaken would appear to greatly reduce the chance of an outage like this recurring. The bigger issue is the culture of continuous improvement. It’s fair to say that Azure has had a couple of very serious wake-up calls, and the depth of their investigation into this incident gives me a degree of faith that they’ll apply the learnings more broadly to improve not only their SSL certificate handling procedures, but also their approach to how they run their services overall.
At the time of writing Microsoft was suffering yet another outage, this time affecting Outlook.com and Hotmail – I’ll reserve judgement on that particular slip-up for now…