Microsoft Azure suffered an outage affecting Azure AD authentication (Americas) on Monday, September 28th 2020, due to a software change in the service, which Microsoft reported was rolled back. As with other service health issues in cloud services, Microsoft keeps customers up to date with service issues and remediation activities. Azure Status is the main place to visit when reviewing cloud service health across all regions. For historical issues, the status history is also available to view.
I’ve seen a number of articles on several websites reporting the issues incorrectly and making broad assumptions about how Microsoft handles software code updates. It’s important to understand a few basic things before drawing such conclusions…
Let’s talk about outages…
This isn’t the first outage we will see on a cloud platform, and it won’t be the last.
Azure AD has an SLA, and it’s important to understand what the SLA is and why it is stated; don’t expect every cloud service to be up 100% of the time. Outages happened in traditional data centers too. When they happen, the causes are investigated and remediated, lessons are learned, and the affected service, or the process which affected the service, has an opportunity to be improved as part of continual service improvement.
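To make the SLA point concrete, here is a minimal sketch that converts an availability percentage into the downtime it still permits. The function name and the percentages are my own illustrative assumptions, not the published Azure AD SLA figures; check the current SLA document for the real number.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` days at a given SLA percentage."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


if __name__ == "__main__":
    # Illustrative percentages only, not taken from any specific SLA document.
    for sla in (99.9, 99.99):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% over 30 days allows about {minutes:.2f} minutes of downtime")
```

Even a 99.9% commitment leaves room for roughly 43 minutes of downtime in a 30-day month, which is why an SLA should never be read as a promise of 100% uptime.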
Failures occur in every industry
Unfortunately, there are failures and service outages across different industries that affect consumers on a daily basis. We can always remember things like:
The time power was lost in our houses
The time the car had a fault or a recall
The time when the internet connection dropped
How does this help?
Well, technically for the outage itself it doesn’t, but one of the key takeaways from a service outage is the remediation plan. The postmortem which Microsoft provides as part of its response is key to improving service availability, and to the governance that rectifies how the specific outage occurred so that similar issues do not occur again.
How does this help you?
It can be frustrating when outages occur, but in the long run you can expect that a similar outage should not recur in the same way. You can be sure that every outage is taken seriously and remediated, and you can expect further phases and releases that improve the service and keep outages to a minimum. I imagine that publicly we only see a very small percentage of the communication that occurs around such incidents. To Microsoft, such incidents are a matter of paramount importance.
Whilst customers can’t specifically design for an Azure AD outage, given that many Microsoft services rely on Azure AD, one of the key points of cloud computing is not to expect everything to be up and running 100% of the time. That said, every enterprise solution deployed to a public cloud platform should be designed for availability, with redundancy in mind, utilising the correct architectural patterns for each service component.
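One such pattern is retrying transient failures with an increasing delay, so a brief blip in a dependency does not immediately become an outage in your own solution. The sketch below is a generic illustration only: TransientError and call_with_retry are hypothetical names of my own, not part of any Azure SDK, and a retry will not help with a prolonged outage like the one described above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (for example, a timeout)."""


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on TransientError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            # Out of attempts: surface the last failure to the caller.
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

The sleep parameter is injectable so the wait can be replaced in tests; in production code you would typically also add random jitter and a cap on the delay.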
This has all reminded me of a quote by Tom Peters:
“Almost all quality improvement comes via simplification of design, manufacturing… layout, processes, and procedures.”