How to Perform Incident Post-Mortems: Identify Root Cause with “Five Whys”
Modern software is complex, increasingly so. Hardware infrastructures connect with other software platforms, all of which can fail. Even with careful design and extensive testing, incidents happen.
An incident happens any time software behaves differently than expected. It can be as simple as one user not being able to download a CSV file, or as severe as none of an application's users being able to log in.
“The important thing to realize is that failure is going to happen. It’s not a question of if, it’s a question of when.” — Paul Hammond, in his seminal 2009 talk with John Allspaw, “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr.”
In a DevOps culture, where continuous integration and continuous deployment (CI/CD) of cloud-based services have increased deployment frequency, the probability of incidents arising is high. That is simply the price of delivering value as often as possible. Consequently, the only realistic approach to decreasing downtime is an organized plan for incident response and management that includes post-mortems to identify the root cause of incidents.
Why You Should Invest in Incident Analysis
A production incident usually means the system is down for some time. But downtime means dissatisfied customers, damage to brand reputation, and revenue loss. For Fortune 1000 companies, the average total cost of unplanned application downtime is $1.25 billion to $2.5 billion per year.
But instead of panicking about the inevitability of incidents and their cost to your organization, try looking at them under a different light: incidents are learning opportunities.
Analyzing them will reveal insights about the organization that you might not have realized before. It will uncover relevant information about your team’s processes and contribute to a better understanding of what is failing. Analyzing incidents therefore lays the foundation for preventing the same failures from happening again. You can learn from the past to build a better future.
If you feel sure about an incident’s cause, you might be tempted to skip a formal analysis. Follow through with the process anyway: it gives the rest of your team a picture as clear as yours, which improves their future contributions to the team and to the customers. You might even want to share your findings with customers. More than an admission of guilt, this is a way to build trust with them.
Why You Should Figure Out The Root Cause of An Incident
In a fast-paced environment, tools like version control and continuous delivery make it easy to “undo” an incident. Often, incidents happen when a bug is pushed into production, and rolling back that change can quickly revert the situation. While this is helpful for the teams and gets the service working correctly again in a short amount of time, it doesn’t provide any intelligence on why the incident happened in the first place.
In order to generate learning opportunities, any incident management process should include five phases:
1. Preparation
It’s essential for any DevOps team to be prepared for incidents. Identifying weaknesses in the system and setting up monitoring tools and system alerts helps team members know what to do when an incident is detected.
2. Detection and Escalation
DevOps teams usually have several members on call and available for escalations. If the on-call engineers cannot solve the issue, they can bring in the right people to facilitate the incident’s resolution.
3. Resolution
This step is all about taking the necessary measures to fix the issue. It is when the problem is solved and the system goes back to functioning properly again.
4. Analysis
The analysis phase of incident management is often referred to as a “post-mortem” or Root Cause Analysis (RCA). While, historically, this phase has mostly been about performing RCA, as systems grow in complexity, teams increasingly look towards models that address complexity, such as the Cynefin framework.
5. Readiness
This is when the process comes full circle. Once an incident has been fixed and the system is restored, the team should reassess its readiness for the next incident. Ideally, everyone will have learned important lessons from the post-mortem that equip them to better deal with upcoming incidents.
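To make the cycle concrete, the five phases can be sketched as a tiny state machine. This is only an illustration of the loop described above, not a prescribed implementation; the phase names are our own shorthand:

```python
from enum import Enum, auto

class Phase(Enum):
    """The five incident-management phases, as a simple lifecycle."""
    PREPARATION = auto()
    DETECTION = auto()   # detection and escalation
    RESOLUTION = auto()
    ANALYSIS = auto()    # the post-mortem / RCA phase
    READINESS = auto()

# Each phase advances to the next; Readiness loops back to Preparation,
# which is how the process "comes full circle."
NEXT = {
    Phase.PREPARATION: Phase.DETECTION,
    Phase.DETECTION: Phase.RESOLUTION,
    Phase.RESOLUTION: Phase.ANALYSIS,
    Phase.ANALYSIS: Phase.READINESS,
    Phase.READINESS: Phase.PREPARATION,
}

def advance(phase: Phase) -> Phase:
    """Move an incident to the next phase of the lifecycle."""
    return NEXT[phase]
```

Modeling Readiness as a transition back to Preparation captures the point of the fifth phase: every analyzed incident should leave the team better prepared for the next one.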
How the “Five Whys” Can Help You Reach Root Cause
Let’s focus on the fourth phase of incident handling, Analysis. As mentioned, this phase is mostly about identifying the root cause of an incident.
In his book Build Better Software: How to Improve Digital Product Quality and Organizational Performance, DevOps expert and author Kristian Erbou takes a closer look at how to identify root cause and perform incident analysis by adopting the “Five Whys.”
Start by asking why an incident happened and keep repeating the question “Why?” to each answer, five times in a row. Here is an example:
Q: What was the error?
A: I opened the website and got an error trying to log in with my Facebook account.
Q: Why did it return an error? (first Why)
A: The Facebook login hasn’t worked since the release on February 1.
Q: Why not? (second Why)
A: The API key for our Facebook login was incorrect.
Q: Why was the API key for the Facebook login incorrect? (third Why)
A: One configuration setting was incorrect on LIVE, meaning that Facebook rejected our login requests in their API.
Q: Okay, how could that happen?
A: The wrong configuration file was copied from the developer’s workstation without anybody noticing that it was the wrong file.
Q: Why didn’t anybody notice? (fourth Why)
A: We don’t test before a release that the configuration files are the right ones for our environment.
Q: Why don’t you perform regression tests prior to a release? (fifth Why)
A: There is no process in place to ensure that only valid configuration files are copied into production.
When we say “Five Whys,” five is not a strict number. Sometimes you’ll reach the root cause with only three “Whys”; other times you might need seven. The point is to keep asking until you reach the root cause. Only by gaining this knowledge will you be able to correct the problem that generated the whole incident. This is how you learn and truly improve incident management in your organization.
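In the example above, the fifth “Why” pointed at a missing safeguard: nothing checked that a valid configuration file reached production. The sketch below shows what such a pre-release check might look like; the file layout, the key name (`facebook_api_key`), and the expected values are all hypothetical, since the actual configuration scheme will vary per project:

```python
import json

# Hypothetical: the configuration value each environment is expected to carry.
EXPECTED_KEYS = {
    "live": "fb-key-live-placeholder",
    "staging": "fb-key-staging-placeholder",
}

def validate_config(path: str, environment: str) -> list[str]:
    """Return a list of problems found in the config file.

    An empty list means the file may be promoted to the given environment.
    """
    problems = []
    with open(path) as f:
        config = json.load(f)

    key = config.get("facebook_api_key")
    if key is None:
        problems.append("facebook_api_key is missing")
    elif key != EXPECTED_KEYS[environment]:
        problems.append(f"facebook_api_key does not match the {environment} key")
    return problems
```

Wired into a deployment pipeline as a gate (deploy only when the list is empty), a check like this addresses the root cause directly: a developer’s workstation file copied to LIVE by mistake would be caught before Facebook ever rejected a login request.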
How to Perform A Post-Mortem
As mentioned, the analysis phase of incident management is often called a “post-mortem.” If this is not something your teams are used to doing, introducing post-mortems can be challenging. The process can quickly turn into a game of blaming and finger-pointing that benefits nobody in the end. If you are considering introducing it into your organization, there are some guidelines you can follow:
Stay Away from Finger Pointing
This is the most crucial rule to follow when performing a post-mortem. Focusing on finding the guilty people and blaming them for what happened causes more harm than good. Instead, focus on making sure that the whole team learns from the incident and performs better next time.
Appoint a Dedicated Incident Lead
Appoint a dedicated lead whose focus is enforcing a post-mortem for each incident. Having a dedicated lead responsible for handling the incident from start to finish increases the likelihood that all important details are captured during the post-mortem, contributing to its success.
Share your Findings
Document the knowledge acquired with each post-mortem in a way that is accessible to the whole organization, for example in a wiki page. This way, every team member and, consequently, the organization as a whole, will benefit from the lessons learned.
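For example, each post-mortem’s wiki page might follow a simple template like the one below. The section names are a suggestion, not a standard; adapt them to your organization:

```markdown
# Post-Mortem: <incident title>

- Date and duration:
- Severity (low / moderate / severe):
- Incident lead:

## Timeline
When was the incident detected, escalated, and resolved?

## Root Cause (Five Whys)
Record each "Why?" and its answer until the root cause is reached.

## Corrective Actions
What will prevent this class of incident from recurring? Who owns each action, and by when?
```

A consistent template makes post-mortems easier to write, easier to search later, and easier to compare across incidents.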
If you don’t have a culture of post-mortems in your organization yet, start with small steps. Not all incidents are equal. They are usually put into categories of low, moderate, and severe, based mostly on the impacted functionality, the number of users affected, and the duration of downtime. Naturally, start by analyzing severe incidents, as these cause the most damage to your organization and your customers. As the post-mortem culture becomes ingrained in your organization, you can proceed to moderate- and low-severity incidents.
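As a sketch of that categorization, a team might encode its own thresholds in a small helper. The numbers below are purely illustrative assumptions, not industry standards; each organization should set thresholds that reflect its own service and customers:

```python
def classify_severity(users_affected: int, downtime_minutes: int,
                      core_functionality_down: bool) -> str:
    """Bucket an incident as low, moderate, or severe.

    Thresholds are illustrative: a real team would tune them to its
    user base, SLAs, and definition of "core" functionality.
    """
    if core_functionality_down or users_affected > 1000 or downtime_minutes > 60:
        return "severe"
    if users_affected > 50 or downtime_minutes > 10:
        return "moderate"
    return "low"
```

Even a rough rule like this helps a team starting out: it makes the “analyze severe incidents first” policy mechanical rather than a per-incident debate.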
As we have seen, the DevOps culture leads to more frequent deployments, which can have the unwanted consequence of generating more incidents. Therefore, you could ask why you would even want to adopt DevOps. Maybe, instead, you should just do larger deployments, less often?
Depending on the type of service you provide, that could be a better option, but not in all cases. Cloud-based applications, for example, operate in such a competitive landscape that you must deploy often, providing value at a regular pace, to stay competitive in the market.
Either way, DevOps brings about a lot more good than harm: it lowers the cost of downtime when it happens, it eases on-call burnout, and, overall, it makes your customers more satisfied with the service or product you provide.
Don’t be afraid to fail. Be afraid of not learning from your organization’s failures. Failures will always happen, but if you do your post-mortems right, with each new incident your teams will grow stronger and better equipped to handle the next one.