Good Engineers Perform Root Cause Analysis

As software engineers our job is to solve problems, but are you solving the right one?

Hey Friends 👋,

Most of us like to think we are pretty good at solving problems. After all, as software engineers that is what we are paid to do.

In some cases, however, it is very difficult to work out whether you are solving the correct problem. Before solving any problem it is important to take a step back and ask yourself what could be causing the problem and whether you are actually solving the root cause.

Let's take a non-technical example from my personal experience.

My wife suffered from fibromyalgia for over 10 years. For those not aware, it causes chronic pain all over the body, making life pretty difficult and miserable. In some cases it was so bad my wife couldn't get out of bed.

Fibromyalgia is the name of the condition, but it really is just a catch-all term for chronic pain that has no known cause. This is what it says on the NHS website:

The exact cause of fibromyalgia is unknown, but it's thought to be related to abnormal levels of certain chemicals in the brain and changes in the way the central nervous system (the brain, spinal cord and nerves) processes pain messages carried around the body.

The "solution" prescribed by the doctor for fibromyalgia is very strong painkillers that can also make you fall asleep. Great...

My wife and I noticed, though, that when she was pregnant all her fibromyalgia pains disappeared, but they came back a few months after giving birth.

When my second child was born we quickly realised that she had quite a severe milk allergy. We ended up cutting dairy completely from our diet which was a shame as my wife used to love cheese.

However, giving up cheese was worth it as my wife's fibromyalgia was gone. Occasionally my wife gets a craving for something that has dairy in it and if she eats too much the pain comes back.

It turns out consuming too much calcium can interfere with the absorption of magnesium which can cause all over body pain. Growing a baby uses a lot of calcium, so this was why she didn't get pain during pregnancy. Solving the root cause saved my wife from a lifetime of pain and unnecessary medication.

Not performing root cause analysis and just treating the symptoms is often the quickest way to deal with a problem. That problem, however, will come back, and it is going to cost you more in the long run than spending time to solve the issue properly in the first place.

I see these sorts of problems come up a lot in software, and it is all too tempting to take the easy path.

PROBLEM: Application keeps running out of memory and crashing.

EASY SOLUTION: Increase the amount of memory on the server.

PROBLEM: Database keeps reaching 100% CPU.

EASY SOLUTION: Migrate to a more powerful server instance.

PROBLEM: API call keeps timing out.

EASY SOLUTION: Increase the timeout.

All of these solutions are just treating the symptoms without actually treating the root cause.

If your application keeps running out of memory then you may have a memory leak or an inefficient loop.

A database that keeps reaching 100% CPU is often suffering from a badly written query that needs to be optimised.

Not solving the root cause will just cause the problem to come back, especially in a rapidly scaling system.

Chances are it will come back at 1am when you are on call. So it is in your interest to get to the root of the problem.

How to perform Root Cause Analysis

Hopefully I have convinced you that doing root cause analysis is worth the effort, so how do we actually do it?

1. Define the problem

The first step is to take a proper look at the problem. Write down all the details and what symptoms it is causing. This should be detailed enough that you could give it to someone else, and they would be able to understand the problem.

2. Gather data

Next, we need to gather as much data as we can about the problem and the space that it is in.

- When does the problem occur? Is it on a regular schedule?

- Is there anything else running at the same time?

- Is the problem gradually getting worse?

- What is the surrounding infrastructure like?

3. Make sense of the data

Once you have all the necessary data you need to map out the possible scenarios and try and see if the data correlates anywhere.

Is there a maintenance job that runs at 2 am that could be causing the slow database responses at the same time?

Are timeouts only occurring at certain points of the day or under a certain load?

Just like you used to do in science class, come up with some theories as to why the problem might be occurring based on your data.
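As a small illustration of looking for a pattern, here is a sketch that groups timeout errors by hour of the day. The timestamps are invented sample data and the function names are my own. In practice you would pull the timestamps from your logs or monitoring tool.

```python
from collections import Counter
from datetime import datetime

# Made-up timestamps of timeout errors, as they might appear in a log export.
timeout_log = [
    "2023-10-02T02:01:13",
    "2023-10-02T02:04:55",
    "2023-10-02T14:30:02",
    "2023-10-03T02:02:41",
    "2023-10-03T02:07:19",
    "2023-10-04T02:03:08",
]


def timeouts_per_hour(timestamps):
    """Count how many timeouts fall into each hour of the day (0-23)."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


def busiest_hours(timestamps, top=3):
    """Return (hour, count) pairs with the most timeouts, busiest first."""
    return timeouts_per_hour(timestamps).most_common(top)


if __name__ == "__main__":
    for hour, count in busiest_hours(timeout_log):
        print(f"{hour:02d}:00  {count} timeouts")
```

With this sample data, five of the six timeouts land in the 02:00 hour, which is exactly the kind of cluster that should make you ask what else is running at 2 am.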

4. Test out your theories

Once you have some theories as to why the problem is occurring you need to test them. These experiments will help you determine if your theories are correct or not.

For example, say you suspect a maintenance job is causing the 2 am slow-downs. What happens if you delay that job by 30 minutes? Does the slow-down get delayed as well?

If possible, try to replicate your scenario with a repeatable test such as a unit test or an integration test. When I find a bug I always write a failing test first if I can; then I know I have fixed it when the test passes.
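Here is a minimal sketch of what that test-first approach can look like. The bug is hypothetical: a `page_count` function that used plain integer division and so dropped the last, partially filled page. The tests are written as plain functions with `assert`, so they can be picked up by a runner such as pytest or simply called directly.

```python
def page_count(total_items: int, page_size: int) -> int:
    """Return how many pages are needed to show total_items.

    The hypothetical buggy version was `total_items // page_size`,
    which silently dropped the last, partially filled page.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Ceiling division using integers only.
    return (total_items + page_size - 1) // page_size


def test_partial_last_page_is_counted():
    # Written first: the buggy version returned 2 here, so this test failed.
    assert page_count(25, 10) == 3


def test_exact_multiple_is_unchanged():
    # Guards against "fixing" the bug by always adding one page.
    assert page_count(20, 10) == 2


def test_no_items_means_no_pages():
    assert page_count(0, 10) == 0
```

The first test reproduces the bug, and the other two make sure the fix does not break the cases that already worked.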

5. Document your analysis

The last step is important not only for other people but for you as well. Write down everything you did to solve this problem. Chances are that in a few months' time you will remember seeing a similar problem but might not remember how you solved it.

These documents can be a great way for others to learn how to problem-solve as well. I have had problems before that have needed me to `sh` into a Docker container and run commands manually to see the output. These are skills that other engineers might not have.

If you find you are struggling to find the root cause, try the 5 Whys approach. Like a kid asking their parent "why?" over and over again, do the same thing with your problem and see if you can get to the root of it.

This newsletter is free for everyone, but if you would like to support my work and my YouTube channel you can do so by becoming a patron on Ko-Fi.

All for less than a Pumpkin Spiced Latte ☕️🎃.

❤️ Picks of the Week

📝 Article - How two photographers captured the same millisecond in time. Nothing to do with tech, but I found this fascinating. 2 photographers completely unaware of each other managed to take nearly identical photos and pick that one photo out of all the ones they took to showcase.

🎬 Video - A Hackers' Guide to Language Models. If you are feeling left behind by the AI revolution then this is a great video to get started. It is a bit of a long one but very interesting.

🛠️ Tool - Bottlerocket. This looks really interesting if you are hosting your own Docker containers. It is a minimal, immutable Linux OS specifically designed for hosting containers.

📝 Article - iCloud Drive Silently Deletes Your Content. I have been burned by this before. If you open a file up on your phone from iCloud it doesn't always download the latest copy. In some cases just viewing a file such as a spreadsheet can be considered a change, and therefore you can end up overwriting your files.

📝 Article - Blocking Visual Studio Code embedded reverse shell before it's too late. If you don't plan on using the reverse tunnel feature of VS Code then it is worth disabling it otherwise it could be a potential security vulnerability. I did this by adding the following URLs to my Pi-Hole block list:


🎬 Video - Lecture, Fall 2023, MIT 6.5940. Some great free lectures on machine learning.

💬 Quote of the Week

Beautiful code is short and concise, so if you were to give that code to another programmer they would say, "oh, that's well written code." It's much like as if you were writing a poem.

From Deep Work (affiliate link) by Cal Newport.