
A Troubleshooter’s Tale

Three Principles and a Big Ball of Mud

by Mike Hostetler

  Sometimes fixing the problem is the last thing you should do.  

Once I was put in charge of a big behemoth of an application. There were no unit tests, no actual design and, really, no sign of any sort of craftsmanship. Just quick fix upon quick fix upon quick fix. It was nothing but a Big Ball of Mud.

I was told that there was one particular recurring error that the users complained about occasionally. It was a data-quality issue, and the previous developer gave me instructions on how to fix it manually. Sure enough, I soon got a phone call from a user who said the problem had happened again. Before hanging up, the user said off-handedly, “I don’t know why our software never works!”

Me: “What do you mean? I thought this only happens once in a while.”

User: “No! This happens all the time!”

So I apologized profusely and did the quick fix for his specific problem. Then I cleared my schedule and dedicated at least the rest of my day to this issue.

Reproducing it was easy. That only took a few minutes. And then I dug into the code to figure out what was going on.

Getting through the muck inside this application was difficult to say the least, but I finally found what was causing it. Oddly enough, there was a large comment in that section of code, discussing the exact problem I was looking at. They had a fix, but the comment said that there had to be a better way and that “we will fix it soon.” The date on that comment was four years ago.

So I put a log statement in that code, ran what I needed to in order to reproduce the problem, but nothing—no log message! I studied the code a bit more and found, in another section, the same code (sans comment) that was copied and pasted in. I put my log message in there and—bingo! We have a hit! The bug was still there, but at least I knew where it was coming from.

I then extracted the well-commented code into its own method, removed the copied part and had it call the new method. Another test showed that my new method was being called.
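The extraction step can be sketched roughly as follows. This is only an illustration in Python: the original application's language is not stated, and the names `normalize_entry`, `import_entries`, `save_entry`, and the logger name are hypothetical stand-ins, not the real code.

```python
import logging

logger = logging.getLogger("mudball")  # hypothetical logger name


def normalize_entry(entry):
    """Single home for logic that used to exist as two pasted copies."""
    # The log call confirms which code path actually reaches this logic.
    logger.info("normalize_entry called for %r", entry)
    return entry.strip().lower()


def import_entries(entries):
    # Stand-in for the place that held the well-commented copy.
    return [normalize_entry(e) for e in entries]


def save_entry(entry):
    # Stand-in for the place that held the pasted, uncommented copy.
    return normalize_entry(entry)
```

With both call sites routed through one function, a single log statement (enabled with something like `logging.basicConfig(level=logging.INFO)`) shows every execution, and any later fix only has to be made once.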

Now that the problem was isolated, I could finally look at the code. The cause was unbelievably simple, now that I could see it. Instead of making a copy of the object and then changing its state, it was changing the state first, and then copying it. The user was right, of course—this never, ever worked! I made the change and, presto! Things were now working.
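Here is a minimal sketch of that kind of ordering bug, written in Python for illustration. The `Record` class and its `status` field are invented for the example and are not taken from the real application.

```python
import copy


class Record:
    """Hypothetical stand-in for the object in the story."""

    def __init__(self, status):
        self.status = status


def revise_buggy(original):
    # Bug: the state changes first, so the original is modified too
    # and the "copy" is just a snapshot of the already-changed object.
    original.status = "revised"
    return copy.copy(original)


def revise_fixed(original):
    # Fix: copy first, then change only the copy.
    revised = copy.copy(original)
    revised.status = "revised"
    return revised
```

After `revise_buggy(record)`, both the original and the returned object carry the new status, so the old state is lost. After `revise_fixed(record)`, the original is untouched and only the copy changes.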

The user was happy to hear that I had fixed the problem, once and for all.

The story above is one of my many adventures in debugging or troubleshooting applications or systems of applications. The big, strange problems almost always end up being something silly that someone (usually me) forgot to do or simply overlooked. These experiences have led me to develop a simple process that helps get the problem fixed once and for all, and generally in a timely manner.

It’s a three-step process:

  1. Be able to easily reproduce the problem.

  2. Find the cause.

  3. Fix the problem.

It seems easy, doesn’t it? Yet the steps aren’t always easy to follow.

Easily Reproduce The Problem

If you can’t reproduce the problem easily, how can you ever be sure that your fix worked? Would you bet your job that you fixed a problem that you were not able to reproduce? Probably not.

And note that I said “easily.” If it takes you a day or even an hour to set up your application to the state where the bug occurs, then that isn’t “easy.” Sure, it can take you a day or two to do your first reproduction, but after that it shouldn’t take more than 5-10 minutes to do it again. This is important, because you will be doing this over and over again when you finally get to fixing the problem.

Sometimes disciplining yourself to start by reproducing the problem is difficult. And yes, when you are called at home at 2:30 in the morning, maybe a Band-Aid is what is needed. But when you get to the office, after getting your cup of caffeine, you should try to reproduce the problem with the pre-Band-Aided code. That way you can see whether your Band-Aid is a permanent fix, or (hopefully) you will see something better to do.

Note that in my story above, that’s the first thing that I did. Before opening the code base, I went to see if I could reproduce the problem.

Find The Cause

People often think they know immediately what the problem is, but a lot of the time the cause is not as simple as you think. Usually you only get to see the big side-effect, not what might be the simple underlying cause. In my situation, the previous developers decided they had fixed the problem, but never went back to see if it worked. Or they just thought that it was an anomaly, never willing to dig into it. Heck, no one even seemed to ask a user if the problem was fixed!

So when I took on the task of digging into it, I spent most of my time trying to isolate the exact place where the bug was. Note that I reproduced the problem again every time I added a log message, just so I knew what was actually executing.

Fix The Problem

Finally, you can fix the problem. It’s amazing how many developers jump straight to this step, yet it’s really the last thing you do. You make an attempt at a fix, and then you try to reproduce the problem again. If it’s fixed, great. If not, start figuring out why and make more changes, then try it again. It’s an iterative process—keep reproducing the problem and changing code until it goes away.

When you finally have it fixed, you should then start testing other parts of your code to see if you broke anything else. Unit tests are wonderful for this—just run your suite again.
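One way to make that re-checking cheap is to capture the reproduction as a unit test, so the same check runs with the rest of the suite after every change. The sketch below uses Python's standard `unittest` module; the `Account` class and the `with_deposit` function are hypothetical examples, not code from the application in the story.

```python
import copy
import unittest


class Account:
    """Hypothetical object standing in for real application data."""

    def __init__(self, balance):
        self.balance = balance


def with_deposit(account, amount):
    # Code under test: copy first, then change only the copy.
    updated = copy.copy(account)
    updated.balance += amount
    return updated


class DepositRegressionTest(unittest.TestCase):
    def test_original_is_left_untouched(self):
        original = Account(100)
        updated = with_deposit(original, 25)
        # This assertion fails if the original is ever mutated again.
        self.assertEqual(original.balance, 100)
        self.assertEqual(updated.balance, 125)
```

Saved in a test module, this can be run with `python -m unittest`. If the old behavior ever creeps back in, the first assertion fails and the regression is caught before a user has to report it.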

I said that it’s common to jump right into fixing the problem, but some developers really don’t want to fix the problem at all. They would rather blame the compiler, the application server, the OS, or some library they are using. This is a cop-out. There are plenty of people using the same tools you are and they manage somehow. And, while these tools may have bugs in them, that’s not something you can control; you only have control of your application. So yes, you may have to work around some flaky library or some weirdness with the application server. That’s part of your job.

One more note on fixing the problem: it may not involve changing any code. Sometimes it’s a configuration issue, or some application you depend on has a problem that the maintainers don’t know about.

To the user yelling for a problem to be fixed, it may seem that it ought to be easy to dig to the bottom of these application issues, but we all know that it isn’t. But if you use the iterative process of reproduce, find, and fix, you will find it much easier to solve the problem and satisfy the user.

Mike got his first computer at the age of 11 and hasn’t looked back since. After many years in the trenches of technical support and QA testing, he has found himself working as a software developer for a Fortune 500 company in Omaha, NE. When not at his day job, Mike does freelance work specializing in open source solutions for small business. He can be contacted by email at
