Whether you’re looking at your own code before (or after!) you have shipped it, or you’re picking up someone else’s code after they have shipped it, tracking down and fixing bugs is a fundamental part of programming. If you know the code well, intuition may take you straight to where the bug is. But how do you go about tracking down a bug when intuition doesn’t help?

The nature of all code is that larger systems are built from smaller underlying systems and components. They in turn are also constructed from smaller components. The bug you are tracking down will have a cause in one of these systems, and will have symptoms that are visible in other systems. The remaining systems work fine (as far as the bug you’re looking for is concerned), and you can use this to quickly and reliably find where the bug is.

Break your larger systems down into smaller systems at logical points, such as different server stacks, APIs, major interfaces, classes, methods and, if necessary, individual lines of code. Test both sides of the divide, with your tests focusing on the data that crosses the divide. If one side works as expected, the bug is not in there, and you can eliminate that side from further testing. Continue testing the remaining systems and components, which you have now isolated, by dividing those up into smaller systems and components. Keep going until you’ve reached the smallest testable system, component, unit, or lines of code that show the fault. Congratulations: you have isolated the fault.
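As a rough illustration of the divide-and-test idea, here is a minimal Python sketch. Everything in it is hypothetical: the four-stage pricing pipeline, the stage names, and the `all_non_negative` check are invented for the example. It assumes the data entering the pipeline is good, the data leaving it is bad, and that once the data goes bad it stays bad downstream.

```python
from typing import Any, Callable, List, Tuple

# A stage is a (name, function) pair; all names here are hypothetical.
Stage = Tuple[str, Callable[[Any], Any]]


def find_faulty_stage(
    stages: List[Stage],
    data: Any,
    is_valid: Callable[[Any], bool],
) -> str:
    """Return the name of the first stage whose output fails is_valid.

    Assumes the input passes the check, the final output fails it, and
    bad data stays bad once a stage has corrupted it.
    """
    # Capture the data crossing every divide: boundaries[i] is the data
    # after the first i stages have run.
    boundaries = [data]
    for _, fn in stages:
        boundaries.append(fn(boundaries[-1]))

    lo, hi = 0, len(stages)  # boundaries[lo] is good, boundaries[hi] is bad
    if not is_valid(boundaries[lo]) or is_valid(boundaries[hi]):
        raise ValueError("need good input and bad output to bisect")

    # Test the middle divide, then discard the half that works.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_valid(boundaries[mid]):
            lo = mid  # everything up to mid works; the fault is later
        else:
            hi = mid  # the data is already bad here; the fault is earlier
    return stages[hi - 1][0]


def all_non_negative(prices: List[int]) -> bool:
    """The property we expect at every divide: no negative prices."""
    return all(p >= 0 for p in prices)


# Hypothetical pipeline working on prices in cents.
STAGES: List[Stage] = [
    ("load", lambda xs: list(xs)),
    ("apply_discount", lambda xs: [x - 500 for x in xs]),  # bug: can go negative
    ("add_tax", lambda xs: [x + x // 5 for x in xs]),
    ("round", lambda xs: [x // 10 * 10 for x in xs]),
]

culprit = find_faulty_stage(STAGES, [1000, 300, 2500], all_non_negative)
print(culprit)
```

With four stages the saving is small, but the number of checks grows with the logarithm of the number of divides, which matters when each check means inspecting logs or payloads by hand.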

Apart from being a strategy that allows you to work on code you’ve never seen before, this approach also has the advantage that it is evidence-based. It eliminates guesswork, and it forces developers to challenge their assumptions about how their code actually works in practice. The data never lies, but be aware that it can be misinterpreted!

The approach is iterative, and you’ll often go back and forth between your code and your tests, making your code easier to test and giving your tests clearer, more targeted domains and results. Fix the tests that are relevant to the bug you are tracking down, and make a list of any other issues you find along the way so that you can come back and address them at a later date. Stay on target, and park potential tangents and distractions for another time.

Although this sounds like a slow process when described on paper, with practice it can be executed at high speed during an emergency situation. However, the need to restore service in a timely manner isn’t always compatible with this approach, and you’re normally better off returning to your test environment where you can study the fault without inconveniencing your customers any further.
