Fixing Bugs When You Can’t Reproduce the Problem
The most difficult bugs to fix are the bugs you can’t reproduce. Most bug reports only describe the effect of a bug, but to fix a bug, you need to know its cause. It is a lot easier to determine a bug’s cause when you can trace its execution path, and if you can reliably reproduce a bug, then you can confirm that your changes actually fix it. Sometimes, however, you will be faced with bugs which you cannot reproduce. In this post, I will recount the story of one such bug, and hopefully provide you with some tips for dealing with similar problems.
My boss recently assigned me a bug that had been reported by a few of our customers. Nobody in the company had been able to reproduce it, and I wasn’t confident that I’d do any better. However, this bug was causing our app to crash on users’ computers, so it was important to fix.
Step 1: Determine Why the Bug is Unreproducible
There are many reasons why a bug can be difficult to reproduce. It could be present only on machines with specific hardware or software, the bug description may be missing an important step, or the bug may only happen sometimes. If I could determine why no one was able to reproduce the bug, I might be able to create the conditions necessary to reproduce it.
The users had no hardware or software in common, and the app crashed on launch, before the users had a chance to do anything. I tried every way I knew to start the app, but none of them caused it to crash. It did appear, however, that the app wasn’t crashing very often.
There were only 24 crash reports for this issue, and although we had only recently released the product, I knew that there were tens of thousands of people using it. Furthermore, no user had reported the problem more than once. I did some math and estimated that the app crashed once or twice for every thousand launches.
It’s important to note that at this point, I didn’t actually know anything. I hadn’t even confirmed that there was an actual bug. Although unlikely, it was possible that all the crash reports were being faked by someone. On the other hand, the issue could have been widespread, with most users opting not to send us their crash reports. It was also possible that the crash reports were being caused by a different bug that was writing the same garbage data to the stack, thus creating crash reports which hid the real problem. This last possibility was not just theoretical: I had seen crash reports like that for previous versions of this app.
However, I did have a testable hypothesis: If I launched the app 5000 times, it had a 99.3% chance of crashing at least once. To test this hypothesis, I used Frank to write a simple script that would repeatedly launch and quit the app, waiting until the app finished launching before quitting. If the app crashed at any time, the script would terminate. I then added some logging to the app to make sure that the report’s stack trace was accurate.
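The launch-and-quit loop described above can be sketched roughly as follows. This is not the Frank script from the post; it is a minimal Python sketch that assumes one launch-and-quit cycle can be driven by a single command which exits with a nonzero status when the app crashes. The names `crashed_on_launch` and `launch_until_crash` are hypothetical.

```python
import subprocess


def crashed_on_launch(command, timeout_s=60):
    """Run one launch-and-quit cycle; treat a nonzero exit status as a crash.

    Assumption: `command` launches the app, waits for it to finish
    launching, quits it, and reports a crash through its exit status.
    """
    result = subprocess.run(command, timeout=timeout_s)
    return result.returncode != 0


def launch_until_crash(launch_once, max_launches):
    """Call `launch_once` repeatedly and stop at the first crash.

    Returns the 1-based number of the launch that crashed, or None if
    every launch succeeded.
    """
    for attempt in range(1, max_launches + 1):
        if launch_once():
            return attempt
    return None
```

Keeping the loop separate from the launcher means the stopping logic can be checked with a fake launcher before committing to a run that lasts many hours.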
The next day I found that, after launching about 1,300 times, the app crashed on launch. My hypothesis seemed to be correct. Unfortunately, it took 10 hours to cause the crash. It would take a very long time to fix this problem through trial and error.
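The 99.3% figure in the hypothesis can be checked with a short calculation. The sketch below assumes that launches are independent and that the crash rate is one per thousand launches, the low end of the estimate; the function name is made up for illustration.

```python
def chance_of_at_least_one_crash(crash_rate, launches):
    """Probability of one or more crashes, assuming independent launches."""
    # The chance of surviving every launch is (1 - rate) ** launches,
    # so the chance of at least one crash is the complement.
    return 1.0 - (1.0 - crash_rate) ** launches


# One crash per thousand launches, 5000 launches.
print(f"{chance_of_at_least_one_crash(0.001, 5000):.1%}")  # 99.3%
```

At two crashes per thousand launches the same formula gives an even higher chance, so 99.3% is the conservative end of the range.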
Step 2: Deduce the Approximate Cause
Now that I understood why the bug was difficult to reproduce, I needed to understand more about the bug itself. The crash was caused by a function calling itself recursively until it exhausted all the space on the stack. Normally, these kinds of bugs aren’t too difficult to fix, but this function should not have been calling itself at all.
This function was, however, being used to override another function using mach_override, which lets you replace the implementation of one function with another, and allows you to call the original function from within its replacement. I’m not going to go into great technical detail about mach_override in this post, but I will post a follow-up explaining how it works in more detail.
(N.B.: Using mach_override is dangerous, and not something you should do unless you absolutely have to. In this case, we needed to use mach_override to implement a feature central to our product. If you use mach_override, it’s important to know how it works. mach_override is complicated and esoteric, so if you have a problem with it, you’re unlikely to find an expert to help you. You could find yourself 80% finished with development, only to be blocked by an issue with mach_override that you can’t figure out on your own.)
In order to allow you to call the original function, mach_override returns a function pointer which will call the original function. We were saving this function pointer as a global variable, and calling it from the replacement function. The code looked something like this.
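As a stand-in for the pattern just described, here is a toy model in Python. It is not the app's C code and it does not use mach_override: `toy_override` is a hypothetical substitute for `mach_override_ptr` that swaps an entry in a table instead of patching machine code, and `original_func` and `patch_table` are invented for illustration. Only the shape of the pattern is meant to carry over: the returned callable is saved in a global and called from the replacement.

```python
# Toy model only: a table entry stands in for the patched function's code.
patch_table = {}


def original_func(x):
    return x + 1


patch_table["target"] = original_func


def toy_override(name, replacement):
    """Hypothetical stand-in for mach_override_ptr.

    Installs `replacement` and returns a callable that reaches whatever
    implementation was installed before the call.
    """
    previous = patch_table[name]
    patch_table[name] = replacement
    return previous


reentry_func = None  # saved globally, as described in the post


def replacement_func(x):
    # Do the extra work, then defer to the original via the saved callable.
    return reentry_func(x) * 2


def handle_one_time_notification():
    global reentry_func
    reentry_func = toy_override("target", replacement_func)


handle_one_time_notification()
print(patch_table["target"](1))  # 4: the replacement runs, then the original
```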
The stack trace included the exact address where replacementFunc was recursively calling itself, so I loaded up the app in lldb and disassembled replacementFunc. It turned out that reentryFunc was calling our replacement function instead of the original function. I now knew the approximate cause of the bug, but I still didn’t know why it was happening.
Step 3: Read the Source
Having access to the source code is almost essential when debugging these kinds of problems. If the crash is occurring in a proprietary framework (including Apple’s frameworks), you will probably have to disassemble or otherwise reverse-engineer the code in question. Thankfully, mach_override is open source.
After spending about an hour reading the code, I concluded there was nothing wrong with mach_override, so I turned my attention to the handleOneTimeNotification function. There was nothing wrong with this function, but I did find something peculiar about a function that it called. It looked something like this.
This check ensured that the function body would only be executed once, as long as it wasn’t called from multiple threads. Since handleOneTimeNotification was only supposed to be called once, this seemed odd. The check might not be necessary, but it got me thinking about what would happen if mach_override_ptr was called twice.
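A run-once check of this kind, and the reason for the caveat about threads, can be sketched as follows. This is an illustrative Python sketch, not the original function: the names and the `log` parameter are hypothetical, and the locked variant is shown only to make the threading caveat concrete.

```python
import threading

_already_ran = False


def only_called_from_handler(log):
    """Unsynchronized check: safe on one thread, racy on several."""
    global _already_ran
    if _already_ran:
        return
    # Two threads could both reach this point before either sets the flag.
    _already_ran = True
    log.append("ran")


_lock = threading.Lock()
_already_ran_locked = False


def only_called_from_handler_locked(log):
    """Same check under a lock, so only one caller can ever pass it."""
    global _already_ran_locked
    with _lock:
        if _already_ran_locked:
            return
        _already_ran_locked = True
    log.append("ran")
```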
I concluded that if mach_override_ptr is called more than once with the same initial function and reentry function pointer, the reentry function will call the replacement function instead of the original function. I will explain why this happens in my follow-up post.
To test if calling mach_override_ptr twice would really cause the app to crash, I inserted a duplicate call to handleOneTimeNotification. Sure enough, the app crashed with a stack trace identical to the users’ stack traces.
Next, I reverted my changes and added a guard to handleOneTimeNotification to prevent it from being called twice. I also added some logging to determine how often handleOneTimeNotification was being called. Finally, I started my script and let it run for a couple days in a VM.
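The double-install failure and the guard can be modelled together in a few lines. The sketch below is a toy: `ToyPatcher` is a hypothetical stand-in that keeps the current implementation in a slot, and it only mirrors the observed behaviour. It makes no claim about how mach_override works internally. The call counter plays the role of the added logging.

```python
class ToyPatcher:
    """Toy stand-in for an override library: a slot holds the current code."""

    def __init__(self, original):
        self.slot = original

    def override(self, replacement):
        # Return whatever was installed before, as the "reentry" callable.
        previous, self.slot = self.slot, replacement
        return previous


def run_handler(times, guarded):
    """Fire the notification `times` times, then call the patched function."""
    patcher = ToyPatcher(lambda x: x + 1)
    state = {"reentry": None, "installed": False, "calls": 0}

    def replacement(x):
        return state["reentry"](x) * 2

    def handle_one_time_notification():
        state["calls"] += 1  # stands in for the added logging
        if guarded and state["installed"]:
            return  # the guard: never install the override twice
        state["installed"] = True
        state["reentry"] = patcher.override(replacement)

    for _ in range(times):
        handle_one_time_notification()
    try:
        return patcher.slot(1), state["calls"]
    except RecursionError:
        # After a second install, the reentry callable is the replacement.
        return "infinite recursion", state["calls"]


print(run_handler(1, guarded=False))  # (4, 1)
print(run_handler(2, guarded=False))  # ('infinite recursion', 2)
print(run_handler(2, guarded=True))   # (4, 2)
```

In the guarded run the handler is still called twice, which the counter records, but the override is installed only once.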
In the meantime, I asked my coworker, who wrote the check in onlyCalledFromHandleOneTimeNotification, about the history behind that check. He told me that the “one-time” notification does sometimes get sent twice when the app starts up, and that the check predates any use of mach_override in our app. There were also good reasons why we couldn’t unregister our notification handler or prevent the notification from being sent twice.
When my test finished, it showed that the app did not crash after 5000 launches. Two of those launches called handleOneTimeNotification twice. I had proven that calling mach_override_ptr twice caused the app to crash, and that I had fixed that problem.
This debugging process won’t fix all bugs. I was lucky enough to have access to the mach_override source code, as well as the developer who wrote the code I was working on. You have to adapt to the circumstances of the bug you’re working on, but the same general approach will go a long way towards fixing any bug: gather as much information as possible, formulate hypotheses, and test all your assumptions.