An interesting bug

As Jonathan has blogged, "What is the hardest bug you've ever tackled?" is an interesting conversation starting point with engineers, one that I often use to start (phone) interviews. I usually rephrase it as "Describe an interesting or difficult bug that you ran into", since "hardest" often causes people to freeze up as they ponder whether the bug they have in mind is actually the hardest. In any case, most bugs become interesting if you ask "why?" enough.

Along these lines, here's a bug that I ran into in mid-2007 while I was working on Google Reader: soon after a production push, some users complained that Reader wasn't loading properly when they reloaded the page. Stranger still, others said that it wasn't working at first, but started working after a few reloads. Checking things in the office revealed similarly inconsistent results: Reader would load for some of us but not for others. For those for whom it hadn't loaded, the cause turned out to be a 404 returned when fetching Reader's main JavaScript file.

This happened soon after Gears support was added to Reader, so we initially suspected some interaction with offline support. Perhaps some users had an old version of the HTML that linked to a version of the JavaScript file that we didn't serve anymore. However, some quick Dremel-ing showed that we had never served the URLs that triggered 404s until the push began. Stranger still, not all requests for those URLs resulted in a 404; only about half did.

At this point a bit of background about Reader's JavaScript infrastructure is necessary. As previously mentioned, Reader uses the Closure Compiler for processing and minimizing its JavaScript. Reader compiles at runtime, since it supports per-user experiments that would make it prohibitive to compile all combinations at build or push time. Instead, when a user requests their JavaScript file, the set of experiments for them is determined, and if we haven't encountered that combination before, a new variant is compiled and served. JavaScript (and other static resources) is served with a checksum of its contents in the filename. This allows each URL to be served with a far-future cache expiration header, while ensuring that when the content changes, users pick up the new version by virtue of having a new URL to fetch.
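
As a rough sketch of the checksummed-URL scheme (the names and structure here are hypothetical, not Reader's actual serving code), the idea is to derive the filename from a digest of the compiled output, so the URL changes exactly when the content does:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Hypothetical sketch: content-addressed URLs for compiled JavaScript.
public class ResourceUrls {
  static String checksum(String compiledJs) throws Exception {
    byte[] digest = MessageDigest.getInstance("MD5")
        .digest(compiledJs.getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  static String scriptUrl(String compiledJs) throws Exception {
    // The checksum is embedded in the filename, so the response can be served
    // with a far-future cache expiration header; any change to the compiled
    // output yields a new URL that clients will fetch afresh.
    return "/reader/js/main-" + checksum(compiledJs) + ".js";
  }
}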

The JavaScript URL is used in two places: once embedded as a <script src="..."> tag in the HTML, and once when requesting the file itself. The aforementioned compilation and serving steps happen independently on each (identical) frontend machine, and it turned out that some machines had one idea of what the URL should be while others had another. Since the frontends are stateless, a user was quite likely to request the JavaScript from a different machine than the one that had served the HTML containing the URL. If the two machines disagreed about the URL, the request would 404. However, if the user reloaded enough times, they would eventually hit a pair of machines that agreed on the JavaScript URL.

The frontends were producing seemingly identical JavaScript, but diffing the variants revealed a slight difference (which explained the difference in checksums). One variant contained return/^\s*$/.test(str == null ? "" : String(str)) while the other had return/^\s*$/.test((str == null ? "" : String(str))) (note the extra parentheses in the test() argument). The /^\s*$/ regular expression was distinctive enough that it was easy to identify this as the compiled version of the Closure function goog.string.isEmptySafe, which is defined as:

goog.string.isEmptySafe = function(str) {
  return goog.string.isEmpty(goog.string.makeSafe(str));
};

The goog.string.isEmpty and goog.string.makeSafe calls get inlined, hence the regular expression test and the String() call appearing directly in the output (note that the implementations may have changed slightly since 2007).

Now that I knew where to look, I began turning compiler passes off until the output became stable, and it became apparent that the inlining pass itself was responsible. The functions were not always inlined in the same order (i.e. goog.string.isEmpty and then goog.string.makeSafe, or vice versa), and in one of the orderings the compiler decided to add extra parentheses for safety. Specifically, when inlining, the compiler would check whether the replacement AST node was of lower precedence than the one it was replacing. If it was, a set of parentheses was added to make sure that the meaning was not changed.
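
The safety check itself is simple in principle. As an illustration only (this is not the actual Closure Compiler code), it amounts to something like:

class InlineParens {
  // Illustration of the precedence-based safety check, not the real compiler
  // implementation: when substituting an inlined expression, wrap it in
  // parentheses if it binds less tightly than the node it replaces, so the
  // surrounding expression keeps its original meaning.
  static String maybeParenthesize(String replacement, int replacementPrecedence,
      int replacedPrecedence) {
    return replacementPrecedence < replacedPrecedence
        ? "(" + replacement + ")"
        : replacement;
  }
}

Depending on which of the two nested calls was inlined first, this check could fire in one ordering and not the other, which is where the extra parentheses came from.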

The current compiler inlining pass is very different from the one used at that point, but the relevant detail here is that the compiler used a HashSet to keep track of which functions needed to be inlined. The set held Function instances, where Function was a simple class holding a couple of Rhino Node references. Most importantly, it defined neither equals() nor hashCode(), so the default identity-based comparisons and hash codes (effectively memory addresses) were used.
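
A minimal demonstration of the underlying issue (using a stand-in class, not the compiler's actual Function) is that a HashSet of objects relying on the default identity hashCode() can iterate in a different order from one JVM instance to the next:

import java.util.HashSet;
import java.util.Set;

public class HashSetOrderDemo {
  // Stand-in for the compiler's Function class: no equals()/hashCode()
  // overrides, so Object's identity-based versions are used.
  static class Function {
    final String name;
    Function(String name) { this.name = name; }
  }

  public static void main(String[] args) {
    Set<Function> toInline = new HashSet<>();
    toInline.add(new Function("goog.string.isEmpty"));
    toInline.add(new Function("goog.string.makeSafe"));
    // The order printed here depends on the identity hash codes assigned at
    // runtime, so it can differ between JVM invocations and between machines.
    for (Function f : toInline) {
      System.out.println(f.name);
    }
  }
}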

When actually inlining functions, the compiler pass would iterate over the HashSet, and since the Function instances corresponding to goog.string.isEmpty and goog.string.makeSafe had different addresses on different machines, they could be encountered in different orders. The fix was to track the functions to inline in a List instead (Set semantics weren't necessary, especially since Function instances used identity comparisons, so duplicates weren't possible anyway).
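
The shape of the fix, again sketched with hypothetical names rather than the actual compiler code, is just to preserve insertion order:

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the fix: a List iterates in insertion order, so every
// machine processes the functions identically, regardless of identity hash codes.
public class InlineOrderFix {
  static class Function {
    final String name;
    Function(String name) { this.name = name; }
  }

  public static void main(String[] args) {
    List<Function> toInline = new ArrayList<>();
    toInline.add(new Function("goog.string.isEmpty"));
    toInline.add(new Function("goog.string.makeSafe"));
    for (Function f : toInline) {
      System.out.println(f.name);  // always isEmpty first, then makeSafe
    }
  }
}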

The inlining compiler pass had used a HashSet for a long time, so I was curious why this only manifested itself then. The explanation turned out to be prosaic: this was the first Reader release in which goog.string.isEmptySafe was used, and there were no other places with nested inlineable function calls. (This bug happened around the time we switched to JDK 6, which changed HashSet internals, but we hadn't actually made the switch at that point, so it was not involved.)

None of this was reproducible when running a frontend locally or in the staging environment, since those setups have a single frontend instance (they receive very little traffic). In those cases, no matter which version was compiled and which URL was generated, it was guaranteed to be servable. To prevent similar bugs from recurring, I added a unit test that compiled Reader's JavaScript locally several times and made sure that the output did not change. Though not foolproof, it has caught a couple of other such problems before releases made it out to production.
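
The test is conceptually simple; a sketch of its structure (compileReaderJs() is a hypothetical placeholder for the real compilation step, and this is not the actual test code) would look something like:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

// Sketch of a determinism test: compile the same input repeatedly and assert
// that the output never changes.
public class CompilationDeterminismTest {
  @Test
  public void compiledOutputIsStable() {
    String first = compileReaderJs();
    for (int i = 0; i < 5; i++) {
      assertEquals("Compiled JavaScript differed between runs", first, compileReaderJs());
    }
  }

  private String compileReaderJs() {
    // Placeholder: the real test would invoke the same compilation pipeline
    // that the frontends use and return the compiled JavaScript as a string.
    return "compiled output";
  }
}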

The main reason I enjoyed fixing this bug was that it involved non-determinism. However, unlike other non-deterministic bugs that I've run into, the triggering conditions were not so mysterious that it took months to solve.

1 Comment

great read. thanks for posting it!

incidentally, one useful technique I adopted a while back is to always run multiple instances of a server in staging, qa, etc if there are multiple in prod. same for canarying, etc. it doesn't guarantee you'll find problems like these sooner, but it's much more likely.

btw, I never realized how different your mobile site design looks. I like it!
