Accidental DDoSes I Have Known #

A couple of weeks ago I was migrating some networking code in Quip's Mac app from NSURLConnection to NSURLSession when I noticed that requests were being made significantly more often than I was expecting. While this was great during the migration (since it made exercising that code path easier), it was unexpected: the data should only have been fetched once and then cached.

After some digging, it turned out that we had a bug in the custom local caching system that sits in front of our CDN (CloudFront), which we use to serve profile pictures and other non-authenticated data. Due to a catch-22 in the cache key function (which made it depend on the HTTP response), all assets would initially not be found in the local cache, and would incur a network request. The necessary data was then stored in memory, so until the app was restarted the cache would work as expected, but in the next session the assets would get requested again.

Chart of CloudFront requests

It turned out that this bug had been introduced a few months prior, but since it manifested itself as a little bit of extra traffic to an external service, we didn't notice it (the only other visible manifestation would be that profile pictures would load more slowly during app startup, or be replaced with placeholders if the user happened to be offline, but we never got any reports of that).

This chart (of CloudFront requests from “Unknown” browsers, which is how our native apps are counted) shows the fix in action; the Mac app build with it was released on November 30th and was picked up by most users over the next few days.

This kind of low-grade accidental DDoS reminded me of a similar bug that I investigated a few years ago at Google, while working on Chrome Extensions. A user had reported that the Gmail extension for Chrome (which my team happened to own, since we provided it as sample code) would end up consuming a lot of memory (and eventually be terminated) if the Gmail URL that it tried to fetch data from was blocked by filtering software. After some digging it turned out that the extension would enqueue two retries for every failure response, due to code along these lines:

var xhr = new XMLHttpRequest();

... // Send off request

function handleError() {
    ... // schedule another request
}

xhr.onreadystatechange = function() {
    ... // Various early exits if success conditions are met
    
    handleError();
};

xhr.onerror = function() {
   handleError();
};

The readystatechange event always fires, including for error states that also invoke the error event handler. This behavior meant that it would quickly escalate from a request every few minutes to almost one request per second, depending on how long it remained in the blocked state. The fix turned out to be trivial, and since this was a separate package distributed via the Chrome Web Store that gets auto-updated, we could quickly fix the millions of users that had it installed.
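The post does not show the actual fix, so the following is only a minimal sketch of one way to avoid the double scheduling: guard the error path so that it runs at most once per request. The names `attachRetryHandlers` and `scheduleRetry` are hypothetical and not taken from the extension's code.

```javascript
// Hypothetical helper: ensures a failed request schedules at most one retry,
// even though both readystatechange and error fire for the same failure.
function attachRetryHandlers(xhr, scheduleRetry) {
  let handled = false;

  function handleErrorOnce() {
    if (handled) {
      return; // The other event handler already scheduled the retry.
    }
    handled = true;
    scheduleRetry();
  }

  xhr.onreadystatechange = function () {
    if (xhr.readyState !== 4) {
      return; // Request is still in flight.
    }
    if (xhr.status >= 200 && xhr.status < 300) {
      return; // Success, nothing to retry.
    }
    handleErrorOnce();
  };

  xhr.onerror = function () {
    handleErrorOnce();
  };
}
```

With this guard in place, a failure that triggers both events results in a single call to `scheduleRetry`, so the request rate no longer doubles on each failure.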

It then occurred to me that this would not just affect users where the Gmail URL was blocked, but any user that had spotty connectivity — any HTTP failure would result in a doubling of background requests. I then called up a requests-per-second graph of the Atom feed endpoint for Gmail (which is what the extension used), and saw that it had dropped by 20,000 requests per second over the day or so that it took for the extension update to propagate.

The upshot of all this is that Google Reader at its peak had about 10,000 requests per second, thus making my overall traffic contribution to Google net negative.

Some Observations Regarding JavaScriptCore's Supported Platforms #

SquirrelFish

JavaScriptCore (JSC) is the JavaScript engine that powers WebKit (and thus Safari). I was recently browsing through its source and noticed a few interesting things:

ARM64_32 Support

Apple Watch Series 4 uses the S4 64-bit ARM CPU, but running in a mode where pointers are still 32 bits (to save on the memory overhead of a 64-bit architecture). The watch (and its new CPU) were announced in September 2018, but support for the new ARM64_32 architecture was added in December 2017. That the architecture transition was planned in advance is no surprise (it's been in the works since the original Apple Watch was announced in 2015). However, it does show that JSC/WebKit is a good place to watch for future Apple ISA changes.

ARMv8.3 Support

The iPhone XS and other new devices that use the A12 CPU have significantly improved JavaScript performance when compared to their predecessors. It has been speculated that this is due to the A12 supporting the ARM v8.3 instruction set, which has a new floating point instruction that operates with JavaScript rounding semantics. However, it looks like support for that instruction was only added a couple of weeks ago, after the new phone launch. Furthermore, the benchmarking by the Apple engineer after the change landed showed that it was responsible for a 0.5%-2% speed increase, which while nice, does not explain most of the gain.

Further digging into the JSC source led to my noticing that JIT for the ARMv8.3 ISA (ARM64E in Apple's parlance) is not part of the open source components of JSC/WebKit (the commit that added it references a file in WebKitSupport, which is internal to Apple). So perhaps there are further changes for this new CPU, but we don't know what they are. It's an interesting counterpoint to the previous item, where Apple appears to want extra secrecy in this area. As a side note, initial support for this architecture was also added several months before the announcement (and references to ARM64E showed up more than 18 months earlier), thus another advance notice of upcoming CPU changes.

Fuchsia Support

Googler Adam Barth (hi Adam!) added support for running JSC on Fuchsia (Google's not-Android, not-Chrome OS operating system). Given that Google has its own JavaScript engine (V8), it's interesting to wonder why they would also want another engine running. A 9to5Google article has the same observation, and some more speculation as to the motivation.

Google Reader: A Time Capsule from 5 Years Ago #

Google Reader

It's now been 5 years since Google Reader was shut down. As a time capsule of that bygone era, I've resurrected readerisdead.com to host a snapshot of what Reader was like in its final moments; visit http://readerisdead.com/reader/ to see a mostly-working Reader user interface.

Before you get too excited, realize that it is populated with canned data only, and that there is no persistence. On the other hand, the fact that it is an entirely static site means that it is much more likely to keep working indefinitely. I was inspired by the work that Internet Archive has done with getting old software running in a browser — Prince of Persia (which I spent hundreds of hours trying to beat) is only a click away. It seemed unfortunate that something of much more recent vintage was not accessible at all.

Right before the shutdown I had saved a copy of Reader's (public) static assets (compiled JavaScript, CSS, images, etc.) and used it to build a tool for viewing archived data. However, that required a separate server component and was showing private data. It occurred to me that I could instead achieve much of the same effect directly in the browser: the JavaScript was fetching all data via XMLHttpRequest, so it should just be a matter of intercepting all those requests. I initially considered doing this via Service Worker, but I realized that even a simple monkeypatch of the built-in object would work, since I didn't need anything to work offline.
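As a rough illustration of the monkeypatching approach (not the code from the readerisdead project, which is written in TypeScript), the built-in object can be swapped for a stub that answers from a table of canned responses. The function name `installCannedXhr`, the response table, and the URL used below are all hypothetical, and the stub covers only the small part of the XMLHttpRequest surface that this sketch needs.

```javascript
// Hypothetical sketch: replace the global XMLHttpRequest with a stub that
// answers from an in-memory table of canned responses keyed by URL.
function installCannedXhr(globalObject, cannedResponses) {
  class CannedXhr {
    open(method, url) {
      this.method = method;
      this.url = url;
    }

    setRequestHeader() {
      // Headers are irrelevant for canned data, so they are ignored.
    }

    send() {
      const body = cannedResponses[this.url];
      // The response state is filled in immediately...
      this.readyState = 4;
      this.status = body === undefined ? 404 : 200;
      this.responseText = body === undefined ? "" : body;
      // ...but the callback is delivered asynchronously, as callers expect.
      setTimeout(() => {
        if (typeof this.onreadystatechange === "function") {
          this.onreadystatechange();
        }
      }, 0);
    }
  }

  globalObject.XMLHttpRequest = CannedXhr;
}
```

In a browser this would be called as `installCannedXhr(window, {...})` before the original application script is loaded, so that every request it makes is answered locally.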

The resulting code is in the static_reader directory of the readerisdead project. It definitely felt strange mixing this modern JavaScript code (written in TypeScript, with a bit of async/await) with Reader's 2011-vintage script. However, it all worked out, without too many surprises. Coming back to the Reader core structures (tags, streams, preferences, etc.) felt very familiar, but there were also some embarrassing moments (why did we serve timestamps as seconds, milliseconds, and microseconds, all within the same structure?).

As for myself, I still use NewsBlur every day, and have even contributed a few patches to it. The main thing that's changed is that I first read Twitter content in it (using pretty much the same setup I described a while back), with a few other sites that I've trained as being important also getting read consistently. Everything else I read much more opportunistically, as opposed to my completionist tendencies of years past. This may just be a reflection of the decreased amount of time that I have for reading content online in general.

NewsBlur has a paid tier, which makes me reasonably confident that it'll be around for years to come. It went from 587 paid users right before the Reader shutdown announcement to 8,424 shortly after to 5,345 now. While not the kind of up-and-to-right curve that would make a VC happy, it should hopefully be a sustainable level for the one person (hi Samuel!) to keep working on it, Pinboard-style.

Looking at the other feed readers that sprang up (or got a big boost in usage) in the wake of Reader's shutdown, they all still seem to be around: Feedly, The Old Reader, FeedWrangler, Feedbin, Inoreader, Reeder, and so on. One of the more notable exceptions is Digg Reader, which itself was shut down earlier this year. But there are also new projects springing up like Evergreen and Elytra, and so I'm cautiously optimistic about the feed reading space.

Efficiently Loading Inlined JSON Data #

I wrote up a post on the Quip blog about more efficiently embedding JSON data in HTML responses. The tl;dr is that moving it out of a JavaScript <script> tag and parsing it separately with JSON.parse can significantly reduce the parse time for large data sizes.
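A minimal sketch of the idea follows; the details in the Quip post may differ. It assumes the JSON is embedded in an element that the browser does not execute (for example a `<script>` tag with a non-JavaScript `type`), and that its text is then handed to `JSON.parse`. A string literal stands in for the element's `textContent` so that the snippet runs outside a browser, and `parseInlinedData` is a hypothetical name.

```javascript
// Stand-in for the textContent of a non-executed element such as
// <script type="application/json" id="initial-data">...</script>.
const inlinedText = '{"user": {"id": 42, "name": "Example"}, "items": [1, 2, 3]}';

// Hypothetical helper: parse the inlined text with the JSON parser instead
// of having the JavaScript parser evaluate it as an object literal.
function parseInlinedData(text) {
  return JSON.parse(text);
}

const initialData = parseInlinedData(inlinedText);
```

In a page, `inlinedText` would instead be read with something like `document.getElementById("initial-data").textContent`.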

Understanding WebKit behavior via lldb #

I recently ran into some puzzling WebKit scrolling behavior: child iframes mysteriously causing the main window to get scrolled. This was in the context of a Quip feature still under development, but I've recreated a simple test case for it, to make it easier to follow along. There are two buttons on the page, both of which dynamically create and append an <iframe> element to the page. They convey parameters to the frame via the fragment part of the URL; one button has no parameters and the other does, but they otherwise load the same content. The mysterious behavior that I was seeing was that the code path without parameters was causing the main window to scroll down (such that the iframe is at the top of the visible area).

With such a reduced test case it may already be obvious what's going on, but things were much less clear at the time that I encountered this. There were many possible causes since we had made a major frame-related infrastructure change when this started to happen. The only pattern was that it only seemed to affect WebKit-based browsers (i.e. Safari and especially our Mac app). After flailing for a while, I realized what I wanted most of all was a breakpoint. Specifically, if I could break in whatever function implemented page scrolling, then I could see what the trigger was. Some quick monkey-patching of the scrollTop window property showed that the scrolling was not directly initiated by JavaScript (indeed the bug could be reproduced entirely without JavaScript by inlining the iframe HTML directly). Therefore such a breakpoint needed to be on the native side (in WebKit itself) via lldb.
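The monkey-patching check mentioned above can be sketched roughly as follows. This is an assumption about the general technique rather than the exact code that was used: it shadows a property with an accessor that records a stack trace on every write. The helper name `trapPropertyWrites` is hypothetical. On a real DOM object the shadowing accessor would also prevent the native scroll from happening, which is fine for a quick diagnostic.

```javascript
// Hypothetical diagnostic helper: report every write to target[propertyName]
// together with a stack trace showing which JavaScript code performed it.
function trapPropertyWrites(target, propertyName, onWrite) {
  let currentValue = target[propertyName];
  Object.defineProperty(target, propertyName, {
    configurable: true,
    get() {
      return currentValue;
    },
    set(newValue) {
      // Capture the call stack at the point of the write.
      onWrite(newValue, new Error("write to " + propertyName).stack);
      currentValue = newValue;
    },
  });
}
```

If the page scrolls without the callback ever firing, the scroll was not caused by script writing to that property, which is what pointed toward the native side here.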

The first task was to attach a debugger to WebKit. It's been a few years since I've built it from source, and I didn't relish having to wait for the long checkout and build process. Unfortunately, lldb doesn't seem to want to be attached to Safari, presumably because System Integrity Protection (SIP) disallows debugging of system applications. Fortunately, nightly builds of WebKit are not protected by SIP, and they exhibited the same problem. To figure out which process to attach to (web content runs in a separate process from the main application), Apple's documentation revealed the helpful debug option to show process IDs in page title. Thus I was able to attach to the process rendering the problematic page:

$ lldb
(lldb) process attach --pid 15079
Process 15079 stopped
...

The next thing to figure out was what function to break in. Looking at the implementations of scrolling DOM APIs it looked like they all ended up calling WebCore::RenderObject::scrollRectToVisible, so that seemed like a promising choke point.

(lldb) breakpoint set -M scrollRectToVisible
Breakpoint 1: 2 locations.

(the output says that two breakpoints are set, since it also matches WebCore::RenderLayer::scrollRectToVisible, but that turned out to be a happy accident)

After using the continue command to resume execution and reproducing the problem, I was very happy to see that my breakpoint was immediately triggered. I could then get the stack trace that I was after:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.2
  * frame #0: 0x000000010753eda0 WebCore`WebCore::RenderObject::scrollRectToVisible(WebCore::SelectionRevealMode, WebCore::LayoutRect const&, bool, WebCore::ScrollAlignment const&, WebCore::ScrollAlignment const&)
    frame #1: 0x0000000106b5da64 WebCore`WebCore::FrameView::scrollToAnchor() + 292
    frame #2: 0x0000000106b55832 WebCore`WebCore::FrameView::performPostLayoutTasks() + 386
    frame #3: 0x0000000106b59959 WebCore`WebCore::FrameView::layout(bool) + 4009
    frame #4: 0x0000000106b5d878 WebCore`WebCore::FrameView::scrollToAnchor(WTF::String const&) + 360
    frame #5: 0x0000000106b5d659 WebCore`WebCore::FrameView::scrollToFragment(WebCore::URL const&) + 57
    frame #6: 0x0000000106b39c80 WebCore`WebCore::FrameLoader::scrollToFragmentWithParentBoundary(WebCore::URL const&, bool) + 176
    frame #7: 0x0000000106b389c8 WebCore`WebCore::FrameLoader::finishedParsing() + 120
    frame #8: 0x00000001069d3e0a WebCore`WebCore::Document::finishedParsing() + 266
    frame #9: 0x0000000106bfb322 WebCore`WebCore::HTMLDocumentParser::prepareToStopParsing() + 162
    frame #10: 0x0000000106bfc1b3 WebCore`WebCore::HTMLDocumentParser::finish() + 211
    ...

It looked like WebKit had decided to scroll to an anchor, which was surprising, since I wasn't expecting any named anchors in the document. After reading through the source of WebCore::FrameView::scrollToAnchor I finally understood what was happening:

// Implement the rule that "" and "top" both mean top of page as in other browsers.
if (!anchorElement && !(name.isEmpty() || equalLettersIgnoringASCIICase(name, "top")))
    return false;

As a side effect of the infrastructure change, the frame no longer had any parameters in the fragment part of the URL, but the code that was generating the URLs would always append a #. This empty fragment identifier would thus be marked as requesting a scroll to the top of the document. Once execution continued, we would end up in the previously-mentioned WebCore::RenderLayer::scrollRectToVisible method, which recurses into the parent frame, thus scrolling the whole document.

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x00000001074e0f80 WebCore`WebCore::RenderLayer::scrollRectToVisible(WebCore::SelectionRevealMode, WebCore::LayoutRect const&, bool, WebCore::ScrollAlignment const&, WebCore::ScrollAlignment const&)
    frame #1: 0x00000001074e143d WebCore`WebCore::RenderLayer::scrollRectToVisible(WebCore::SelectionRevealMode, WebCore::LayoutRect const&, bool, WebCore::ScrollAlignment const&, WebCore::ScrollAlignment const&) + 1213
    frame #2: 0x00000001074e143d WebCore`WebCore::RenderLayer::scrollRectToVisible(WebCore::SelectionRevealMode, WebCore::LayoutRect const&, bool, WebCore::ScrollAlignment const&, WebCore::ScrollAlignment const&) + 1213
    frame #3: 0x000000010753ee55 WebCore`WebCore::RenderObject::scrollRectToVisible(WebCore::SelectionRevealMode, WebCore::LayoutRect const&, bool, WebCore::ScrollAlignment const&, WebCore::ScrollAlignment const&) + 181
    frame #4: 0x0000000106b5da64 WebCore`WebCore::FrameView::scrollToAnchor() + 292
    frame #5: 0x0000000106b55832 WebCore`WebCore::FrameView::performPostLayoutTasks()
    ...

The fix was then trivial (remove the # if no parameters are needed), but it would have taken me much longer to find if I had treated the browser as a black box. As a bonus, reading through the WebKit source also introduced me to the “framesniffing” attack. The guards against this attack explained why the Mac app was most affected. There the main frame is loaded using a file:/// URL, and based on WebKit's heuristics it can access any other origin, allowing the anchor scroll request to cross frame/origin boundaries.
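The fix described above amounts to only emitting the fragment separator when there is something to put after it. A minimal sketch, assuming the parameters are passed as a plain object and encoded as a query-style string (the function name `buildFrameUrl` and the encoding are assumptions, not Quip's actual code):

```javascript
// Hypothetical URL builder: append "#..." only when there are parameters,
// so that an empty fragment never triggers WebKit's scroll-to-top rule.
function buildFrameUrl(baseUrl, params) {
  const encoded = new URLSearchParams(params).toString();
  return encoded ? baseUrl + "#" + encoded : baseUrl;
}
```

For example, `buildFrameUrl("frame.html", {})` leaves the URL untouched, while passing `{mode: "edit"}` produces a URL ending in `#mode=edit`.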

Disabling the click delay in UIWebView #

Historically, one of the differences that made hybrid mobile apps feel a bit “off” was that there would be lag when handling taps on UI elements with a straightforward click event handler. Libraries such as Fastclick were created to mitigate this by using raw touch events to immediately trigger the event handlers. Though they worked for basic uses, they added JavaScript execution overhead for touch events, which leads to jank.

More recently, both Chrome on Android and Safari on iOS have removed this limitation for pages that are not scalable. That was the fundamental reason why there was a delay for single taps — there was no way to know if the user was trying to do a double-tap gesture or a single tap, so the browser would have to wait after the first tap to see if another came.

I assumed that this would apply to web views embedded within apps, but I was disappointed to see that Quip's behavior did not improve on iOS 9.3 or 10.0 (we have our own Fastclick-like wrapper for most event handlers, but it didn't apply to checkboxes, and those continued to be laggy). Some more research turned up that the improvement did not apply to UIWebView (the older mechanism for embedding web views in iOS apps — WKWebView is more modern but still has some limitations and thus Quip has not migrated to it).

The WebKit blog post about the improvements included some links to the associated tracking bugs (as previously mentioned, WKWebView is entirely open source, which continues to be nice). Digging into one of the associated commits, it looked like this was a matter of tweaking the interaction between multiple UIGestureRecognizer instances. Normally the one that handles single taps must wait for the one that handles double taps to fail before triggering its action. Since the double tap one takes 350 milliseconds to determine if a tap is followed by another, it needs that long to fail for single taps. The change that Apple made was to disable this second gesture recognizer for non-scalable pages.

UIWebView is not open source, but I reasoned that its implementation must be similar. To verify this, I added a small code snippet to dump all gesture recognizers for its view hierarchy (triggered with [self dumpGestureRecognizers:uiWebView level:0]):

-(void)dumpGestureRecognizers:(UIView *)view level:(int)level {
    NSMutableString *prefix = [NSMutableString new];
    for (int i = 0; i < level; i++) {
        [prefix appendString:@"  "];
    }
    NSLog(@"%@ view: %@", prefix, view);
    if (view.gestureRecognizers.count) {
        NSLog(@"%@ gestureRecognizers", prefix);
        for (UIGestureRecognizer *gestureRecognizer in view.gestureRecognizers) {
            NSLog(@"%@   %@", prefix, gestureRecognizer);
        }
    }
    for (UIView *subview in view.subviews) {
        [self dumpGestureRecognizers:subview level:level + 1];
    }
}

This showed that the UIWebView contains a UIScrollView which in turn contains a UIWebBrowserView. That view has a few gesture recognizers, the most interesting being a UITapGestureRecognizer that requires a single touch and tap and has as the action a _singleTapRecognized selector. Sure enough, it requires the failure of another gesture recognizer that accepts two taps (it has the action set to _doubleTapRecognized, which further makes its purpose clear).

<UITapGestureRecognizer: 0x6180001a72a0; 
    state = Possible; 
    view = <UIWebBrowserView 0x7f844a00aa00>; 
    target= <(action=_singleTapRecognized:, target=<UIWebBrowserView 0x7f844a00aa00>)>; 
    must-fail = {
        <UITapGestureRecognizer: 0x6180001a7d20; 
            state = Possible; 
            view = <UIWebBrowserView 0x7f844a00aa00>; 
            target= <(action=_doubleTapRecognized:, target=<UIWebBrowserView 0x7f844a00aa00>)>; 
            numberOfTapsRequired = 2>,
        <UITapGestureRecognizer: 0x6180001a8180; 
            state = Possible; 
            view = <UIWebBrowserView 0x7f844a00aa00>; 
            target= <(action=_twoFingerDoubleTapRecognized:, target=<UIWebBrowserView 0x7f844a00aa00>)>; 
            numberOfTapsRequired = 2; numberOfTouchesRequired = 2>
    }>

As an experiment, I then added a snippet to disable this double-tap recognizer:

for (UIView* view in webView.scrollView.subviews) {
    if ([view.class.description isEqualToString:@"UIWebBrowserView"]) {
        for (UIGestureRecognizer *gestureRecognizer in view.gestureRecognizers) {
            if ([gestureRecognizer isKindOfClass:UITapGestureRecognizer.class]) {
                UITapGestureRecognizer *tapRecognizer = (UITapGestureRecognizer *) gestureRecognizer;
                if (tapRecognizer.numberOfTapsRequired == 2 && tapRecognizer.numberOfTouchesRequired == 1) {
                    tapRecognizer.enabled = NO;
                    break;
                }
            }
        }
        break;
    }
}

Once I did that, click events were immediately dispatched, with minimal delay. I've created a simple testbed that shows the difference between a regular UIWebView, a WKWebView and a “hacked” UIWebView with the gesture recognizer disabled. Though the WKWebView is still a couple of milliseconds faster, things are much better.

Touch delay in various web views

Note that UIWebBrowserView is a private class, so having a reference to it may lead to App Store rejection. You may want to look for alternative ways to detect the gesture recognizer. Quip has been running with this hack for a couple of months with no ill effects. My only regret is that I didn't think of this sooner; we (and other hybrid apps) could have had lag-free clicks for years.

Perils of Measuring Long Intervals with Performance.now() #

I recently ran into an interesting quirk when using Performance.now() to measure long-ish intervals in Quip's web app. Since it does not seem to be broadly known, I thought I would document it.

To mitigate the possibility of self-induced DDoS attacks, I recently added duplicate request detection to Quip's model layer data loading mechanism. Since we pretty aggressively cache loaded data, repeated requests for the same objects would indicate that something is going wrong. I therefore added a client-side check for the exact same request being issued within 60 seconds of the first occurrence. If it triggered, it would send a diagnostic report to our error tracking system.
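A minimal sketch of such a check is shown below. It is not Quip's implementation: the class name `DuplicateRequestDetector`, the string request key, and the injectable clock are all assumptions made for illustration. The clock is passed in so that the choice of time source stays explicit.

```javascript
const DUPLICATE_WINDOW_MS = 60 * 1000;

// Hypothetical detector: flags a request whose key was first seen less than
// 60 seconds ago according to the supplied clock.
class DuplicateRequestDetector {
  constructor(now) {
    this.now = now; // Function returning the current time in milliseconds.
    this.firstSeen = new Map(); // Request key -> time of first occurrence.
  }

  // Returns true if this request duplicates one inside the window.
  check(requestKey) {
    const timestamp = this.now();
    const previous = this.firstSeen.get(requestKey);
    if (previous !== undefined && timestamp - previous < DUPLICATE_WINDOW_MS) {
      return true;
    }
    // First occurrence, or the window has expired: start a new window.
    this.firstSeen.set(requestKey, timestamp);
    return false;
  }
}
```

A caller would invoke `check` with a stable key for each outgoing load request and send a diagnostic report whenever it returns `true`.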

This turned up some legitimate bugs (e.g. two independent code paths racing to trigger loads) as well as some false positives (e.g. retries of requests that failed should be allowed). After pushing the fixes and tweaks to the detection system, I was left with a few puzzling reports. The report claimed that a duplicate request had occurred within a very short interval, but based on other events it looked like the requests had been several minutes (or even hours) apart.

When I looked at the reports more carefully, I saw that the long time interval was always bracketed by a disconnect and reconnect of the Web Socket that we use for sending real-time updates. I hypothesized that this may have been a laptop whose lid was closed and later re-opened. Taking a look at how I measured elapsed time between requests, I saw that this was computing the delta between two high-resolution timestamps returned by Performance.now(). I was then able to reproduce this scenario locally by comparing wall-clock elapsed time with high resolution elapsed time while my computer was asleep (to see it in action, see this simple test bed). I initially did this in Chrome, but Safari and Firefox seem to have the same behavior.

Performance.now() behavior across sleep intervals

The fix was to switch to using Date.now(), which otherwise worked equally well for this use case. We didn't actually need the high-resolution guarantees of Performance.now(); it was used in the first place because the code already had a high-resolution timestamp in place for measuring load request-to-response time. The same code runs in our desktop app (where load times can be sub-millisecond) and so the high resolution was needed for that use case.
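The comparison between the two clocks can be sketched as follows. This is a simplified stand-in for the test bed rather than its actual code, and `startDualTimer` is a hypothetical name. It records a starting point on both clocks and later reports how much time each one thinks has elapsed.

```javascript
// Hypothetical helper: start measuring with both the wall clock and the
// high-resolution clock, and return a function that reports both deltas.
function startDualTimer() {
  const wallStart = Date.now();
  const highResStart = performance.now();

  return function elapsed() {
    const wallClockMs = Date.now() - wallStart;
    const highResMs = performance.now() - highResStart;
    return {
      wallClockMs: wallClockMs,
      highResMs: highResMs,
      // Grows when the high-resolution clock falls behind the wall clock,
      // for example across a sleep/wake cycle as described above.
      driftMs: wallClockMs - highResMs,
    };
  };
}
```

Calling the returned function after the machine has been asleep for a while is where the two deltas would be expected to diverge, based on the behavior described above.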

I am definitely not the first to run into this; I have found a few off-hand mentions of this behavior. For example, see this Stack Overflow comment and this post on elm-dev. Curiously, neither the currently published version of the time specification nor the latest draft seems to indicate that this may be a possibility. Per the spec, Performance.now() is supposed to return the difference between the time of the call and the time origin, and presumably the origin is fixed.

As to the specifics of why this happens, I spelunked through Chrome's codebase a bit. The Performance.now implementation calls monotonicallyIncreasingTime which uses base::TimeTicks::Now which uses the CLOCK_MONOTONIC POSIX clock. I wasn't able to find any specific gotchas about macOS's implementation of that clock, but Apple does have a tech note that says that “timers are not accurate across a sleep and wake cycle of the hardware” so this is not a surprise at that low level. Within the Chrome project it is also known that base::TimeTicks is unreliable across sleep intervals. Though it's common to think of the browser environment as being very high level and abstracted away from the hardware and operating system, small nuances such as this one do sometimes leak through.