I've been a fan of Overheard in New York for a while. At some point, it occurred to me that each quote has a pretty precise location attached to it and that it would be cool to plot all of them on a map. I eventually got motivated enough to actually do it. The result is overplot, a Google Maps API-powered visualization of all of the quotes I could get my hands on.
The most basic issue with implementing this is geocoding all of the location strings (like "Canal & Broadway") to a latitude/longitude pair. When I began this, the best option seemed to be geocoder.us. While it was good enough for prototyping, it quickly became apparent that the quality of data was not good enough. The most severe problem seemed to be that addresses in the southern part of Manhattan were off by half a block to the northwest. Thankfully, a few weeks after I began working on this, Google added a geocoding component to their API. Their geocoder turned out to be much faster and reliable. It is not perfect, but since the set of addresses is pretty tightly constrained, I was able to add some rewriting rules to make the input more easily parsed. As of right now, 54% of the addresses are geocoded. One remaining issue is that some the locations are actually business (like "Saks Fifth Avenue"). I may be able to use the Google AJAX Search API to do local searches for them and get their actual addresses.
I didn't want to directly scrape the HTML of the site to extract all of the quotes. I ended up using the data stored in Google Reader's archive of the site's feed. This allowed me to get at the quotes themselves more easily, without having to worry about the chrome of the site. Going back to October 6, 2005, I had in hand 5,578 quotes at 2838 locations.
Since I had accumulated so many quotes, indicating them with the standard marker object had performance problems since so many can be visible in the same area. This is a well-known problem with the version 2 of the Maps API, and the traditional solution is clustering. However, that doesn't really make sense because when zoomed in at the street level, one has no choice but to display the hundred or so markers that are visible at the same time (the southern part of Manhattan is particularly densely covered).
In the end, my solution was to do my own marker implementation. Instead of each marker being its own overlay, I put all of them in the same overlay (see the
QuotesOverlay class). Additionally, I did not split each marker into several layers (shadow, image, click area) - having the shadow be part of the image works well enough. In order to deal with clicks while in the shadow of an info window, I added a global click handler that checks if the click event falls in the boundaries of a marker (see the
handleMapClick). Finally, I hide and show markers so that only those bounded by the currently viewed area are displayed. All of this led to a performance improvement by a factor of 8 on Firefox Mac.
I then ran into a user interface problem. Even with marker rendering being efficient, when zoomed out the island of Manhattan becomes a sea of red, which is not very useful. I decided that switching to a neighborhood mode made more sense there. The problem was where to get the neighborhood boundaries. The map used in cabs seemed like a good start, but it made some interesting choices (e.g. TriBeCa was not bounded by Canal St.). New York Magazine's map also turned out to be of dubious quality (e.g. it combined Gramercy and Murray Hill). In the end, the best resource turned out to be the Wikipedia's list. Even there there is some debate (an mild edit war on the souther border of Spanish Harlem - 86th or 96th street - appears to be going on). I'm not entirely satisfied with the current divisions, but they generally make sense.
Once I actually plotted the 28 neighborhoods on the map, I noticed that it was slow to redraw, especially in Firefox. I was using
GPolylines to draw the neighborhood outlines, one per area. Possily due to the switch to SVG for polyline rendering, this seemed to have high overheard. In the end, similarly to my solution to the marker performance problem, I switched to a single overlay for all neighborhood outlines. Inside this overlay I used a canvas object to do my own rendering. To deal with MSIE's lack of a canvas implementation, I used the ExplorerCanvas implementation of the interface on top of MSIE's VML API.
The site loads all of the data upfront (in a 728K almost-but-not-quite JSON file). This allowed me to do a simple client-side search that works pretty quickly once all of the quotes are tokenized (done the first time a search is executed).
Update on 12/17/2012: The code, including the scraping/generation tools is now available on GitHub.