Twitter PubSubHubbub Bridge #

During the Twitter DDoS attacks, there was a thread on the Twitter API group about using PubSubHubbub to get low latency notifications from Twitter. This would be an alternative to the streaming API that Twitter already has. The response from a Twitter engineer wasn't all that positive, and it is indeed correct that the streaming API already exists and seems to satisfy most developers' needs.

However, my interest was piqued and I thought it might be a useful exercise to see what Twitter PubSubHubbub support could look like. I therefore decided to write a simple bridge between the streaming API and a PubSubHubbub hub. The basic idea was that there would be a simple streaming client that would in turn publish events to a hub. The basic flow would be:

Twitter PubSubHubbub flow 1
(created using Kushal's Diagrammr)

I'm using FriendFeed as the PubSubHubbub client, but obviously anything else could substitute for it. The "publisher" is where the bulk of the work happens. It uses the statuses/filter streaming API method* to get notified when a user of interest has posted, and then it notifies the reference hub that there is an update. It also has a companion Google App Engine app that serves feeds for Twitter updates. This is both because the hub needs a feed to crawl and because the feed needs to have a <link rel="hub"> element, something which Twitter's own feeds don't have. Unfortunately the publisher itself can't run on App Engine, since the streaming API requires long-lived HTTP connections and App Engine will not let requests execute for more than 30 seconds. I considered using the Task Queue API to create a succession of connections, but that seemed too hacky.
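The notification step itself is small: per the PubSubHubbub spec, a publisher POSTs a form-encoded body with hub.mode=publish and the topic feed's URL to the hub's publish endpoint. A minimal sketch in modern Python (the feed URL below is a placeholder, not the app's real one):

```python
from urllib.parse import urlencode

def build_publish_ping(topic_url):
    """Builds the form-encoded body for a PubSubHubbub publish ping.

    Per the spec, the publisher tells the hub that a topic has new
    content by POSTing hub.mode=publish and hub.url=<topic URL>.
    """
    return urlencode({"hub.mode": "publish", "hub.url": topic_url})

# Hypothetical topic URL served by the companion App Engine feed app.
body = build_publish_ping("http://example.com/feeds/user/12345")
# This body would be POSTed to the hub's publish endpoint with
# Content-Type: application/x-www-form-urlencoded.
```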

In any case, it all seems to work, as this screencast shows:

On the right is the Twitter UI where messages are posted. In the middle is the publisher, which receives these messages and relays them to the hub. On the left is FriendFeed, which gets updates from the hub.

Latency isn't great, and as mentioned in the group thread, Twitter could have to deal with the hub being slow. Part of the reason why latency isn't great is that the hub has to crawl the feed to get at the update, even though the publisher already knows exactly what the update is. This could be fixed by running a custom hub (possibly even by Twitter; see the "hub can be integrated into the publisher's content management system" option), with the flow becoming something like:

Twitter PubSubHubbub flow 2

In the meantime, here's the source to both the publisher and the app.

* This was called the "follow" method until very recently.

Twitter Streaming API from Python #

I'm playing around with Twitter's streaming API for a (personal) project. tweetstream is a simple wrapper for it that seemed handy. Unfortunately it has a known issue: the HTTP library it uses (urllib2) buffers the file object that it creates, which means that responses for low-volume streams (e.g. when using the follow method) are not delivered immediately. The culprit appears to be this line from urllib2.py (in the AbstractHTTPHandler class's do_open method):

fp = socket._fileobject(r, close=True)

socket._fileobject does have a bufsize parameter, and its default value is 8192. Unfortunately the AbstractHTTPHandler doesn't make it easy to override the file object creation. As is pointed out in the bug report, using httplib directly would allow this to be worked around, but that would mean losing all of the 401 response/HTTP Basic Auth handling that urllib2 has.

Instead, while holding my nose, I chose the following monkey patching solution:

import socket
import urllib2

# Wrapper around socket._fileobject that forces the buffer size to be 0
_builtin_socket_fileobject = socket._fileobject
class _NonBufferingFileObject(_builtin_socket_fileobject):
  def __init__(self, sock, mode='rb', bufsize=-1, close=False):
    _builtin_socket_fileobject.__init__(
        self, sock, mode=mode, bufsize=0, close=close)

# Wrapper around urllib2.HTTPHandler that monkey-patches socket._fileobject
# to be a _NonBufferingFileObject so that buffering is not used in the
# response file object
class _NonBufferingHTTPHandler(urllib2.HTTPHandler):
  def do_open(self, http_class, req):
    socket._fileobject = _NonBufferingFileObject
    # urllib2.HTTPHandler is a classic class, so we can't use super()
    resp = urllib2.HTTPHandler.do_open(self, http_class, req)
    socket._fileobject = _builtin_socket_fileobject
    return resp

Then an instance of _NonBufferingHTTPHandler can be passed to tweetstream's urllib2.build_opener() call, where it will replace the built-in HTTPHandler.

Exporting likes from Google Reader #

I started this as another protip comment on this FriendFeed thread about Reader likes but it got kind of long, so here goes:

Reader recently launched liking (and a bunch of other features). One of the nice things about liking is that it's completely public*. It would therefore make sense to be pretty liberal with liking data, and in fact Reader does try to expose liking in our feeds. If you look at my shared items feed you will see a bunch of entries like:

<gr:likingUser>00298835408679692061</gr:likingUser>
<gr:likingUser>11558879684172144796</gr:likingUser>
<gr:likingUser>07538649935038400809</gr:likingUser>
<gr:likingUser>09776139491686191852</gr:likingUser>
<gr:likingUser>02408713980432217881</gr:likingUser>
<gr:likingUser>05429296530037195610</gr:likingUser>

These are the users that have liked. Users are represented by their IDs, which you can use to generate Reader shared page URLs. More interestingly, you can plug these into the Social Graph API to see who these users are.
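Extracting those IDs from a feed is straightforward with any XML parser. A small sketch (assuming the gr prefix maps to Reader's http://www.google.com/schemas/reader/atom/ namespace, and using a trimmed-down sample entry):

```python
import xml.etree.ElementTree as ET

GR_NS = "http://www.google.com/schemas/reader/atom/"

# A trimmed Atom entry like the ones in a Reader shared items feed.
ENTRY_XML = """\
<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:gr="http://www.google.com/schemas/reader/atom/">
  <title>An example shared item</title>
  <gr:likingUser>00298835408679692061</gr:likingUser>
  <gr:likingUser>11558879684172144796</gr:likingUser>
</entry>
"""

def liking_users(entry_xml):
    """Extracts the user IDs from an entry's gr:likingUser elements."""
    entry = ET.fromstring(entry_xml)
    return [el.text for el in entry.findall("{%s}likingUser" % GR_NS)]

print(liking_users(ENTRY_XML))
```

Each extracted ID can then be fed to the Social Graph API lookup described above.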

Liking information isn't just limited to Reader shared item feeds. If you use Reader's view of a feed, for example The Big Picture's, you can see the <gr:likingUser> elements there too. This means that as a publisher you can extract this information and see which of your items Reader users find interesting.

For now, liking information that is included inline in the feed is limited to 100 users, mainly for performance reasons. That number may go up (or down) as we see how this feature is used. However, if you'd like to get at all of the liker information for a specific item, you can plug an item ID into the /reader/api/0/likers API endpoint, and then get at it in either JSON or XML format.

* I've seen some people wondering what the difference between liking, sharing and starring is. To some degree that's up to each user, but one nice thing about liking is that it has less baggage associated with it. We learned that if we try to redefine existing behaviors (like sharing), users get upset.

HTML Color OneBox #

Work is the sort of place that cares about specific colors, so HTML color hex triplets come up in conversation quite often. Neil suggested that this should be a OneBox in search results. It occurred to me that this could be done via the Subscribed Link feature that we offer for search results. It turned out that subscribed links can use gadgets, which meant that an inline preview of colors was even possible. Regular expression matching also meant that I didn't have to list out every color by hand. This page has more information on the OneBox, or you can subscribe directly.

Color OneBox preview

Once you have installed this, you can search for things like #fafafa or #ccc and get an immediate preview (in fact, the # can be omitted).
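The matching itself only needs a simple pattern. Here's a sketch of the kind of regular expression involved (not the actual subscribed link definition), accepting both the shorthand and full forms with an optional #:

```python
import re

# Matches three- or six-digit hex triplets, with an optional leading "#".
HEX_COLOR_RE = re.compile(r"^#?([0-9a-fA-F]{3}|[0-9a-fA-F]{6})$")

def parse_hex_color(query):
    """Returns a normalized six-digit hex triplet, or None if no match."""
    m = HEX_COLOR_RE.match(query.strip())
    if not m:
        return None
    digits = m.group(1).lower()
    if len(digits) == 3:
        # Expand the shorthand form: "ccc" -> "cccccc".
        digits = "".join(c * 2 for c in digits)
    return "#" + digits
```

A single pattern like this covers every color a user might search for, which is why no hand-maintained color list is needed.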

Porting Twitter Digest to Google App Engine #

I've been meaning to play around with Google App Engine for a while, and as a quick project, I decided to port Twitter Digest to it (not as exciting as Kushal's Millidunst Calculator). This looked to be pretty straightforward: the original version was already in Python, and wasn't very complicated (just a single CGI script). It did indeed end up pretty easy; the whole process took a couple of hours.

The first step was to port the script from CGI-style invocation to the App Engine webapp framework. Then I looked into what it would take to get Python Twitter (the library I used for fetching data from Twitter) running. Switching it from urllib2 to urlfetch was pretty painless (though I don't use the posting parts of the API, so I didn't check if those work too). The other part of the library that I was relying on was its caching mechanism (since the digests are daily, there's no point in querying Twitter more often). DeWitt (the library's author) had thoughtfully put the caching functionality into a separate class, so it was easy to replace it with another one that implemented the same interface but was backed by App Engine's datastore.

The result (complete with App Gallery entry) is not that exciting, in the sense that it functions identically to the original. The only issue that I've run into so far is that when there are several cache misses, the URL fetches can take long enough that the request hits App Engine's deadline. However, since the successful fetches are cached, repeating the request will eventually succeed (so if consuming the digest via a feed, this shouldn't be a big deal). Ideally the urlfetch functionality would also support asynchronous fetches, since it would be easy to adapt the code to fetch all user timelines in parallel.

Update on 11/23/2008: Since I've gotten some requests for the modifications to twitter-digest that I made to get it to run on App Engine, here's a patch.

Intern on the Google Reader team #

Having interns has worked out well for the Reader team. Following my blog post, we were very pleased to get Nitin Shantharam and Jason Hall to help us out with Reader development. Their stints on the team resulted in a bunch of features, and Jason is now back at Google working full-time (Nitin wasn't a slacker, he's just still in school).

We're looking for another intern or two this year. Internships generally last a couple of months to twelve weeks, are for full-time students, and would be in Google's Mountain View, California office. You can work on either Reader's backend (a C++ system for crawling millions of feeds, handling lots of items being read, shared, starred or tagged per second) or frontend (Java servers and JavaScript/AJAX-y craziness) depending on your interests and experience.

If you or anyone you know is interested in this internship, contact me at mihaip at google dot com. This page also has more general information about interning at Google.

persistent.coffee #

I've been trying my hand at latte art. Though I have a very long way to go, I've been documenting my efforts, with a hope of learning from my mistakes. Blogger's mobile support makes it pretty easy to collect pictures, and I've finally gotten around to making a decent template for the "blog."

coffee.persistent.info is the result. Technically, this isn't a Blogger template, since I just have some static HTML as the content. Instead, it uses the JSON output that Blogger's GData API supports. Rendering the page in JavaScript allows for more flexibility. I wanted pictures that I liked to take up 4 slots (a layout inspired by TwitterPoster). This imposed additional constraints (in order to prevent overlap between sequential large pictures). The display is generally reverse-chronological starting from the top left, but images are occasionally shuffled around to prevent such overlaps. There is also a bit of interactivity: the pictures are clickable to display larger versions. To help with all this, I've been experimenting with jQuery (also on Mail Trends), and am liking it quite a bit.
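The overlap-avoidance constraint can be sketched as a greedy grid fill. This is a simplification of the idea, not the site's actual JavaScript: favorite pictures claim a 2x2 block of slots, and each item takes the first position where its slots are all free:

```python
def layout(items, columns=4):
    """Greedily places items on a grid of the given width.

    `items` is a list of booleans (True = a favorite picture taking a
    2x2 block of slots, False = a regular 1x1 picture), in roughly
    reverse-chronological order. Returns a dict mapping item index to
    the (row, column) of its top-left slot.
    """
    occupied = set()
    positions = {}
    for index, is_large in enumerate(items):
        size = 2 if is_large else 1
        row = 0
        while True:
            placed = False
            for col in range(columns - size + 1):
                cells = [(row + r, col + c)
                         for r in range(size) for c in range(size)]
                # Only place the item if none of its cells are taken.
                if not any(cell in occupied for cell in cells):
                    occupied.update(cells)
                    positions[index] = (row, col)
                    placed = True
                    break
            if placed:
                break
            row += 1
    return positions
```

Placing later (smaller) items into the gaps left beside a large picture is what makes the display only *generally* reverse-chronological.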