Twitter PubSubHubbub Bridge #

During the Twitter DDoS attacks, there was a thread on the Twitter API group about using PubSubHubbub to get low latency notifications from Twitter. This would be an alternative to the streaming API that Twitter already has. The response from a Twitter engineer wasn't all that positive, and it is indeed correct that the streaming API already exists and seems to satisfy most developers' needs.

However, my interest was piqued and I thought it might be a useful exercise to see what Twitter PubSubHubbub support could look like. I therefore decided to write a simple bridge between the streaming API and a PubSubHubbub hub: a small streaming client that would in turn publish events to a hub. The basic flow would be:

Twitter PubSubHubbub flow 1
(created using Kushal's Diagrammr)

I'm using FriendFeed as the PubSubHubbub client, but obviously anything else could substitute for it. The "publisher" is where the bulk of the work happens. It uses the statuses/filter streaming API method* to get notified when a user of interest has posted, and then it notifies the reference hub that there is an update. It also has a companion Google App Engine app that serves feeds for Twitter updates. This is both because the hub needs a feed to crawl and because the feed needs to have a <link rel="hub"> element, something which Twitter's own feeds don't have. Unfortunately the publisher itself can't run on App Engine, since the streaming API requires long-lived HTTP connections and App Engine will not let requests execute for more than 30 seconds. I considered using the task queue API to create a succession of connections, but that seemed too hacky.
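Concretely, the publisher boils down to something like the following sketch (the URLs, credentials and user ID are placeholders, and this is a simplification; the real source is linked at the end of the post):

import base64
import urllib
import urllib2

# Placeholder values -- substitute real credentials, the ID of the user to
# follow, and the App Engine-hosted feed that the hub will crawl.
STREAM_URL = 'http://stream.twitter.com/1/statuses/filter.json'
HUB_URL = 'http://pubsubhubbub.appspot.com/publish'
FEED_URL = 'http://example.appspot.com/feeds/user-of-interest'
USERNAME, PASSWORD = 'user', 'password'
FOLLOW_USER_ID = '12345'

def notify_hub():
  # PubSubHubbub publish ping: tell the hub which feed has changed; the hub
  # then crawls that feed to pick up the new entry.
  urllib2.urlopen(HUB_URL, urllib.urlencode(
      {'hub.mode': 'publish', 'hub.url': FEED_URL}))

def run():
  # statuses/filter keeps the HTTP connection open and emits one JSON-encoded
  # status per line as matching updates are posted. Note that urllib2 buffers
  # the response, which can delay lines on low-volume streams (see the next
  # post for a workaround).
  request = urllib2.Request(
      STREAM_URL, urllib.urlencode({'follow': FOLLOW_USER_ID}))
  request.add_header('Authorization', 'Basic ' + base64.b64encode(
      '%s:%s' % (USERNAME, PASSWORD)))
  stream = urllib2.urlopen(request)
  while True:
    line = stream.readline()
    if not line:
      break
    if line.strip():
      notify_hub()

if __name__ == '__main__':
  run()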

In any case, it all seems to work, as this screencast shows:

On the right is the Twitter UI where messages are posted. In the middle is the publisher, which receives these messages and relays them to the hub. On the left is FriendFeed, which gets updates from the hub.

Latency isn't great, and as mentioned in the group thread, Twitter could have to deal with the hub being slow. Part of the reason latency isn't great is that the hub has to crawl the feed to get at the update, even though the publisher already knows exactly what the update is. This could be fixed by running a custom hub (possibly even by Twitter; see the "hub can be integrated into the publisher's content management system" option), with the flow becoming something like:

Twitter PubSubHubbub flow 2
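With a custom hub, the publish ping could carry the new entry itself (a "fat ping"), so the hub wouldn't have to crawl anything. Purely as a hypothetical illustration (this endpoint and its parameters are made up, not part of the PubSubHubbub spec):

import urllib
import urllib2

def notify_custom_hub(entry_xml):
  # Hypothetical fat ping: hand the custom hub the already-rendered Atom
  # <entry> directly, instead of making it re-crawl the feed.
  urllib2.urlopen('http://hub.example.com/publish', urllib.urlencode({
      'hub.url': 'http://example.appspot.com/feeds/user-of-interest',
      'content': entry_xml,
  }))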

In the meantime, here's the source to both the publisher and the app.

* This was called the "follow" method until very recently.

Twitter Streaming API from Python #

I'm playing around with Twitter's streaming API for a (personal) project. tweetstream is a simple wrapper for it that seemed handy. Unfortunately it has a known issue: the HTTP library it uses (urllib2) buffers the file object it creates, which means that responses for low-volume streams (e.g. when using the follow method) are not delivered immediately. The culprit appears to be this line from urllib2.py (in the AbstractHTTPHandler class's do_open method):

fp = socket._fileobject(r, close=True)

socket._fileobject does have a bufsize parameter, and its default value is 8192. Unfortunately the AbstractHTTPHandler doesn't make it easy to override the file object creation. As is pointed out in the bug report, using httplib directly would allow this to be worked around, but that would mean losing all of the 401 response/HTTP Basic Auth handling that urllib2 has.
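For reference, the httplib route would look something like the sketch below (host, path and credentials are just placeholders). The response it yields is unbuffered, but the Authorization header has to be assembled by hand and there's no handling of 401 challenges, redirects, and so on:

import base64
import httplib

# Talk to the streaming API through httplib directly; the response's file
# object is unbuffered, but all of urllib2's niceties are lost.
conn = httplib.HTTPConnection('stream.twitter.com')
conn.request('POST', '/1/statuses/filter.json', 'follow=12345',
             {'Authorization': 'Basic ' + base64.b64encode('user:password'),
              'Content-Type': 'application/x-www-form-urlencoded'})
response = conn.getresponse()
while True:
  data = response.read(1)  # returns as soon as a byte is available
  if not data:
    break
  # ...accumulate data into lines and parse them...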

Instead, while holding my nose, I chose the following monkey patching solution:

import socket
import urllib2

# Wrapper around socket._fileobject that forces the buffer size to be 0
_builtin_socket_fileobject = socket._fileobject
class _NonBufferingFileObject(_builtin_socket_fileobject):
  def __init__(self, sock, mode='rb', bufsize=-1, close=False):
    _builtin_socket_fileobject.__init__(
        self, sock, mode=mode, bufsize=0, close=close)

# Wrapper around urllib2.HTTPHandler that monkey-patches socket._fileobject
# to be a _NonBufferingFileObject so that buffering is not used in the response
# file object
class _NonBufferingHTTPHandler(urllib2.HTTPHandler):
  def do_open(self, http_class, req):
    socket._fileobject = _NonBufferingFileObject
    # urllib2.HTTPHandler is a classic class, so we can't use super()
    resp = urllib2.HTTPHandler.do_open(self, http_class, req)
    socket._fileobject = _builtin_socket_fileobject
    return resp

Then, in tweetstream's urllib2.build_opener() call, an instance of _NonBufferingHTTPHandler can be added as a parameter, and it will replace the built-in HTTPHandler.
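For example, outside of tweetstream an opener could be put together like this (the URL and credentials are placeholders, not tweetstream's actual code):

import urllib2

# Build an opener that uses the non-buffering handler plus basic auth, then
# read the stream line by line; lines show up as soon as Twitter sends them.
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'http://stream.twitter.com/', 'user', 'password')
opener = urllib2.build_opener(
    _NonBufferingHTTPHandler(),
    urllib2.HTTPBasicAuthHandler(password_mgr))
stream = opener.open('http://stream.twitter.com/1/statuses/filter.json',
                     'follow=12345')
while True:
  line = stream.readline()
  if not line:
    break
  print line.strip()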