Twitter Streaming API from Python

I'm playing around with Twitter's streaming API for a (personal) project. tweetstream is a simple wrapper for it that seemed handy. Unfortunately, it has a known issue: the HTTP library it uses (urllib2) buffers reads in the file object that it creates, so responses for low-volume streams (e.g. when using the follow method) are not delivered immediately. The culprit appears to be this line from urllib2.py (in the AbstractHTTPHandler class's do_open method):

fp = socket._fileobject(r, close=True)

socket._fileobject does have a bufsize parameter, but when it is omitted (as it is here) the buffer size falls back to the class default of 8192 bytes. Unfortunately, AbstractHTTPHandler doesn't make it easy to override the file object creation. As is pointed out in the bug report, using httplib directly would work around this, but that would mean losing all of the 401 response/HTTP Basic Auth handling that urllib2 has.
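To illustrate what the buffer size controls, here is a minimal sketch that uses a local socket pair as a stand-in for a quiet stream. The socket pair, the variable names, and the JSON payload are assumptions made for illustration; none of them come from tweetstream or urllib2. It goes through the public makefile() method rather than touching socket._fileobject directly.

```python
import socket

# Hypothetical stand-in for a low-volume stream: a connected local socket pair.
writer_sock, reader_sock = socket.socketpair()

# The second positional argument is the buffer size; 0 asks for an
# unbuffered file object.
stream = reader_sock.makefile('rb', 0)

# Send one short line, far smaller than 8192 bytes, as a quiet stream would.
writer_sock.sendall(b'{"text": "hello"}\r\n')

# With no buffering, readline() hands the line back once the newline arrives.
line = stream.readline()
print(repr(line))

stream.close()
reader_sock.close()
writer_sock.close()
```

The handler in urllib2 gives no such knob for the buffer size, which is the gap the workaround has to fill.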

Rather than give up urllib2, and while holding my nose, I chose the following monkey-patching solution:

import socket
import urllib2

# Wrapper around socket._fileobject that forces the buffer size to be 0
_builtin_socket_fileobject = socket._fileobject
class _NonBufferingFileObject(_builtin_socket_fileobject):
  def __init__(self, sock, mode='rb', bufsize=-1, close=False):
    # Ignore the requested bufsize and always pass 0 (unbuffered)
    _builtin_socket_fileobject.__init__(
        self, sock, mode=mode, bufsize=0, close=close)

# Wrapper around urllib2.HTTPHandler that monkey-patches socket._fileobject
# to be a _NonBufferingFileObject so that buffering is not used in the response
# file object
class _NonBufferingHTTPHandler(urllib2.HTTPHandler):
  def do_open(self, http_class, req):
    socket._fileobject = _NonBufferingFileObject
    try:
      # urllib2.HTTPHandler is a classic class, so we can't use super()
      return urllib2.HTTPHandler.do_open(self, http_class, req)
    finally:
      # Restore the original class even if do_open raises
      socket._fileobject = _builtin_socket_fileobject

Then, in tweetstream's urllib2.build_opener() call, an instance of _NonBufferingHTTPHandler can be passed as an argument, and it will replace the built-in HTTPHandler.

5 Comments

Hey Mihai! My preference for doing HTTP hacking in Python these days is Joe Gregorio's httplib2: http://code.google.com/p/httplib2/ It's got quirks of its own, but it works pretty well and it's simple to understand what it's doing. Haven't tried it with the Twitter stream, though.
Another option is to set the default_bufsize class variable after importing socket: socket._fileobject.default_bufsize = 0
@id Thanks, much cleaner.
Thanks, exactly what I was looking for.

And you can set the socket to be line buffered with socket._fileobject.default_bufsize = 1. I'm not sure how significant the performance difference really is though.
@Eric with the default buffer size and a low volume stream you might see delays of several seconds (or even minutes) before your code gets handed the streamed tweets.
