I've been downloading the IAM handwriting dataset so I can see if it's appropriate for my uses. Unfortunately, downloading straight from the site to my computer at Princeton was giving me the dismal rate of 20K/sec, not very good considering I was looking at hundreds of megs worth of data. On a hunch, I tried
dyyme.pair.com (the machine that hosts mscape.com and this site), and there I was seeing 1.1 MB/sec. Downloading from there to my computer was at the rate of 200K/sec, thus still a bottleneck but an order of magnitude better than what I was seeing before.
A bit of
traceroute-ing showed that the path from here to the
unibe.ch server (where IAM is located) went through the Abilene backbone of the Internet2 to the GÉANT "multi-gigabit" network that connects research institutions in Europe. Despite the fanciness of the networks that my connection traversed (or perhaps because of it) I was seeing very dismal transfer rates. On the other hand, the connection from
dyyme.pair.com went over the commercial/private network that Global Crossing has, with much better results. There's also something to be said about routing data manually versus letting the Internet do its thing.
Now that I have a basic version of the SQL query system up*, I'm seeing some rather odd behavior with its performance. Even simple queries like selecting the rows from an 88-record table that have a distance field under a certain threshold take a while to process. Part of this may be due to the conversion to/from strings that I have to do, but that doesn't explain it all. My current hypothesis is that this is due to GLUI taking up an inordinate amount of CPU time even the application idles. I have seen values as high as 80%, and when Thor is at a more reasonable level, WindowServer instead shoots up to 50% plus. After using QuartzDebug to turn on visible screen updates, it became apparent that the control window was being refreshed continuously. Poking through the GLUI source code showed that it always called the
finish_drawing (which does a
glFinish) function, even when the active control didn't have an idle handler. Making that call only happen when it was necessary seemed to help with the spurious refreshes, but Thor was still taking up lots of CPU time (if anything, more than before). Running Sampler on it while idle showed that it spent most of its time blocked in the main loop, waiting for the next event (GLUT seems to be built on top of Cocoa, so this was in
NSApplication::nextEventMatchingMask). This made little sense, but I remembered a trick Szymon had mentioned. Adding a
usleep(10 * 1000) call in my idle function (which GLUI calls after its done with its refreshing) brought the CPU utilization down to around 1%. This is somewhat kludge-ish, but I don't have the time (or the interest really) to delve into GLUT and/or GLUI's event handling, so this will do for now.
Of course, after all that work it turned out that GLUI and the idle function had nothing to do with the poor performance that I was seeing. In retrospect this makes sense, since GLUT appears to be single threaded and thus while the query function is running it was unlikely that the idle one was getting called. Still, fixing this means that it'll be much less annoying to leave Thor running in the background. Running Sampler while the actual query was going on showed that the program was spending most of its time computing the query stroke to basic shape stroke histogram distance. This is because some histograms (e.g. A3) are very dense (almost every bucket is populated) and thus the distance function takes a while. I thought that maybe I could thin out the connections a bit (if the buckets are too far apart there's no point in creating a connection between them), but that didn't really help for the very dense histograms and caused problems for the sparse ones.
I've also tried using a different COIN/OSI solver (OSL instead of CLP) but the FAQ suggests that this is as good as its going to get. I'm currently looking into alternative implementations, such as WNLIB. Unfortunately it seems like WNLIB's build process is a bit antiquated and doesn't work too well on Darwin/Mac OS X out of the box. This will need more attention tomorrow.
* I realize I've never actually written out how my query system works. This is worth doing, if only to have something to reference to (or just copy and paste) in my writeup:
- Basic shapes B1, B2, ..., Bn are loaded (currently just a circle and a square)
- For each basic shape i we compute histogram Hi1, Hi2, ..., Him (currently inner angles, D2, A3, D3)
- When loading stroke data, we compute for each stroke j histograms Hj1, Hj2, ..., Hjm, along with the distances (using the Earth Mover's metric) between Hi1 and Hj1, Hi2 and Hj2, ..., Him and Hjm (each loaded stroke's histogram has n distances, one to each of its counterparts in the basic shapes)
- All of the above (strokes, histograms, distances) are stored in a SQL database. Distances are actually stored as ratios, e.g. if histogram Hj1 had distances of 20, 40, and 140 to basic shape histograms Him (for i from to n) then we should store 0.1, .2 and 0.7 as the distance values.
When a query is made, we go through the following steps:
- For the query stroke, we compute its histograms Hq1, Hq2, ..., Hqm
- For each of those histograms, we compute the distances to their counterparts in the basic shapes, and determine the ratios as described above.
- We do a SQL query, looking for strokes in the dataset whose histograms had similar distance ratios to the basic shapes.
- This is done per histogram type, and then if a certain portion of the histograms agree that a stroke was in the same distance range as the query (currently 3/4), then we include it in the result set.
Currently no ranking is done (though this would be easy since we can look for how close the distance ratios really are), no further refinement is done (since we are now restricted to very few stokes, it should be feasible to do actual distances from the query stroke to the dataset strokes) and we only work at the simple stroke level, as opposed to looking at entire sketches. But it's a start.