
ourTunes: Search iTunes Shares #

Just as it has happened before, my incredible ability to procrastinate (or, more precisely, to start too many projects) meant that someone else not only had the same idea (not all that surprising, really) but implemented it as well.

libopendaap and Mac OS X: follow-up #

It turned out that my hypothesis as to why libopendaap works on (x86) Linux but not on Mac OS X was correct: it is an endian issue. Thanks to a tip from jfpoole, the problem was traced to authentication/md5.c, specifically lines 47-51, which attempt to pick a byte reversal routine (if any). Upon running the ./configure script, WORDS_BIGENDIAN is #define'd to 1 (line 72). However, line 58 of authentication/md5.c does an #ifndef WORDS_BIGENDIAN test to decide whether the byte reversal routine should be set to a dummy one. By changing this to #if WORDS_BIGENDIAN and rebuilding libopendaap, one is able to successfully connect to servers. This still does not fix the auto-discovery issue (given its non-deterministic behavior, I still think it's a threading issue), so I'll still have to use CFNetServices for that.

On the Java front, One2OhMyGod became orphaned and has since been resurrected as AppleRecords. A new release of the latter incorporates the changes necessary to support connections to iTunes 4.5 clients, so this seems like a viable option as well. However, last I checked, building it was a pain (it requires libraries like JRendezvous, a.k.a. JmDNS, and JavaLayer). I also haven't looked through the source to see how well separated the DAAP connection code is from the GUI and MP3 playback code.

Building and Running libopendaap on Mac OS X #

The API for libopendaap seemed reasonable enough that I thought I could put together a simple crawler for iTunes shares in a few hours. Although this didn't quite happen, the situation still seems promising, as there are two possible approaches that I could take.

The first step is to build the library. The release notes claim that as of version 0.1.3 it should compile fine on Mac OS X. It does, with the caveat that there must be no spaces in the path to the build directory (otherwise the ar step fails; I'm sure some makefile tweaking would fix this, but the build process looked convoluted enough that I didn't want to deal with it). Once this is done, the result is libopendaap.a in the .libs directory (I haven't quite figured out why it's a hidden directory, but it doesn't matter).

libopendaap is very self-contained, in that it also handles the Rendezvous/mDNS discovery of hosts. As a result, using it should be a simple matter of creating a new client with DAAP_Client_Create, getting a list of machines with DAAP_Client_EnumerateHosts, connecting to each one, and then using DAAP_ClientHost_GetDatabases to do the actual crawling. An initial run of code like this did nothing, in that no hosts were returned. A bit of skimming through the library codebase revealed that host discovery is done asynchronously on a different thread. As a temporary hack, I added a sleep(10) call between the client creation and host enumeration steps. This seemed to work, in that machines were being found, but as a whole the program behaved very badly (i.e. non-deterministically, suggesting threading issues, something I wasn't too thrilled to deal with).

After trying for a bit to trace the execution path and figure out what was going on, I decided to punt on the issue and not use libopendaap's discovery at all. For now I can add hosts by hand (using the thoughtfully provided, if hack-labeled, DAAP_Client_AddHost) and eventually use the CFNetServiceBrowser API built into Mac OS X's Core Foundation. This got me slightly further, but now the process failed at the connection stage, receiving a 403 (Forbidden) HTTP status code when requesting the /update?session-id=id&revision-number=1 URL (although the /content-codes and /login requests completed successfully). This seemed awfully familiar: it was at this point that iLeech and other DAAP libraries failed as well.

At this point, I began to doubt that libopendaap worked at all. To validate this, I dug up a VMware VM running Fedora and built the library there. I couldn't get tunesbrowser (a GTK app that uses the library) to build since I didn't have gstreamer installed, so I was reduced to running my own code. Discovery still didn't appear to work (there were no threading issues, but it didn't find any hosts either, despite my using bridged networking for the VM). However, if I specified a host by hand, it would connect successfully. Now that I have this known working case, it should be a matter of getting a few traces with tcpflow from both the Mac OS X and Linux builds, and seeing where the former goes wrong (it may be something as simple as an endian issue - I don't have a LinuxPPC install to verify this).

Alternatively, I can drop the libopendaap approach entirely and try to base my crawler on another DAAP library. I'm not a big fan of the Hungarian notation that libopendaap uses in its codebase, and generally, although the API may be simple-looking, the insides seem grotty (and undocumented) enough that hacking on it to get it running on Mac OS X may be painful. As for what else I'd use, One2OhMyGod is a Java-based DAAP client that is new enough to work with iTunes 4.1 shares (not 4.5 ones, but apparently there is only a 1-byte difference in the hashing that the two use). However, I still haven't looked at its codebase to see whether it'd be any better, and from a "not reinventing the wheel" point of view, using a library that was meant for this (as opposed to hacking some functionality out of an app) is the better way to go.

Under the radar #

At some point after my last experiment with iLeech/OpenDAAP/dapple, libopendaap cropped up. It claims to support the MD5-based authentication scheme that iTunes 4.1+ implements via the Client-DAAP-Validation header. This seems very promising, assuming the author can keep pace with Apple's engineers (though apparently they're not trying too hard, the main change in iTunes 4.5 being a single byte) and, more importantly, stay ahead of their lawyers.

Now, the decision that must be made is whether to use libopendaap in my would-be iTunes crawler directly, or to try and integrate its authentication scheme into the iLeech source base first, and then work from there.

Monkey see, monkey do #

It turns out that GUI Scripting isn't quite the way to go. It relies on the accessibility APIs for its functionality, and iTunes doesn't support them beyond the menus and the scrollbars of the main window. I don't know whether this is intentional, or just a remnant of iTunes' lineage as a Mac OS 9 app (presumably it doesn't use HIViews yet). In any case, GUI Scripting isn't completely useless: I can use the menu bar to disconnect from shares when I'm done with them, and the scrollbars will let me move through the list of shares (though this is hard to test, since I'm not on campus at the moment and thus have access to a grand total of two shared libraries). For the actual connection to each library, I am reduced to resizing the iTunes window to a known size, putting it in a known location, and then programmatically clicking in various places (using XTool for this). It seems to work pretty well so far, though I'm not sure how I'll handle password-protected shares yet.

Also wrote a small CLI program to count how many iTunes shares are out there (the alternatives were to take a screen capture and analyze it programmatically, or to query the scrollbar (if any), but those seemed more brittle). CFNetServices (the non-Cocoa way of getting at Rendezvous functionality) is pretty neat and easy to use.

P.S. The title is in reference to Apple's testing methodology for GUI software.

Good news and bad news #

Good news: I've finally gotten around to using the Jakarta Commons Project's HttpClient user-agent, and it seems to do persistent connections properly (e.g. the succession of /server-info, /login, and /update URL requests are done on the same socket).

Bad news: The thus-modified iLeech still can't connect to an iTunes 4.2 server. There's an HTTP header, Client-DAAP-Validation, followed by 32 hex digits, which looks suspiciously like an MD5 hash. Unfortunately, I don't know what they're hashing, and no one else on the net seems to have figured it out either.

Good news: Inspired by MyTunes, I've come up with a different approach. Script iTunes (using the new GUI Scripting add-ons) to sequentially connect to all of the shares. Use tcpflow or its ilk to capture the traffic. Use the existing DAAP parsing library to extract the music library contents for each client, and then dump all of the results into a database. Since this uses iTunes for the actual crawling, there's no danger of Apple breaking the system by changing the API.

iLeech + Eclipse = almost there #

In my quest to find a working DAAP implementation, I decided to give the Java version of iLeech a try. I knew that it wouldn't work out of the box, but I figured that Java's networking classes may be more useful/robust than Perl's.

To actually build iLeech, I decided to give Eclipse a try (iLeech uses an extensive package hierarchy that I didn't feel like setting up by hand). It's pretty nice, at least on XP; I've yet to try the OS X build.

Once I got iLeech to build, I first tweaked a couple of things (default to "localhost" for the host field, hook up the return key to the connect button, make the "Exit" menu item work) to familiarize myself with both the IDE and the code. Then I tried to enable HTTP 1.1 connection reuse, since its absence is what seems to prevent current DAAP implementations from connecting to iTunes 4.1. There's supposedly a Java system property, http.keepalive, which if set to true will take care of things, but it doesn't seem to work. The current plan is to use the Jakarta Commons HttpClient (part of the Apache project), which seems to be more robust.

Oh, and for the longest time I couldn't connect to beria with iLeech at all, but once I did a traceroute I realized I was past the TTL-of-2 limit that iTunes imposes.

dap. daap. daaap. #

Considering resurrecting pTunes as a search engine for the campus iTunes Rendezvous sharing community (the number of people sharing seems to have doubled since I last checked).

Sounds easy enough, given the existence of OpenDAAP, right? Well, I can't seem to get any of the Perl-based tools to work (dapple and a stand-alone script that I found). The Java tools aren't having much luck either.

Hacker interest seems to have dropped since the initial iTunes 4 release in April. I've seen a few references to 4.0.1 requiring a bumped version number, and the necessity of keeping the HTTP/1.1 connection alive (LWP seems to be ignoring my request for this), but then others claim that existing tools should just work as is with 4.1.

Boo.

Progress Report #

  • Can now have multiple matches in the ASIN search for pretty names in the recommendations list (see recommendations for Essential Mix, esp. MMII), and in those cases replace the artist name with "Various Artists"
  • Added "original soundtrack" to normalization list
  • "Varios" is way of saying "Various Artists" too (see "Woman on Top" soundtrack)
  • Don't normalize capitalization for words of three letters or fewer (e.g. ATB/ATC/BT)
  • Normalized special chars better (see the two bjorks)
  • Comma support for normalization (e.g. "Cash, Johnny" vs. "Johnny Cash")
  • Don't specify cover image size anymore, due to variations (see "Politics of Dancing")

Progress Report #

  • Changed Amazon importing code to work better with compilations (don't specify artist in search if album has more than 4 artists) and also do a fuzzy string match to make sure that we get the right thing back (no more 'Blank & Jones' matching 'Grosse Point Blanke')

Progress Report #

  • Redirect Netscape 4.x to upgrade page
  • Added stripping of parentheses from titles/names when they enclose the entire string (e.g. "(aerosmith)")

Progress Report #

  • Got Amazon web services kit & token
  • Added selective info bar for testing
  • Wrote XSLT to get only album covers back from Amazon
  • Did initial overnight cover crawling

Progress Report #

  • Switching genre now clears out the album list
  • Randomized ordering of user popup
  • Turned title into download link
  • Removed non-alpha numeric characters from query terms
  • Added links for top 10 songs
  • Made searches with zero results return faster (don't do a LIKE search if the terms look reasonable: no special chars, length > 5)
  • Zero-result searches now print a message saying so
  • Imported March FreeDB, will evaluate improvements
  • Added selecting all albums for a genre (optimized with STRAIGHT_JOIN)

Progress Report #

  • Fixed searching for stuff like "you said" and "zero 7" and other multi-word queries where MATCH...AGAINST didn't work
  • Made sure the "&x=0&y=0" addition to the search string (done by mozilla) doesn't mess things up
  • Made stats display

Progress Report #

  • Use gray for unfocused frame hilight colors
  • Speedup searches by creating fulltext index
  • Fixed duplicates, see Bachelor Number One
  • Make artist/album name pretty version choosing based on popularity, not just first item returned
  • Fixed Experimental Noise -> Black Rebel Motorcycle Club -> Unknown Album error
  • '<Album name> single' now strips out 'single' when normalizing

Genre Work #

Added hiding for unused canonical genres (though it's not perfect yet, e.g. Salsa's parent is hidden but it shows up indented twice) and fixed the JavaScript so that genres with quotes in them like "Drum 'n' Bass" work now.

Should probably group Rock, Modern Rock and Punk Rock under the same category (and other things like that, such as merging the two Ska's) but I'm scared of messing up the genre mappings, and there's no way in hell I'm redoing all 2356 of them.

Ego Boosting++ #

Yet more ego boosting for pTunes. Yesterday for Prohibition Night we sat next to some sophomores, and when we introduced ourselves, one kid was like "Mihai...pTunes?" Made my night, even if it turned out to be the same person who mailed me the other day. Also got a mail today about pTunes, from another kid saying that he liked it and also wanting access to the source code so that he could hack up a movie equivalent.

Mac Integration #

Got a mail from a Princeton kid (the first!) about pTunes file links not working on Macs, and realized that this is something I'm gonna have to deal with, if only for my own sake. For now I just sniff Mac browsers and spit out instructions on how to connect to an SMB share and then navigate to the right directory. Eventually, though, I'd like to have a small helper app that registers itself as handling the smb protocol, and then takes care of the mounting and navigation.

IW Options #

Talked with Cook about turning pTunes into an independent work project and using his PhD student's classification code to help with genre determination. He seemed somewhat interested, but in the end decided that it wasn't gonna happen because:
1. The classification code is basically done; it's unlikely that I could improve on something George has been working on for 5 years. What remains is more database and metadata work, which isn't really his focus.
2. He's busy anyway with three classes and lots of other kids to advise.
Then I went to see Kernighan and pitched pTunes from a web services angle (i.e. using other web services like freedb and Amazon to get extra information, and in turn exposing my dataset as a web service). He seemed pretty interested, and very willing to serve as my advisor. This is probably the route that I'll choose, because adding an extra project (e.g. if I were to do the projector/whiteboard stuff with Szymon) to my workload would mean that something has to give.

Genre Mappings #

So there are 5097 distinct genre strings in the database. Started by first mapping the more manageable 500 or so that have turned up in ID3 tags. Made a reasonably pretty web interface for this to make my life easier. It seems to work well enough, with reasonable performance. Later on I added about half of the freedb genres (bringing it up to 2356 mappings), with the net result being that 38% of artists are still uncategorized. I seem to have hit the point of diminishing returns, i.e. the last 1,000 mappings that I added only resulted in the categorization of 150 (out of 14,000) artists. I will probably finish up the mappings anyway, but I should look into obtaining even more datasets (Amazon?) and/or cleaning up the artist list (e.g. collapsing "A with B", "A vs. B" and so on). I should also make genre estimation merge more generic estimations with more specific ones, though I have to figure out the risk of one (or a few) bad specific mappings screwing things up.