Marketing in Action #

I happened to glance at the back of the DVD cover for my copy of Trainspotting and was very amused to read the following:

The motion picture sensation that wowed critics and audiences nationwide, Trainspotting delivers a wild mix of rebellious action and wicked humor! It's the story of four friends as they try to make it in the world on their own terms...and who end up planning the ultimate scam! Powered by an outstanding cast of rising young stars (including Emma's Ewan McGregor) and a high-energy soundtrack, Trainspotting is spectacular, groundbreaking entertainment.

Compare this to the much more accurate (if somewhat pretentious) plot summary from IMDB:

A wild, freeform, Rabelaisian trip through the darkest recesses of Edinburgh low-life, focusing on Mark Renton and his attempt to give up his heroin habit, and how the latter affects his relationship with family and friends: Sean Connery wannabe Sick Boy, dimbulb Spud, psycho Begbie, 14-year-old girlfriend Diane, and clean-cut athlete Tommy, who's never touched drugs but can't help being curious about them...

Autosaving Form Data #

Due to my penchant for beta browsers and/or clumsy fingers, I have on more than one occasion lost text I had been laboring over in a form. Given that the modus operandi of most blogging and wiki applications is for the user to dump his/her thoughts in a <textarea>, this is presumably getting more and more common. The usual solution to this is to do all input in a more capable application or, at the very least, to periodically do a select-all-copy on your text. Unfortunately this is all rather tedious and clumsy. OmniWeb is the only browser that attempts to do something about this, but I happen to use Firefox most of the day, so that doesn't help me.

A Firefox extension would be one way to approach this, but I believe I have come up with a more browser-agnostic solution. Simply drag these two bookmarklets to your toolbar:

Autosave and Load

Clicking the first favelet will save all form data in the currently visible page (and continue to do so automatically every 30 seconds). Conversely, the second one will restore saved data for the current page.

It sounds perfect, and it almost is. The primary caveat is that cookies are used for storage, therefore we are limited to ~4K. If you're going to be writing your NaNoWriMo opus in a <textarea>, you may need to look elsewhere. One would think that segmenting the form data into 4K chunks would work, but unfortunately most web servers (Apache included) refuse to accept Cookie request headers longer than 4K, even if they are made up of multiple cookies. Solutions to this (besides the obvious workaround of not using cookies to begin with) are welcome.
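The actual favelet does a fair amount more bookkeeping, but the core of the cookie-based approach looks roughly like this (the cookie name and the restriction to <textarea>s are simplifications for the sake of illustration):

// Sketch only: serialize every <textarea> into a single cookie keyed by the
// page's path, and re-save every 30 seconds.
function autosaveForms()
{
  var fields = document.getElementsByTagName('textarea');
  var data = [];
  for (var i = 0; i < fields.length; i++)
    data.push(i + '=' + encodeURIComponent(fields[i].value));
  // everything must fit in one cookie, hence the ~4K ceiling mentioned above
  document.cookie = 'autosave_' + escape(location.pathname) + '=' + data.join('&');
}

autosaveForms();
window.setInterval(autosaveForms, 30000);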

"Autosave" must be invoked at least once per editing session, which is also not quite ideal. If you are able to edit the HTML behind that page's form, you can insert the following snippet into its source and have autosaving happen at loading time:

<script type="text/javascript">PAS_mode='save'</script>
<script src="http://persistent.info/autosave/favelet.js" type="text/javascript"></script>

This is a hosted favelet, and thus should automatically update itself if/when I find bugs (or work around the cookie length issue). This also means that it will not work in Safari (as of version 1.2). Mozilla and MSIE behave well, though getting it to operate in both took some work. For example, both browsers change JavaScript behavior based on DOCTYPE and/or MIME type. Fun stuff.

Transitioning to del.icio.us #

I've noticed that I'm not really an early adopter, even though I think of myself as one. The most I can aspire to is that I'm aware of (trendy) things pretty soon after they're introduced, but using them is another matter altogether. For example, I was aware of RSS in its initial My Netscape incarnation, but I didn't start using it (via an aggregator) until mid 2002. Similarly, fancy (and clever) JavaScript-based UIs have been possible for a while, but I only really got into it this year (early experimentation notwithstanding). It took me a year to finally see the light with Bloglines. Movable Type? A year and a half. The list goes on.

As further proof of my laggardness, only recently (i.e. today) did I make the switch to del.icio.us, the social bookmarking site that all the cool kids are using. My previous solution to this problem was a "To-do" folder in my bookmark toolbar, to which I would add items that I would (much more rarely) revisit and/or delete. del.icio.us may not change that behavioral pattern, but at least now I can ignore my "oh, this is interesting" links while safely knowing that they are all tagged and ready for searching at some later date. And to be safe, it all gets backed up nightly.

Given my recent progress, expect me to use Flickr in about six months.

"We'll make it up on volume" #

I've heard of the razor-and-blades business model, and obviously some companies have gone beyond that (see wireless carriers giving away free cellphones). However, leave it to the razor industry to take things to the next level: physical product spam!

Free Quattro Razor Sample

This showed up (unsolicited) in my mailbox. I suppose it achieved its purpose, since I was compelled to try it. However, I think I'm sticking with my Mach 3. Nice try.

DIY Large Panorama Prints #

Panorama Print Photo

Although the hacks on this site have mostly focused on software, they do occasionally have a real world component. For example, I've always had a fascination with panoramic photos. Unfortunately, while it's been getting easier and easier to make your own, getting them printed is still a hassle. Ofoto allows you to order 20 x 30 prints for $23 a pop, but even that is not big enough when you consider the elongated aspect ratio of most panoramas.

However, if one is a bit more creative with the framing, large scale prints can still be achieved. For example, IKEA sells some square, 20 x 20 frames that would be a perfect fit for the above Ofoto prints. The frame's color doesn't really matter, since the photo will cover the entire surface area; what does matter is that the frames come with a glass plate that can be used to sandwich the photo and some nearly-invisible clips to hold everything together. The fact that they are $10 each doesn't hurt either.

Panorama

In my particular case, I wanted to use a panorama I had made at Niagara Falls. The original was made up of five 6 megapixel photos taken with my Digital Rebel and stitched together with Canon's surprisingly decent bundled software. The panorama was split into four square sections, padding was added to let each piece have a 2 x 3 aspect ratio and then the prints were ordered (Ofoto currently has a 25% discount promotion, thus making the high cost easier to swallow). When doing such large prints, having a high resolution file matters - in this case I started with a 7000 x 1750 image. Another factor relevant at this size is the noise level, so shooting with a digital SLR (or another camera with a large sensor) on a low ISO setting like 100 is key.

Cutting up the prints and attaching them is not rocket science, but I did notice that the measurements that IKEA gives for the frames (50 x 50 cm or 19 ¾ " x 19 ¾ ") are not absolutely precise, and so my prints ended up being slightly smaller than the frame area. Being a bit more generous (e.g. using the full 20 inch height of the print) may be a good idea. The end result is much more satisfying than pre-made prints I have ordered, both from a scale perspective as well as a personalization one.

Automatically Generated Blogroll from Bloglines #

I previously described how I automatically generate my blogroll by exporting NetNewsWire's subscriptions (with the hierarchy intact), generating a snippet of HTML and then embedding it into a MovableType template that's rebuilt automatically (I haven't upgraded to 3.1 yet, thus its support for dynamic pages won't help me). In the meantime, circumstances have changed (I don't use the same computer throughout the day, or even the same platform), which have compelled me to switch to the equally great Bloglines as my aggregator. As a result, I now needed a new way to share my subscriptions.

Thoughtfully enough, Bloglines does provide a way to generate a blogroll, but it has a couple of limitations. The generated HTML is a bit clumsy, not using <ul> or <li> tags for what is clearly a list of data, and also being a bit too verbose by applying a blogrollitem class to every single item — perhaps I have drunk the semantic markup Kool Aid too deeply. But more importantly, the list only includes the website URLs for the sites, and makes no references to their feeds. A better solution is to use the Bloglines Sync API since the OPML file it exports has both the folder hierarchy and the feed URLs. I initially considered just writing an XSLT to turn the OPML into the HTML that I needed, but in the end the WebService::Bloglines Perl module turned out to be more convenient.

The net result is bloglinesBlogroll.pl, a simple script that does all of the above. It is invoked every night via a cron job, and its output is visible in the Daily Reads section of the sidebar.

Squatters of the world, unite! #

Skinning Gmail with a Custom Stylesheet #

Short Version

  • Install the URLid Mozilla/Firefox extension.
  • Download this CSS file.
  • Locate your profile folder and the chrome folder within that.
  • Copy the downloaded CSS file to the chrome folder and rename it to userContent.css (if you already have such a file, you will have to merge the two).
  • Restart Firefox.
  • Visit Gmail.

Gmail meets The CSS Zen Garden

Gmail Skin

One advantage of getting on the CSS/XHTML bandwagon (i.e. not just standards-compliant design, but truly separating markup from presentation) is that different style sheets can be swapped in, totally modifying the appearance of a site, without requiring any tag-level changes at all. This has been demonstrated to great effect in the CSS Zen Garden, and can be seen from time to time in the form of style sheet switchers on sites. However, even if a site's designer doesn't deign to provide alternate stylesheets, most modern browsers support client stylesheets that can override the appearance. Most of the time these are used for little tweaks such as hiding ads from common providers or disabling link underlining in browsers that do not specifically have such a feature (Safari).

However, there is nothing stopping one from going all out with client stylesheets and completely revamping the appearance of a site to suit one's aesthetic sense (or lack thereof). To this effect, I have made a stylesheet that is designed to override the appearance of Gmail, not necessarily because I think my take on the UI is that much better, but because I wanted to show that it could be done.

One traditional limitation of client stylesheets is that they are applied to all sites indiscriminately, making it difficult to target specific ones. Mozilla 1.8a3 and later do in fact support per-domain CSS rules, but this doesn't help us with my browser of choice, Firefox (starting with version 0.9, Firefox is on a separate branch for stability reasons, and changes such as this one won't be picked up until version 1.1). However we can use the URLid extension, which provides very similar functionality (it assigns the domain of each site as the id attribute of the current page's <body> tag; since all other nodes are children of it, it is easy to restrict rules on a site basis).

The stylesheet itself is nothing special, just a basic makeover with a different color scheme. Its one trick is to leverage Gecko's (decent) support for the :hover pseudo-class and the content attribute to allow a bit more information to be shown about each message when it is moused over in the mailbox view.

When looking at the CSS file, one may wonder how I was able to decipher all of the cryptic two/three letter CSS class names that Gmail uses. I took a two-pronged approach:

First, I wanted to get a dump of the HTML that Gmail generates, so that I could get a feel for its structure (remember, Gmail generates nearly all its HTML through Javascript, thus a "View Source" command will not reveal much). The DOM Inspector showed that Gmail relies on a few IFRAMES for its different views. Once I had figured out which IFRAME contained the mailbox view, I exported its generated source to the clipboard using a JavaScript one-liner invoked from the location bar: javascript:window.clipboardData.setData('Text', window.frames[0].frames[3].document.body.innerHTML). This step had to be done in IE; while Mozilla also supports copying to the clipboard, the code to do so is much longer. The extracted source could then be pasted into a file, run through HTML Tidy (for easier reading) and examined.

For on-the-spot examinations of the DOM hierarchy and for checking to see which styles were being applied, I used the always-handy Web Developer Extension. The extension also had the added benefit of allowing me to modify and reapply the stylesheet without restarting Firefox via the "Apply User Style Sheet..." command from the "CSS" button.

Writing the actual stylesheet was somewhat tedious since Gmail's markup is rather convoluted, unlike the academic cleanness of the CSS Zen Garden markup. Styles are sometimes inlined, thus one is compelled to use the !important value in order to override them. Of course, once you start using that for your base elements, your child elements have to use it as well, lest their properties be overridden too. However, the number of hacks was kept to a minimum, and Mozilla-only features such as -moz-border-radius and -moz-opacity made life easier.

Getting Pictures onto the v710 #

There's been some hoopla about Verizon crippling the Bluetooth implementation on the Motorola v710 phone. Having recently switched from my T610, I've been struggling with transferring all of my old contacts and pictures (crippled Bluetooth or not, T-Mobile's bad reception was unbearable). Contacts can be retyped, but pictures are harder to reconstruct.

Having signed up for vtext.com, I also had a vzwpix.com account. That account provides a way to upload files and then send (presumably) an MMS message with the pictures of your choice to any Verizon subscriber. However, sending pictures to myself didn't seem to work: the phone would receive the message notification, but it was never able to download the message itself.

I then tried sending pictures as attachments via Mail.app, using 10digitphonenumber@vzwpix.com as the destination. This time the message was fully received, but it claimed that attachments had been stripped, due to AppleDouble being an unknown encoding scheme. Finally, it dawned on me to check the "Send Windows Friendly Attachments" checkbox when adding an attachment (this doesn't seem to be an option when using drag and drop to attach a picture, which is what I tried the first time around).

As a final note, the optimal wallpaper size is 176 x 220 (a bit smaller than the screen size since the status bar at the bottom is always present).

JavaScript Associative Arrays #

There seems to be some confusion regarding associative arrays in JavaScript (i.e. doing searches on the matter turns up many pages giving wrong information). First of all, these arrays (which act as hash tables) have nothing to do with the built-in Array object. They simply rely on the fact that object.property is the same as object["property"]. This means that the length property is not used, nor do any Array methods (such as join) do anything. In fact, it is better to create the associative array using the generic Object() constructor to make this clearer.
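A quick illustration of both points (the property names are made up):

var ages = new Object();  // clearer than new Array() for a hash-like object
ages["alice"] = 30;
ages.bob = 25;            // equivalent to ages["bob"] = 25
alert(ages["bob"]);       // 25
alert(ages.alice);        // 30
alert(ages.length);       // undefined - no length is maintained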

The way to iterate over the items in an associative array is to use the for (key in array) construct, allowing you to access each item's value via array[key]. It appears that the order in which properties (i.e. items) are traversed is implementation dependent. The ECMAScript specification is pretty vague on the matter, saying (in section 12.6.4) "Get name of the next property of [the object] that doesn't have the DontEnum attribute. If there is no such property, go to [the end]". Firefox, Safari and MSIE appear to traverse items in the order in which they were inserted, while KHTML (within KDE 3.1) and Opera (at least through 7.54) use a seemingly random order that presumably reflects their respective hashtable implementations.

The iteration order can be tested using a very simple code snippet such as this (click here to run it):

var items = {"dioxanes": 0,  "shunning": 1,  "plowed": 2,
            "hoodlumism": 3, "cull": 4,      "learnings": 5,
            "transmutes": 6, "cornels": 7,   "undergrowths": 8,
            "hobble": 9,     "peplumed": 10, "fluffily": 11,
            "leadoff": 12,   "dilemmas": 13, "firers": 14,
            "farmworks": 15, "anterior": 16, "flagpole": 17};
	
listString = "";
  for (var word in items)
    listString += items[word] + ", ";

alert(listString);

If the list of numbers appears in ascending order, then the browser preserves the insertion order. If you are in fact looking to traverse the object's properties in the order they were inserted in, regardless of browser implementation, you'll have to create a (possibly doubly) linked list that you can use to jump from object to object.
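A minimal sketch of that idea (the structure and its method names are my own, not from any library):

// Each entry remembers which key was inserted after it, so traversal can
// follow insertion order explicitly instead of relying on for...in.
// (Re-inserting an existing key is not handled here.)
function OrderedMap()
{
  this.items = {};   // key -> {value: ..., next: ...}
  this.first = null;
  this.last = null;
}

OrderedMap.prototype.put = function(key, value)
{
  this.items[key] = {value: value, next: null};
  if (this.last != null)
    this.items[this.last].next = key;
  else
    this.first = key;
  this.last = key;
};

OrderedMap.prototype.each = function(callback)
{
  for (var key = this.first; key != null; key = this.items[key].next)
    callback(key, this.items[key].value);
};

var map = new OrderedMap();
map.put("dioxanes", 0);
map.put("shunning", 1);
map.each(function(key, value) { /* always visited in insertion order */ });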

ourTunes: Search iTunes Shares #

Just as has happened before, my incredible ability to procrastinate (or, more precisely, to start too many projects) meant that someone else not only had the same idea (not all that surprising, really) but implemented it as well.

Commercial Skipping with QuickTime Player a.k.a. Poor Man's TiVo #

With the help of my Formac Studio and Vidi (since Formac's programmers are incompetent and want more money for software that works with the current release of Mac OS X) I've been watching TV on my Mac. Vidi supports a basic recording mode (even with channel switching when using the Studio's TV tuner), but playback via QuickTime Player leaves something to be desired, especially for someone used to the TiVo UI (with the 30-second skip hack of course).

Recalling that QuickTime Player is AppleScriptable, I decided to at least cobble together a 30-second commercial skipping function. A short while later I had the following script:

tell application "QuickTime Player"
	activate
	if not (exists movie 1) then return
	set theMovie to movie 1
	stop theMovie
	set currentTime to theMovie's current time
	set timeScale to theMovie's time scale
	set the current time of theMovie to currentTime + 28 * timeScale
	play theMovie
end tell

(Technically it skips only 28 seconds, to allow for human reaction time and execution latency; this number may need some tweaking). Now that I had the script (which ran as expected when executed from the Script Editor) I needed some way to invoke it from the QuickTime Player. I initially tried adding it to the Script Menu, which, although it worked as expected, provided no way of attaching a keyboard shortcut. This tip suggested that it might be possible, but even after restarting SystemUIServer (which acts as a host for the Script Menu extra) the key shortcut only worked when the cursor was over the menu extra area.

Presumably a third party solution like iKey could be used, but I wanted to see if I could make do with what I already had installed. It turned out that the Microsoft Mouse software supports attaching applications to the mouse's buttons. Wrapping the above script in an on reopen/end reopen block and saving it as an application that stays open and has no startup screen worked pretty well. The reopen block and the "stay open" option are necessary since launching the script on every click would take too long. Presumably a timer that quits the script after a certain period of inactivity (e.g. once QuickTime Player quits) would be even more elegant, so that the Dock isn't littered with yet another icon.

Facilitating Blog Navigation #

Disclaimer: This is the result of my idle thoughts on a Sunday night, and thus is not scientific in any way. No actual research was done.

Although some people seem to think that a newsreader is the be-all-end-all way to read blogs, there are really quite a few approaches. Site organization and navigation structure can aid these different reader modes, though the design decisions may not always be easy. Facilitating one may hinder another; trade-offs have to be made. To be more specific, here are a few reader "profiles" and the things that impact them:

Subscriber

This person is subscribed to your RSS/Atom feed, and is aware of every single entry you publish (reading it is a different matter). If you provide a full-content feed and they choose to read it in their aggregator, then there isn't much that needs to be done, beyond making sure that your feed does in fact accurately reflect your site (e.g. for a while I had forgotten to include all of my MTMacro definitions in my feed template). If they end up reading the entry in their browser, it is best to minimize the extraneous clutter that surrounds the entry text. This can be as simple as making sure that each entry has its own page, but it may also involve removing any sidebars, headers or footers that aren't really relevant to that entry.

Periodic Reader

Since the magic that is RSS/Atom hasn't reached all corners of the earth, some loyal readers may still resort to visiting your site periodically, to see what new things you have posted. Or perhaps you don't provide a feed so readers have no choice but to check the old-fashioned way, via bookmarks. Such readers still visit often and thus are familiar with your site's organization; therefore the previous clutter minimization strategy still applies. However, it is also important to make it easy to see which entries are new. Basic things like making sure you have a different color for visited links matter. The traditional reverse-chronological sorting can be somewhat annoying, but assuming that the last entry that the reader remembers hasn't fallen off the front page, it should be a matter of scrolling down to it and then reading one's way back up. If in fact there have been so many updates since the last visit that all front page entries are new, then more aid is required. The calendar is of some benefit, although it does require the user to remember the approximate date of their last visit. Providing a way to navigate from entry to entry also helps. Perhaps the best solution is to convince such readers to subscribe to your feed, to make both your lives easier.

Off-site Visitor

Depending on which pundit you listen to, linking is the essence of blogs. When a visitor first comes across your site as a result of a cross-site link, they are placed in a pretty unfamiliar environment (depending on how much you deviate from standard templates). The link that induced them to visit your site may have provided some context, but that may still not be enough. A sidebar that points to the "surroundings" of this entry (posts in the same category or close chronologically) may help a reader who does not have the background knowledge of a frequent one.

Google Searcher

Someone coming across one of your entries as the result of a search may be considered a subset of the previous reader type. The key difference is that they have nearly zero context, beyond (possibly) the snippet that was in the search result listing. Having individual entry pages is key (assuming the search engine is clever enough to favor those over time or category archives). There are ways to make the searcher's life easier, but in my experience they are of limited usefulness. Search engines themselves strive to solve this problem as well (e.g. Google's cache with keyword highlighting), and in the meantime "surroundings" suggestions apply.

Absorber

Perhaps this is the procrastinator in me speaking, but I periodically come across a blog that seems interesting and focused enough that I want to read it from beginning to end (or end to beginning if I'm feeling adventurous and/or want more timely information first). Facilitating this is as simple as providing previous/next links on the individual entry pages, but it's surprising how some default templates don't support this behavior. The calendar can be used as a substitute, but it requires that a new target (the day following this one) be reacquired upon every click, and thus is suboptimal.

This entry goes in tandem with a slight redesign that aims to provide more context, especially for individual entry pages. I had previously used a near-default Movable Type template, which limited itself to previous/home/next links. In the new version, this is replaced by a Jeremy-like sidebar that gives a bit more information. I'm obviously not the first to think about such things, and I don't claim my solution as being optimal, but I now have something that can be iterated upon.

Three Random Tips #

To get Firefox (presumably applies to Mozilla too) to show its JavaScript console at startup, set its homepage to "javascript:"

To get slightly better performance from a VNC server running on a Mac OS X machine, run ShadowKiller. No shadows means fewer updated pixels, and the ones that do change are more easily compressed. Presumably running some kind of theme that has less translucency than Aqua would help even more.

Even if Mail.app is set to remove messages from a POP server "right away", there are some circumstances in which they will be left there indefinitely. Specifically, if you have a rule that auto-deletes certain messages (say, to remove high-scoring spam messages), then Mail.app fetches them from the server, checks them against the rule set, discards them, but forgets to also remove them from the server. Until this bug is fixed, my workaround has been to modify the rule so that it just moves those messages to the Trash and marks them as read. This does mean that I have to remember to empty the trash periodically, but 1) I do that already anyway and 2) it's better than ending up with 40,000+ messages on the server and having my webmail client and other not-so-robust programs die.

Pseudo-Local W3C Validator Favelet #

My to-do list has had an "install local copy of the W3C validator" item on it for quite a while, and when I came across an article detailing how to do just that, I thought I was all set. However, my excitement faded shortly after I saw the steps required to do the installation. A CVS checkout, replacing some files with Mac OS X-specific ones, Apache config file editing, two libraries to download and install, and fourteen Perl modules to set up. I resigned myself to an hour or two of drudgery, and went through the list. I eventually stumbled when trying to set up the OpenSP library: I didn't feel like installing Fink just for this one thing, and my attempt at building it by hand didn't quite work out (the libtool that was included wasn't the right one for OS X).

Rather than force myself to go through with the rest, I decided that perhaps an alternative approach was worth investigating. Instead of running my local (behind the firewall) documents through my own validator, I could transfer the file to another server, and then point the regular W3C validator to its (publicly-visible) temporary URL. Doing this in the form of a favelet/bookmarklet seemed ideal, since it would provide one-click access and be more portable than a shell script. This favelet would then invoke a CGI script on my server; a hybrid design in the style of my feed subscription favelet.

The first thing that must be done is to get the current page's source code. Initially, an approach based on the innerHTML DOM property seemed reasonable. However, it turned out that this property is dynamically generated based on the current DOM tree, and thus not necessarily reflective of the original source. Furthermore, it's hard to get at the outermost processing instructions in XHTML documents, thus the source wouldn't be complete anyway. Therefore, I decided to use an XMLHttpRequest to re-fetch the page and then get its source by using the responseText property. Unfortunately at this stage Internet Explorer support had to be dropped, since its equivalent ActiveX object didn't seem to want to run from within a favelet (clicking on the identical javascript: link in a webpage worked fine).

With the source thus in hand, I had to find a way to get it on the server. The XMLHttpRequest object also supports the PUT HTTP method, but apparently Safari only supports GET and POST. In any case, the object's use is restricted for security reasons, and so it would've been difficult to make any requests to a server different than the one hosting the page that was to be validated. However, the other, more old-school way of communicating with servers, via a form, was still available. Therefore the favelet creates a form object on the fly, sets the value of a hidden item to the source, and then passes it on to the CGI script. The script generates a temporary file and then passes it on to the validator.
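Condensed, the favelet's two steps look something like this (the proxy script's URL is a placeholder, and the real favelet includes more error handling):

// Step 1: re-fetch the current page synchronously to get its original source
var req = new XMLHttpRequest();
req.open('GET', location.href, false);
req.send(null);

// Step 2: POST that source to the server-side script via an on-the-fly form
var form = document.createElement('form');
form.method = 'POST';
form.action = 'http://example.com/cgi-bin/validator-proxy.cgi';  // placeholder
var field = document.createElement('input');
field.type = 'hidden';
field.name = 'source';
field.value = req.responseText;
form.appendChild(field);
document.body.appendChild(form);  // the form must be in the document to submit
form.submit();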

The validator favelet is thus ready to be dragged to your toolbar. The original, formatted and commented source code is also available, as is the server-side script that receives its data and passes it to the W3C validator. The development process of this favelet was made much more pleasant due to the generator that can be used to transform human-readable code into something that's ready to be pasted in a javascript: URL.

Full disclosure: For some reason, perhaps because it was 1 AM, it didn't occur to me to use the POST method to submit the source. Instead I devised a (convoluted) method that would take the source, divvy it up into ~2K chunks, and then create a series of iframes that would have as their src attribute the CGI script that took the current chunk as its query string (i.e. via the GET method). Since there was no guarantee that all of the chunks would arrive in order, I had to keep track of them and eventually join them to reconstruct the original source (à la IP packet fragmentation). You'd think that I would have realized the folly (i.e. difficulty out of proportion to benefit) of this approach early on, but no, I pursued it until it worked 95% of the time (modulo some timing issues in Firefox). Only when I was researching this entry did I realize that the form/POST approach was much faster (each chunk required a new HTTP connection and a fork/exec on the server), and ended up implementing it in 15 minutes with half the code. Chalk one up to learning from your mistakes (hopefully).

JavaScript DOM Iteration and Date Function Optimization #

Short Version

Instead of row[k], use nextSibling to iterate over table rows in Safari and Firefox. Firefox has slow date functions; cache values whenever possible. Setting node values by using a template node subtree that you cloneNode, modify and insert is faster in Firefox and MSIE whereas setting the innerHTML property is faster in Safari.

Long Version

For my primary ongoing proto-project, I needed to do some transformations on a table's contents, to make them more human-readable. Specifically, I was iterating over all of its rows, picking a cell out of each one, extracting its contents (a date in the form of seconds from the Epoch), converting the timestamp to a nice human-readable string form, and replacing the cell's contents with that.

This all sounds very simple (and it was indeed easy to code up), but the performance I was seeing was not all that impressive. Specifically, here are the runtimes (in milliseconds) on a table with 674 rows in the three browsers that I usually test with:

Safari:1649.9 Firefox:2578.6 MSIE:618.9

Safari refers to the 1.3 beta release with the faster JavaScript implementation, while Firefox is the standard version 0.9.1 and IE is of course Microsoft Internet Explorer 6.01. The first two browsers are running on a 1 GHz TiBook, while IE was tested on a 1 GHz Centrino laptop.

The hardware may not match exactly, but we are (hopefully) looking for performance differences on the order of tens of percents, not the 2-4x difference that we see in the above times. I decided to investigate further, by trying to see how much time was spent just doing the iteration. This was a very simple traversal, done in the following manner:

for (var i=0; i < itemsTable.rows.length; i++)
    for (var j=0; j < itemsTable.rows[i].cells.length; j++)
        transformFunctions[j](itemsTable.rows[i].cells[j]);

(I am paraphrasing a bit, but performance characteristics were similar even when caching the two length values). Taking out the call to the transformation function, but leaving in a dummy assignment like var k = itemsTable.rows[i].cells[j] to make sure that all relevant table cells were accessed resulted in the following runtimes (the iteration was repeated ten times to reduce issues with timer accuracy, times presented are still per run and thus directly comparable to those above):

Safari:893.9 Firefox:667.1 MSIE:37.1

As can be seen, Safari spends half its time just iterating over the elements, while Firefox needs a fifth of its longer runtime for the same task. Only in MSIE's case does the iteration represent a negligible portion. This prompted me to evaluate a different iteration method, one that moves from row to row and from cell to cell using a node's nextSibling property, like this:

for (var row = itemsTable.rows[1]; row != null; row = row.nextSibling)
    if (row.nodeType == 1)
        for (var cell = row.firstChild, i = 0; cell != null; cell = cell.nextSibling)
            if (cell.nodeType == 1)
                transformFunctions[i++](cell);

(The nodeType == 1 comparison is needed since whitespace between rows/cells may be included as text nodes, whose type is 3.) This code snippet (again with the call to the transform function suppressed) resulted in:

Safari:40.0 Firefox:170.0 MSIE:38.9

MSIE is barely affected, while Safari and Firefox see very significant improvements. I'm guessing that the IE JavaScript/DOM implementation is optimized for the row[k] access method (perhaps by pre-computing the array when the DOM tree was generated), whereas the other two browsers simply step along k times through the linked list of nodes every time there's an index request. A bit more digging would reveal whether Firefox and Safari have N² behavior for various table sizes, confirming that this was the case. At any rate, this significantly sped up the overall processing time, and now the bottleneck was the actual transformation that the function was doing. To see where it was spending its time, we ran it so that it just did the timestamp to string computation, as compared to doing the DOM operations as well (getting the cell's contents and later replacing them):

Iteration and Computation

Safari:208.1 Firefox:941.6 MSIE:95.1

Iteration, Computation and DOM Operations

Safari:550.3 Firefox:2081.7 MSIE:620.7

It seems that Safari and IE spend more time with DOM operations, while in Firefox's case the split is more even. As a quick change to help with the DOM issue, I changed the way I obtained the cell's contents. Rather than getting its innerHTML property, which presumably requires the browser to reconstruct the text from the node hierarchy, we rely on the fact that we know the cell's contents are plain text, and thus we can get its value directly with cell.firstChild.nodeValue. Running with this code gets us:

Safari:522.7 Firefox:2041.0 MSIE:587.7

A small (and reproducible) improvement, but not significant perceptually. I next decided to focus on the date operations themselves. The string conversion is done in two parts, one for the date and one for the time. In my application's case, it is likely that the same date will appear several times, therefore it makes sense to cache the strings generated for that, and reuse them in later iterations instead of recomputing them. With this particular dataset, the hit rate of this cache was 86.7%. Running the entire thing, we get:

Safari:474.3 Firefox:1839.7 MSIE:571.0

Firefox is helped the most, confirming the previous observation that its date functions are slow. I then realized that once I created a date object, I kept making calls to its accessor functions (getMonth(), getFullYear(), etc.). The small tweak of making these calls only once and remembering their values for later comparisons resulted in:

Safari:463.3 Firefox:1548.3 MSIE:571.0

Firefox is significantly helped yet again, and now for it too, the DOM operations dominate the runtime. As a final attempt in tweaking them, I tried a different approach when changing the cell's contents. Normally, I don't just get the resulting date string and use it; rather the date and time portions are put into different <span>s, aligned on opposite sides of the cell. Rather than generating these spans by setting the innerHTML property of the cell and letting the browser parse out the DOM tree, I attempted to create the sub-tree directly. I first created a static "template" subtree at initialization time. Then, when it came time to set the cell's contents, I made a (deep) copy of this template subtree by using cloneNode and replaced the values of its two text nodes with the strings. Finally, I replaced the original text child of the cell with this new sub-tree. Timing this resulted in:

Safari:805.0 Firefox:1483.0 MSIE:433.7

For the first time we see an optimization that hurts one browser (Safari) while helping the two others (Firefox and MSIE), to significant degrees in both cases. In my case, I decided to simply revert back to the innerHTML method, but it may be worth it to actually support both methods and switch based on the user's browser.
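For reference, the template-subtree approach amounts to something like the following (the class names are invented for illustration; the real code differs in its details):

// Build the two-span structure once, at initialization time...
var template = document.createDocumentFragment();
var dateSpan = document.createElement('span');
dateSpan.className = 'date';
dateSpan.appendChild(document.createTextNode(''));
var timeSpan = document.createElement('span');
timeSpan.className = 'time';
timeSpan.appendChild(document.createTextNode(''));
template.appendChild(dateSpan);
template.appendChild(timeSpan);

// ...then, for every cell, deep-copy it, fill in the two text nodes and swap
// the copy in for the cell's original text node (no innerHTML parsing needed)
function setCellDate(cell, dateString, timeString)
{
  var copy = template.cloneNode(true);
  copy.firstChild.firstChild.nodeValue = dateString;
  copy.lastChild.firstChild.nodeValue = timeString;
  cell.removeChild(cell.firstChild);
  cell.appendChild(copy);
}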

Finally, here are the percentage speedups that all of the tweaks brought (using the fastest time for each browser):

Safari:256% Firefox:74% MSIE:43%

I must say, I'm quite impressed with IE's JavaScript implementation, especially considering how many (perceived) issues there are with its other components, like its CSS engine.

Bookmarklet/Favelet Generator #

I don't know what prolific authors of favelets/bookmarklets do when they code them, but I find the process rather annoying. Having to strip out all newlines, and preferably all extraneous whitespace (for the sake of shorter URLs) gets tedious when revising a script that's more than a few lines long.

As a result, I've come up with the following Perl script that transmogrifies a readable JavaScript source file into a single compressed line, ready to be used as a favelet:

#!/usr/bin/perl -w

use strict;

my $bookmarklet = 'javascript:';
my $inComment = 0;

while (<>)
{
  chomp;
  s/^\s*(.*?)\s*$/$1/;        # whitespace preceding/succeeding a line
  s/([^:])\/\/.*$/$1/;        # single-line comments ([^:] is to ignore double slashes in URLs)
  s/^\/\/.*$//;               # whole-line single-line comments
  s/\s*([;=(<>!:,+])\s*/$1/g; # whitespace around operators
  s/"/'/g;                    # prevent double quotes from terminating a href value early
  
  # multi-line comments
  if ($inComment)
  {
    if (s/^.*\*\/\s*//) # comment is ending
    {
      $inComment = 0;
    }
  }
  elsif (s/\s*\/\*.*$//) # comment is beginning
  {
    $inComment = 1;
    $bookmarklet .= $_; # we have to append what we have so far, since
                        # the line below won't get triggered
  }
  
  $bookmarklet .= $_ if (!$inComment);
}

print $bookmarklet;

print STDERR "Bookmarklet length: " . length($bookmarklet) . "\n";

The contents of your .js file should obviously serve as the standard input of the script, and for maximum efficiency (on Mac OS X) its output should be piped to pbcopy so that it can be pasted into a browser's location bar for easy testing. The length of the favelet is also printed, since some browsers impose a maximum URL length.

Not quite a favelet IDE, but it certainly makes life easier.

All Mac Developers Think Alike? #

Are these people the competition or possible beta testers?

Reshapable Tables #

Despite all the ranting about tables vs. CSS, one place where they are unequivocally suitable is when displaying tabular data (though some people will mysteriously argue even about that). However, despite the fact that native OS controls for displaying such data offer all sorts of features, (X)HTML tables only provide the most basic functionality. Some enhancements have been made, but they are mostly for the sake of semantics or accessibility (<thead>, <tbody>, etc.) The W3C has only given us weakly worded proposals such as "User agents may exploit the head/body/foot division to support scrolling of body sections independently of the head and foot sections." (emphasis mine).

For an ongoing project I need a little more, and so I have come up with this more capable approach. Though it is still feature-poor compared to a native control, it is still a step forward (if we can have a wrapper for NSSearchField perhaps one for the data browser/NSTableView is warranted too?).

Firstly, column headers now stay put while the rest of the table is scrolled. This is primarily accomplished via the position: fixed CSS attribute. However, since IE lacks support for it, an alternative approach (suggested by this page) using dynamic properties had to be used. This basically sets the top attribute to a script snippet that is dynamically (re)evaluated to the document's vertical scroll offset. This does have the drawback that the header row had to be put in a separate table from the <tbody> portion, and in order to get the cells of these two sections to line up, the table-layout: fixed attribute had to be used.

However, there is a good reason to use fixed table layout anyway, since it allows us to implement the second feature, resizable columns. Making the dividers draggable was done by overlaying on top of their borders some <div>s with onmousedown/up/move handlers, in a similar manner to the magnifier. However, instead of moving a <div>, we compute the horizontal percentage of the mouse's position and use that to change the table's column widths*.

Another trick was to have the draggable column divider <div> change its width when a drag begins. Normally it is only ten pixels wide, centered on the column border. However, once a drag begins it expands to fill up the entire width of the page, with the illusion of movement accomplished by shifting the background position. This widening prevents the mouse cursor from escaping the divider's boundaries if it is moving too quickly. Attaching the onmousemove handler to the document's body didn't work with Mozilla/Firefox, since the handler wasn't invoked when the cursor was over the bare body (as opposed to another node).

Finally, text selection was disabled, since the effect was distracting when dragging a column divider. This required the -moz-user-select CSS property in Mozilla-based browsers (based in turn on the user-select property in a CSS3 draft). Safari and IE do not implement it, but they do support the onselectstart handler, and setting it to a function that returns false prevents selections in those browsers.
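Stripped of the bookkeeping, the drag logic looks something like the sketch below (the header table id and the two-column case are placeholders for illustration; the real ColumnGrabber handlers are more general):

// Assumed to be attached to each divider <div>'s onmousedown/onmousemove/onmouseup;
// widening the <div> on mouse down is what keeps it receiving the move events.
var dragDivider = null;

function dividerMouseDown()
{
  dragDivider = this;
  dragDivider.style.width = '100%';  // widen so fast mouse moves can't escape it
  return false;
}

function dividerMouseMove(event)
{
  if (dragDivider == null) return true;
  event = event || window.event;     // IE keeps the event on window.event
  var table = document.getElementById('headerTable');  // placeholder id
  // express the mouse position as a percentage of the table's width
  // (ignoring scrolling and offsetParent subtleties)...
  var percent = 100 * (event.clientX - table.offsetLeft) / table.offsetWidth;
  // ...and resize the two header columns on either side of the divider;
  // the body table is only brought back in sync when the drag ends
  table.rows[0].cells[0].style.width = percent + '%';
  table.rows[0].cells[1].style.width = (100 - percent) + '%';
  return false;
}

function dividerMouseUp()
{
  if (dragDivider == null) return true;
  dragDivider.style.width = '10px';  // back to straddling the column border
  dragDivider = null;
  return false;
}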

Speaking of specific browsers, it should be clear that I made all efforts to make sure that this table implementation is compatible with Safari, Mozilla, Firefox and IE 6. Since the JavaScript simply adds behaviors to a static table, the page degrades well when scripting is disabled. Its behavior in older or accessibility-oriented browsers should also be minimally impacted.

Incidentally, a bit of Googling turned up the fact that apparently the resizable columns feature (as implemented for IE only) is worth $24.95, so clearly this entry is quite a bargain.

* For performance reasons only the header widths are updated dynamically, with the rest of the table being refreshed only at the end of the drag. To update everything dynamically, it is only necessary to set the argument to UpdateColumns in ColumnGrabberMouseMove to true.

Gmail vs. Grendel #

About the Gmail engine and protocol

Whenever you log in to Gmail, a copy of the UI engine is loaded into one of the HTML page frames and remains there for the duration of your session. [...] HTTP requests to the Gmail server return the “DataPack”, a base HTML file that contains only JavaScript array declarations that the UI engine parses and then uses to determine what to update.

This journal, on December 12, 1999:

The main thing that I've done with Grendel was do implement the hybrid JavaScript/Perl system. When getting the messages from the server, the script generates a JavaScript which creates new JS objects (in the buttons frame). Then I have a local array of Message objects. To display them, I simply loop over them and call their Display method. The idea here is that if I want to change the sorting method, I can do all of it locally, instead of having to go through the server. Things like deleting messages are faster too. Instead of having to get the new message list from the server, I can delete the specified array memeber locally, and then tell the server which messages has been deleted, and update the message view page with the next one.

If only I could've told my 18 year-old self that my approach was to be validated by a company such as Google. I'm not sure whether it's a good thing (the platform is stable) or bad thing (the platform is stagnating) that 5 years down the line the best practices for web apps haven't changed all that much.

Bus Times On Your Cellphone #

Cellphone showing bus times

Perhaps it's a sign that my middle class life is becoming too cushy if bus schedules are a significant enough annoyance that I decide to do something about them. Then again, we are talking about the bus (and not my private limo) so I can't be too spoiled. At any rate, this scenario is very representative of the problem:

After a day in the city, I am heading back to the Port Authority terminal. A decision must then be made: should I hurry to the gate on the chance that there's a bus waiting there, or should I take my time, stop off at the Au Bon Pain, restroom, etc. The bus does in fact have a schedule, but it is a hassle to remember to bring it, pull it out at the appropriate time, and locate the next departing bus in a page full of numbers. The annoyance is doubled by the fact that there are two bus lines which I could possibly take, thus there is a need to check two schedules, carry two pieces of paper, etc.

In theory the solution would be simple: an electronic display at the entrance to the terminal, showing the next departing bus from each gate. Perhaps even a NextBus-powered one, but that would be asking too much. In the absence of such a convenience, I am forced to take matters into my own hands. The one item that I am much less likely to forget to take with me is my cellphone, thus a solution involving it seems reasonable. The schedule is in fact available on-line, so an approach similar to the one I chose for my cellphone-based dictionary is a possibility. However, the latency of sending out a SMS message and waiting for a response is too great for a time-critical situation such as this one.

A non-networked, local, approach is therefore preferable. My T610 is Java-enabled, supporting MIDP 1.0, so writing a simple Java app seems reasonable. Searching turned up this article, showing how to do MIDP development on Mac OS X. Compiling the "Hello World" example worked fine, modulo the need to not have any spaces in the path to the MIDP installation. As the article mentions, deployment can be done via Bluetooth, and indeed transferring the JAR file resulted in the application showing up in the "Games & More" category. Transferring the JAD didn't seem to work however (the phone didn't know what to do with it), so I had to give up the use of resources/properties that the original "Hello World" app made use of. Considering that this is a one-off application, it's not that big of a deal.

All that was left was to actually write the application, which I cobbled together from this article that explains the basic GUI concepts, while the MIDP 1.0 spec filled in the details. The only gotcha encountered was that, when running on the emulator, the Calendar instance returned by Calendar.getInstance() was set to GMT time. However, when running on the phone it did use the local timezone, so this only made testing a bit more difficult. There was also the minor annoyance that the Mac/phone Bluetooth pairing would be lost after every transfer, but I'm not sure where the blame for that lies.

The BusTimes.jar is ready for deployment on a phone, while the complete source package is available as well (the picture shows roughly what to expect when running it). There is support for Saturday/Sunday schedules, but holiday computation wasn't added. In the event of a schedule change, things may get a bit tedious, but that doesn't happen too often.

Now, if only the J2ME implementation in my T610 supported access to Bluetooth and IR. But I guess that's always an excuse to upgrade.

Early Bird... #

What seems to happen when you take too long on your pet project is that someone else gets to it ahead of you. Though perhaps, in this particular case, there's room for improvement.

NetNewsWire Subscription Favelet #

As I was pondering the implications of the Safari RSS announcement (my thoughts pretty much mirror Brent's) I realized that a web browser/feed aggregator combo does have one thing going for it. Subscription to a site right now requires at best a drag-and-drop of the site's URL into NetNewsWire, and at worst looking through the page (or its source) for the (possibly orange) RSS/XML/Atom icon. The blue logo that Safari RSS adds to the location bar makes it much more obvious when a site has a feed, and this is some functionality that NNW doesn't have. It could perhaps be added to current versions of Safari with an InputManager extension à la PithHelmet or Saft. Clicking on it would subscribe to the site's feed using NetNewsWire, or whatever aggregator was registered as a handler for the feed:// protocol. However, this requires a level of Cocoa-based hacking that I'm not familiar with and am not prepared to learn just yet.

As a hack-ish alternative, a favelet could be used to at least allow one-click subscribing to a site. Such favelets/bookmarklets already exist, but they rely on RSS Auto-Discovery. The problem with that is that it is a relatively brittle approach, i.e. it fails completely if a site doesn't use auto-discovery, even if it may have a feed. Mark Pilgrim's Feed Finder tries much harder to find the RSS or Atom feeds that relate to a page, while doing elegant things like respecting a site's robots.txt file. Re-implementing its functionality in JavaScript didn't seem like an appealing option, so instead I used a hybrid approach. This simple CGI script lives on my server, and calls Mark's tool (I am a total Python newbie, thus there may be better ways of achieving this):

import os, feedfinder

if os.environ.has_key('QUERY_STRING'):
  uri = os.environ['QUERY_STRING']
  try:
    feeds = feedfinder.getFeeds(uri);
    if (len(feeds) > 0):
      print "Location: feed://%s\n\n" % (feeds[0])
    else:
      print 'Content-type: text/html\n\nNo feed(s) found at the given URL.\n' 
  except IOError:
    print 'Content-type: text/html\n\nCould not access given URL.\n'
else:
  print 'Content-type: text/html\n\nNo URL was given.\n'

The Subscriber favelet then simply invokes it with the current page's URL. If a feed is found, the redirect within the CGI script triggers the feed:// protocol handler, thus invoking the aggregator. Such a favelet has the advantage of being easily updatable; if the auto-discovery standard were to change or if Mark figures out even more clever ways of determining if a site has a feed, then all its users would immediately benefit.
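The favelet itself can then be as simple as a redirect along these lines (the script path is a placeholder):

javascript:location.href='http://example.com/cgi-bin/subscribe.py?'+location.href

Since the CGI script reads the raw QUERY_STRING, the page's URL is passed along unencoded here; a more careful version would escape it on the way out and unescape it on the server.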

Custom Draggable Frame(set) Borders #

Frames, as represented by the <frameset> and <frame> tags, have been put in the DO NOT column by many people since the day they were introduced, so many years ago, with the advent of Netscape 2.0. Initially this was with good reason; broken implementations meant that the back button and bookmarking did not behave as expected in their presence. However, browser support has advanced to the point where they are not serious hindrances to navigation, and in some cases (say, when implementing a web version of a multi-paned app) they represent the best tool at hand.

With their need and usefulness thus established, we can turn to more superficial matters: aesthetics. Frame borders are one of the few areas where a web designer has minimal control. Netscape, in its day, supported a proprietary bordercolor attribute, but this has been dropped from the equivalent XHTML module. In any case, this was coarse control over the color of the border, and nothing more.

The obvious workaround is to create another frame of a narrow width (or height) and within this fake divider specify the border appearance as desired. The trade-off was that frame resizability was lost - at best the user has to drag two sets of borders (on either side of this faked divider). Fortunately, support for the browser DOM has advanced to the point where we can do something about it.

Roughly speaking, we can use the same method that enabled us to make the draggable magnifier. In this case, instead of moving a <div> in response to mouse move events, we change the frameset's cols property (or rows, as appropriate). Additionally, while we previously used an event's pageX/Y fields to determine the mouse's location, since the page itself is now shifting, these would not give us accurate data. Instead, we can use screenX/Y to get the coordinates relative to the user's screen, something that won't be changing on us.
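A stripped-down sketch of the idea, assuming a <frameset id="container" cols="200,10,*"> where the narrow middle frame is the fake divider (this script would live in the divider frame's document):

var lastX = null;

document.onmousedown = function(event)
{
  event = event || window.event;  // IE keeps the event on window.event
  lastX = event.screenX;          // screen coordinates stay meaningful even
  return false;                   // though the frames shift under the mouse
};

document.onmouseup = function() { lastX = null; };

document.onmousemove = function(event)
{
  if (lastX == null) return;
  event = event || window.event;
  var frameset = parent.document.getElementById('container');
  var widths = frameset.cols.split(',');
  widths[0] = parseInt(widths[0], 10) + (event.screenX - lastX);
  lastX = event.screenX;
  frameset.cols = widths.join(',');  // e.g. "215,10,*"
};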

The implementation then becomes very simple, as this example shows. There is the caveat that Firefox seems to have issues if the mouse is moved too fast (if it escapes beyond the boundaries of the divider frame, it stops receiving events), but Safari and IE 6 have no problems with the technique. A possible workaround to the Firefox limitation is to use inline frames and then attach the event handler to the root document, but this may not degrade as nicely (the given approach would look OK even with the oldest frames-supporting browser (albeit with no resizing support), while iframes are a more recent development).

RSS-based PageRank™ Monitoring Tool #

Short Version:

Recent revelations have made it very easy to determine the checksum necessary to request a URL's PageRank from Google. Some people, for commercial, egoistical or other reasons, are interested in knowing when the PR of certain URLs changes. As a result, I've whipped up pagerank2rss.pl, a simple Perl script that outputs (in the RSS format) the PR of a list of URLs. Simply drop it in your web server's cgi-bin directory and point your aggregator at it. To change the URLs that it monitors, open it up in a text editor, and modify the %pageRankURLs hash with the URL and its checksum, as computed with the help of this site.

Long Version:

It is argued that Google's PageRank is dead. Though that debate is still unsettled (and will continue to be so, unless Google states outright that they are discontinuing it), one PageRank-related thing is in fact dead. Until recently, the only way to get a site's PageRank was to use the Google Toolbar, which relied on a private "channel" to find out the information from Google's servers. This appears to consist of a private URL to which one can pass a URL and its checksum as arguments, receiving that page's rank in return.

Lately, this (obfuscated) barrier has been dented, with the advent of Prog (née Proogle) and the attempted auction of the (reverse-engineered) algorithm on eBay. This culminated today in the public domain distribution of an implementation.

What Google will do about this (if anything - Prog has been up and running for a while, and the eBay auction was not stopped, though thankfully no one was clueless enough to bid either) remains to be seen. Regardless, in the meantime there are a few interesting things that can be done with this bit of information. The first that came to my mind was a simple script that monitors the PR of URLs and reports the results as an RSS feed. Hooked up to an aggregator, it's now possible to see when a URL's rank changes. Presumably this is old news to SEO-types that already have similar tools for doing this, but it may be useful to those that need their egos stroked on a regular basis.

The net result is pagerank2rss.pl, a Perl script that can be dropped in a web server's cgi-bin directory. It relies on curl, since that's available by default on Mac OS X, but with a few tweaks it could be made to use wget or LWP. The list of URLs to monitor is stored in the %pageRankURLs hash, along with their computed checksums (using this handy web interface). It is obvious that this is a 20-minute quick and dirty script - I could've computed the checksums myself and the list of URLs could be made accessible via a friendlier web interface. However, given the limbo-ish status of the methods that it uses, I figured it would not have been worth it to spend more time on something that might go away soon.

Spam Tidbits #

[Image: spam message with the subject "mcpherson"]

It seems I'm not the only one doing visualizations of my spam flow in an attempt to figure out a solution (my various attempts).

I have also been investigating my proposed solution (challenge/response) given my situation (several thousand messages a week). Some searching on freshmeat.net turned up Email Secretary, a Perl/MySQL/qmail/vpopmail implementation. Except for the last dependency, this matches my set-up, and thus getting it running should be doable. The fact that it's written in (reasonably clean) Perl will also hopefully make it easier to add the tweaks that I want (automatic deletion of messages with a SpamAssassin score above a certain threshold, white-listing based on more than just the sender's address, etc.)

I have also found a few essays on challenge/response systems, which seem to contain enough guidelines and real world experience to know what to avoid.

Finally, as the image shows, it appears that, as a respected scholar, you haven't really achieved the acclamation you deserve until your name is used to dilute the spam quotient (and baffle the readers) of junk mail.

Mail Scripting: Part Deux #

Now that I have determined the best (i.e. fastest) way to get Perl talking to AppleScript, it's time to actually get some data out of it. My first attempt was very simplistic: have a handler export all the data that I needed as a string (in this specific example, we're getting the list of mailboxes, but a similar approach can work for getting any sort of data out of the script):

on «event MailGMbx»
    set output to ""
    tell application "Mail"
        -- the full script sets these up elsewhere; define them here so the
        -- handler is self-contained
        set allMailboxes to mailboxes
        set mailboxCount to count of allMailboxes
        repeat with i from 1 to mailboxCount
            -- append each name (the original snippet overwrote output instead)
            set output to output & name of item i of allMailboxes & "\n"
        end repeat
    end tell
    return output
end «event MailGMbx»

This handler can be invoked with Mac::OSA::Simple's call method, and the result can be extracted by using Perl's split function. Some benchmarking revealed that this had reasonable performance for this particular case (there never are too many mailboxes), but that it degraded severely when dealing with more data. For example, getting a list of all the subjects, senders, dates, etc. of a mailbox with ~1000 messages took around 10 seconds. I initially thought that this was due to poor string handling in AppleScript (much in the same way that the String class in Java isn't meant for repeated appends, string-building tasks being left to StringBuffer instead). However, even not doing any appends in the loop (simply setting output to the current value) had similar behavior, performance-wise.
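
The Perl side of the string-based approach looks roughly like this; how the «event MailGMbx» code maps onto call()'s arguments is my reading of the module's documentation, so double-check it against Mac::OSA::Simple's POD:

use Mac::OSA::Simple qw(compile_applescript);

# $source contains the AppleScript text with the MailGMbx handler above
my $script = compile_applescript($source);

# Invoke the handler by its event class and ID ('Mail' + 'GMbx', matching
# «event MailGMbx»); the exact argument form is an assumption on my part
my $output = $script->call('Mail', 'GMbx');

# One mailbox name per line
my @mailboxNames = split(/\n/, $output);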

Since Mac::OSA::Simple's call function can handle lists as result values, I tried returning the name of every mailbox directly instead. Much to my surprise (or perhaps not - doing the iterations at a lower level was bound to be faster), that worked much better. In the mailbox case, the string-building approach took an average of 0.115 seconds per handler invocation, whereas returning a list directly took 0.0472 seconds. In even more demanding situations, such as the above-mentioned mailbox headers case, it was an order of magnitude faster.

Deviating a bit from the above, I also attempted a different method altogether. In my Googling yesterday, I came across Mac::Glue, an alternative way of bridging the Perl-AppleScript (or more correctly Perl-AppleEvents) gap. The module very cleverly loads the scripting dictionaries of applications, and creates a bit of glue code that allows interaction with AppleEvent objects directly from Perl, as described in this article. Thus there is no need to have an external script that is invoked from Perl; everything can be done from The One True Language. For example, the above mailbox-extracting case would be done as follows:

use Mac::Glue;

# Assumes the Mail glue has already been generated with gluemac; in the
# real code this block lives inside a handler, hence the return at the end
my $mailApp = new Mac::Glue 'Mail';

my @mailboxes = ();

# One AppleEvent round-trip to get the list of mailbox objects...
my @mailboxesAE = $mailApp->prop("mailboxes")->get;

my $i = 0;
for my $mailboxAE (@mailboxesAE)
{
    # ...and one more per mailbox to get its name
    my $mailboxRef = {id => $i++,
                      name => $mailboxAE->prop("name")->get};
    push(@mailboxes, $mailboxRef);
}

return @mailboxes;

However, it turns out that, despite the greater ease of use (and flexibility) that Mac::Glue allows, it's not going to be feasible to use it in my project, for two reasons. The first is that performance seems to be much worse; the above code runs at 1.97 seconds per call, an order of magnitude slower than even the string based method above. More importantly, it also has memory leak issues, with the above example losing ~2 megabytes per call (there were 52 mailboxes in the list). Since I will be using this from within a daemon process that has a long lifespan, this would not be acceptable.

The developer's journal has the answer as to why this is the case. Mac OS X changed the way data could be extracted out of AppleEvent descriptors (AEDescs). Rather than simply getting a handle to the descriptor's data portion, one must now allocate some memory and request that the data be copied there (with the original being out of reach). Mac::Glue, having started life as a Mac OS 9-era MacPerl module, does not fully take this into account yet. That is, it does the copying necessary, but it never disposes of the data once the Perl object goes out of scope or is otherwise garbage collected. Until this is fixed, this module is of limited use to me.

Perl, AppleScript and Mail.app: Prototyping and Benchmarking #

I am investigating a new product for Mscape, since working on the same thing for six years can get somewhat boring. This would involve exposing Mail.app's mailboxes and messages via an alternative GUI, so I needed to figure out the best way to get data out of it.

The most direct way would be as a plug-in (a.k.a. bundles). However, this seems to be an undocumented (and unsupported) approach that relies on private Mail.app methods. There are several plug-ins that exist in spite of this, and presumably their developers have a vested interest in keeping the reverse-engineered headers up to date, but there is no absolute guarantee that this approach will always work. Furthermore, there appear to be some complications with this approach, licensing-wise. Most of the plug-ins are open source, specifically under the GPL. I'm guessing that this prevents me from grabbing their headers and incorporating them into my (closed-source) product. httpMail appears to be under a BSD license and may thus be more permissive about such things, but it's still a messy thing that I'd rather not deal with.

An alternative approach would be to rely on Mail.app's AppleScript support, which has the advantage of being officially supported by Apple, with the trade-offs of lower performance and less functionality. Since, at least in the prototyping stage, I'll be working with Perl (I don't relish the idea of writing something entirely in AppleScript), I went off on the tangent of investigating the best way to tie these two scripting languages together. The simplest way would be to invoke the osascript program, much in the way that I did for my blogroll generator. However, this has some pretty significant overhead, specifically having to spawn another process and compile the script. Some searching turned up a post benchmarking the various Perl-AppleScript glue mechanisms. The fastest appeared to be the DoAppleScript method from the MacPerl package. However, the post mentioned that Mac::OSA::Simple supported pre-compiled scripts as well, something that was not tested. Using this functionality is straightforward: it is simply a matter of calling the compile_applescript function, and then later calling the execute() method on the resulting object. Incorporating this in the benchmark gives us the following test results (averaged over three runs):

runscript:       42.83 executions/sec
applescpt:       306.66 executions/sec
osascript:       312.33 executions/sec
doscript:        410.00 executions/sec
applescpt_comp: 1108.67 executions/sec

As can be seen, the compiled version is by far the fastest (the fast osascript times are deceiving; as the post mentions, its speed is in fact closer to that of runscript). However, the AppleScript snippet used for benchmarking is very simple (asking the Finder for the name of the startup disk). Running a more realistic script (one that iterates through Mail.app's mailboxes, picks one, and returns the subjects of the 131 messages contained within) gives the following results:

applescpt:      6.35 executions/sec
doscript:       6.63 executions/sec
applescpt_comp: 6.72 executions/sec

The spread is much tighter, and so any method should be satisfactory. However, the pre-compiled Mac::OSA::Simple approach is still preferable, since in addition to its slight performance advantage, it also provides some handy methods, like the ability to invoke specific handlers within a script. The only approach that might provide better performance and more flexibility would be to use AppleEvents directly, via Mac::AppleEvents::Simple, but this would be much more tedious, since I would have to build the events by hand.
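
For reference, the compile-once, execute-many approach looks roughly like the following; the Finder snippet matches the benchmark's, but the timing harness here is a simplified stand-in for the original benchmark script:

use Mac::OSA::Simple qw(compile_applescript);
use Time::HiRes qw(time);

# Compile once, up front...
my $script = compile_applescript(
    'tell application "Finder" to get name of startup disk');

# ...then execute repeatedly without paying the compilation cost again
my $iterations = 100;
my $start = time();
$script->execute() for 1 .. $iterations;
printf("%.2f executions/sec\n", $iterations / (time() - $start));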

YASC: Yet Another Spam Chart #

[Image: SA score histogram]

Gabriel Radic made the observation that it is indeed possible to make Mail.app differentiate between messages based on SpamAssassin score (in reference to my previous entry bemoaning the need for better filtering). Specifically, SA adds an X-Spam-Level header, whose contents are the message's score represented as asterisks (one point = one asterisk). By creating a rule that filters on this header (the "Edit Header List..." command allows filtering based on custom headers), it is possible to do things per score (e.g. color messages differently, or even delete them outright).

The latter possibility interests me the most at this time, since anything that reduces the number of messages I have to deal with helps (though this would still be a stop-gap solution until I have time to implement my challenge-response solution). The question then becomes: what should this threshold be set to? I hacked up a Perl script that uses Email::Folder to parse my Junk mbox and generate (via Excel) a histogram of spam scores. The figure shows the results, with the quartiles (roughly) highlighted. For starters, I have decided to set the threshold at 11, which (given this week's distribution) would remove about half the messages outright. This would still leave me with about 8,000 messages per week, so I may have to be even more stringent.
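
The score-counting part of that script boils down to something like the sketch below; the mbox path is a placeholder, and the real script went on to feed the counts into Excel rather than just printing them:

use strict;
use Email::Folder;

my $folder = Email::Folder->new('/path/to/Junk.mbox');  # placeholder path
my %histogram;

for my $message ($folder->messages)
{
    # SpamAssassin encodes the score as one asterisk per point
    my $level = $message->header('X-Spam-Level') || '';
    my $score = ($level =~ tr/*//);
    $histogram{$score}++;
}

printf("%3d: %d\n", $_, $histogram{$_}) for sort { $a <=> $b } keys %histogram;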

libopendaap and Mac OS X: follow-up #

It turned out that my hypothesis as to why libopendaap works on (x86) Linux but not on Mac OS X was correct - it is an endian issue. Thanks to a tip from jfpoole, the problem was traced to authentication/md5.c, specifically lines 47 - 51, which attempt to pick a byte reversal routine (if any). Upon running the ./configure script, WORDS_BIGENDIAN is #define'd to 1 (line 72). However, line 58 of authentication/md5.c does a #ifndef WORDS_BIGENDIAN test to see if the byte reversal routine should be set to a dummy one. By changing this to #if WORDS_BIGENDIAN and rebuilding libopendaap, one is able to successfully connect to servers. This still does not fix the auto-discovery issue (given its non-deterministic behavior, I still think it's a threading issue), so I'll still have to use CFNetServices for that.

On the Java front, One2OhMyGod became orphaned, and has since been resurrected as AppleRecords. A new release of the latter incorporates the necessary changes in order to support connections to iTunes 4.5 clients, thus this seems like a viable option as well. However, last I checked, building it seemed to be a pain (since it requires libraries like JRendezvous, a.k.a. JmDNS, and JavaLayer). I also haven't looked through the source to see how well separated the DAAP connection code is from the GUI and MP3 playback code.

My Plan For Spam #

[Image: chart of spam levels, showing an increasing trend]

A couple of months ago I posted an entry describing my spam filtering status. At that point, I was very happy to see that my weekly level had declined from ~4,000 to ~2,000 messages. That decrease coincided with a net-wide downturn in spam levels, and was additionally boosted by some tweaks I had done to my rules.

I'm sorry to say that the respite was purely temporary; as the chart shows, I'm now getting ~16,000 messages a week. Things have gotten to the point where filtering attempts are breaking down. For example, a recent spammer tactic is to stuff messages with random words in an attempt to dilute the overall "spammy-ness." Although, as described by Paul Graham in his FAQ, this won't get around Bayesian filters, it does have another side effect. With enough of these messages being received (i.e. in my situation), these filler words also begin to acquire a positive correlation with spam messages. With my corpus thus "poisoned," even completely innocent messages will be marked as spam.

The net result of this is that I've been forced to lower the weight of the Bayesian rule within SpamAssassin so that by itself it is not enough to bring a message's score over the spam threshold. I am now forced to depend more heavily on the other rules (that look for broken headers, certain words, etc.) that SpamAssassin provides. However, since these rules are universal (and publicly available), spammers can (and have begun to) tune their messages against them to make sure that none are triggered. This explains the decrease in effectiveness in the latter half of May.

I have also noticed other tricks, notably a couple involving the "Subject" header. I have set SA to prepend "**JUNK**" to the subject of any message over the threshold, and then, by sorting Mail.app's Junk folder by subject (I'm using its built-in filtering in combination with SA), I can see what was marked by SA. One way to work around SA's marking is to not have any Subject header at all, in which case SA appears to do no prepending. Another is to have two Subject headers, in which case SA modifies only one of them - and it just so happens that Mail prefers the opposite one. The net result in both cases is that these messages are not sorted with all of the spam, and thus I'm forced to check them by hand (since Mail.app is less forgiving than SA, those that SA doesn't mark as junk have to be checked by hand for false positives).

I could probably make some tweaks to make the situation more bearable (e.g., fix SA to deal properly with the Subject tricks, play around with the rule scoring a bit more, etc.), but that wouldn't work in the long run, and it would do nothing about the torrent of spam messages that ends up in my (Junk) mailbox. Therefore I have in mind a more drastic solution:

Ideally, this would involve shutting down mscape.com's email access entirely, and moving to another domain (e.g. this one). However, that's not feasible due to the number of copies of Iconographer that are floating around and contain email addresses at mscape.com (in addition to other places and people that may have addresses there). The next best thing is to switch to a challenge-response system, a la Mailblocks. It would probably have to be a home-grown solution, since there are a few tweaks that I'd want to make in order to make the transition as painless as possible. First of all, I'd want the entire thing to reside on my servers, since solutions that involve forwarding or redirection would mean an increase in traffic and generally seem rather brittle. I would also have to do extensive whitelisting, supporting not only addresses (ideally I'd upload my "Sent" mbox and extract people's emails from there) but also keywords (e.g. all emails containing "Iconographer" and similar words). I don't know if there are any open source solutions that I could build on, but if not, it should make a fun summer project.

Email-based Dictionary Service #

Short version:

Want to look up words on the go? Send an email (from your cellphone, smartphone, Blackberry, pager, etc.) to dictionary [at] this site's domain with your word in either the subject or the body, and in return receive the word's definition. Low-tech (no SOAP, XML-RPC, WAP, WS, etc. here), and yet handy and It Just Works™. Want to run the script yourself? See dictionary.pl.

Long version:

I have wanted to look up words in locations where a dictionary is inaccessible often enough that I have decided to do something about it. My cellphone is usually within reach, and in theory always has signal (in practice, I may need to switch from T-Mobile to Verizon in order to make that assertion hold). Modern phones may have a WAP browser or even a full web browser built in (especially those of the smartphone variety), but that's not something I can rely on (nor do I relish having to do anything with WAP). However, almost all phones sold within the past two years (mine included) have some kind of SMS functionality, and usually by extension some form of email access (via an SMS or MMS gateway, if not an outright email client). The ideal solution is therefore to have a service listen at some email address and, upon receiving a message, return the definition(s) of the word(s) contained within. This would be simple enough to implement and at the same time accessible to the most devices (Blackberry handhelds and two-way pagers included).

Leaving aside the "listening to an email address" part for the moment, we need to parse a received email, look up words, and send a message in return. Much in the same way that we approached the NNTP to RSS bridge, we will leverage existing Perl modules and only write some glue between them. Mail::Internet and Mail::Header allow us to do the message parsing, while Net::Dict allows us to interface with RFC 2229-compliant dictionary servers. Specifically, we will use dict.org, which has access to eleven dictionaries, but any other server would work just as well. Since we're trying to minimize message size (SMS gateways have ~140-character limits), we only pick one definition to return (dictionaries are ranked in order of preference), and do some clean-up to strip out unnecessary whitespace. dictionary.pl is a Perl script that does all this, plus some other clean-up (e.g. signature removal) and sanity checks.
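
Stripped of the clean-up, sanity checks, and dictionary ranking, the core of dictionary.pl looks something like the sketch below (it simply takes the first definition returned, and replying by email is reduced to a comment):

use strict;
use Mail::Internet;
use Net::Dict;

# The raw message is piped to the script on stdin (wiring this up is
# described below)
my $mail = Mail::Internet->new(\*STDIN);
my $word = $mail->head->get('Subject') || '';
$word = (@{$mail->body})[0] || '' unless $word =~ /\S/;
$word =~ s/^\s+|\s+$//g;

my $dict = Net::Dict->new('dict.org');
my $definitions = $dict->define($word);   # arrayref of [dictionary, definition] pairs
my $reply = @$definitions
    ? $definitions->[0][1]
    : "No definition found for '$word'";
$reply =~ s/\s+/ /g;                      # squeeze whitespace to keep the reply SMS-sized

# The real script mails $reply back to the sender; printing suffices for testing
print "$reply\n";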

Now we need to make it so that this script is invoked when a message at a specific address is received (with the contents of said email piped to it). This server happens to be running qmail, and so we can use a .qmail file for this. Specifically, I created the file .qmail-persistent:info-dictionary and within it I had the line |/fullpath/dictionary.pl (this discards the message after processing, but it is possible to add a second line with a mailbox path so that it would be received normally as well). Other email systems (sendmail, etc.) presumably have similar mechanisms, but I have not checked.

The net result is that I can now send email to dictionary [at] this site's domain and receive word definitions from wherever I am. An obvious extension would be to support SMS directly (thus broadening accessibility even more), but I'm not sure how exactly I'd approach that (there are free gateways for sending SMS, but receiving is seemingly trickier - I'd have to get a cellular modem presumably).

Apologies for having the email address as an image, but since I have gotten to the point of receiving 10,000+ spam emails a week, I am forced to resort to such extreme measures (the previously-observed decrease was very short-lived).

ImageMagick Tips #

As I continue working on the site that I developed the magnifier for, I've reached a point where the basic design is pretty much in place, and the time to automate the process has come. The 160 or so bindings are divided up into 26 cases. In addition to each case page, each binding has its own magnifier page (I realize that I could have only one magnifier page that takes the images to use as arguments, but this would have meant requiring either JavaScript or a CGI script, neither of which was an option). To generate the HTML as well as all the images, I turned to Perl and the Image::Magick module, which packages ImageMagick's functionality in a programmatically accessible manner.

Getting ImageMagick to perform quite the way I wanted required a few tweaks. I was using a pre-packaged build of ImageMagick for Mac OS X. However, this is version 5.5.7, while the latest version (of the main program as well as of the Perl module) is 6.0. Upgrading my ImageMagick install wasn't something that I relished doing, so the only alternative was to get the older version of the module (since I'm not sure if it's possible to do this through the CPAN shell, I downloaded and installed it by hand). The module requires a few libraries to be present, specifically libjpeg, libpng, libtiff, and liblcms. This guide details how to build and install the first three. Note that it points you directly to downloadable archives; browsing through the linked sites may yield more recent versions. liblcms (Little Color Management System) is also downloadable and builds fine without any modifications. Installing all of these libraries in a consistent location such as /usr/local is helpful when tracking them down later. After they are all in place, Image::Magick should build and install fine.

There is some documentation for Image::Magick, and although it could be a bit more thorough (e.g. what some of the more obscure scaling filters actually do), it's generally sufficient. One issue I encountered when saving JPEGs was that their quality was abysmal when compared to Photoshop's (i.e. at the same file size, compression artifacts were much more visible). The library itself doesn't do the compression; rather (as expected) it relies on libjpeg for this. As pointed out by Alexey, other programs that also use libjpeg, such as the GIMP, save JPEGs that are comparable in quality to Photoshop's. The difference appeared to be that ImageMagick didn't make use of libjpeg's optimized mode (as specified by the optimize_coding flag). The final solution was to have Image::Magick save the file in a lossless format such as PPM, and then use the cjpeg CLI utility (provided as part of libjpeg) to do the JPEG compression. It supports varying levels of quality, optimization, and precision, and thus, with the right amount of tweaking, provides much better results. The actual code looks something like this:

sub SaveJPEG
{
    my ($image, $path, $quality) = @_;

    # Write a lossless intermediate file, then let cjpeg do the compression
    $image->Write('/tmp/out.ppm');

    `cjpeg -optimize -dct float -quality $quality /tmp/out.ppm > "$path"`;
}