In How Many Ways Can an URL be Mispelled?

As I was looking through this site's access logs this week, I noticed a lot of failed requests that looked like attempts to reach my Gmail skinning post. The strange thing was that these 404s had no referrer, came from diverse user agents and IPs, and were attempted between one and five times. At first I suspected a (shady) crawler run amok, but the variety of IPs made that unlikely. I then briefly wondered whether a worm could be responsible, but if my little site was getting so many requests, then presumably other people would've noticed it as well. Since most of the failed requests were one or two characters off from the real URL, I wondered if traffic was getting subtly corrupted. However, no relevant outages were mentioned, and since the requests were spread out over a few weeks, it's unlikely that such a thing would've gone unnoticed.

For the curious-minded, the relevant access log snippets are here. Below are the top 10 (by frequency) 404-causing requests:

  1. 509: /archives/2004/10/05gmail-skinning
  2. 160: /archives/2004/10/05gmailskinning
  3. 20: /archives/2004/10/05/gmail-skinning/
  4. 16: /archives/2004/10/05/gmailskinning
  5. 14: /archives/2004/10/05gmail-skinning.com
  6. 12: /archieves/2004/10/05gmailskinning
  7. 10: /archieves/2004/10/05gmail-skinning
  8. 9: /archives/2004/10/05gmail_skinning
  9. 9: /archives/2004/10/5/gmail-skinning
  10. 8: /archives2004/10/05/gmail-skinning
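
A tally like the one above can be produced with a few lines of Python. This is a minimal sketch, not the script actually used here: it assumes the server writes the Apache "combined" log format (the post does not say which format is in use), and the sample lines, IPs, and the `top_404_paths` name are made up for illustration.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format, e.g.
#   IP - - [date] "GET /path HTTP/1.1" 404 512 "referrer" "user agent"
LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')

def top_404_paths(lines, limit=10):
    """Return the `limit` most frequently requested paths that got a 404."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("status") == "404":
            counts[match.group("path")] += 1
    return counts.most_common(limit)

# Hypothetical log lines, invented for this example.
sample = [
    '192.0.2.1 - - [01/Jun/2005:10:00:00 -0700] "GET /archives/2004/10/05gmail-skinning HTTP/1.1" 404 512 "-" "Mozilla/4.0"',
    '192.0.2.7 - - [01/Jun/2005:10:05:00 -0700] "GET /archives/2004/10/05gmail-skinning HTTP/1.1" 404 512 "-" "Opera/8.0"',
    '192.0.2.9 - - [02/Jun/2005:11:00:00 -0700] "GET /archives/2004/10/05gmailskinning HTTP/1.1" 404 512 "-" "Mozilla/5.0"',
    '192.0.2.3 - - [02/Jun/2005:12:00:00 -0700] "GET /archives/2004/10/05/gmail-skinning HTTP/1.1" 200 9000 "-" "Mozilla/5.0"',
]

for path, count in top_404_paths(sample):
    print(f"{count}: {path}")
```

For a real log, the same function can be fed an open file object instead of the `sample` list, since it only needs an iterable of lines.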

The only theory that is consistent with all the facts is that the post was mentioned in some print publication, and that the printed URL was incorrect. That would explain the lack of referrers and the diversity of IP addresses and user agents. I assume the printed URL itself was wrong because the most popular failed request contains an unlikely typo: users are sensitive to slashes and would not miss one. The second most popular failed request shows a more natural mistake, skipping a hyphen.

I've now added a redirect from the top two items in the list above, though it may be too late. But more importantly, despite my attempt at clean URLs, this shows that they are not as friendly as they could be. The year/month/day hierarchy may be the cleanest, but I could probably get away with just the year, since my entry keywords rarely collide. Perhaps more modern blogging software than my 2003-vintage Movable Type 2.64 installation can do better, but this incident doesn't provide the activation energy to investigate further.
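
Rather than listing misspellings one by one, a redirect could in principle be computed by fuzzy-matching the requested path against the known entries. The following is only a sketch of that idea, not what this site does: `KNOWN_PATHS`, `redirect_target`, and the 0.9 cutoff are all assumptions chosen for illustration, and the matching relies on `difflib.get_close_matches` from the Python standard library.

```python
from difflib import get_close_matches

# Hypothetical list of real entry paths; a real site would build this
# from its archive index.
KNOWN_PATHS = [
    "/archives/2004/10/05/gmail-skinning",
]

def redirect_target(requested_path, cutoff=0.9):
    """Return the closest known path for a near-miss request.

    Returns None when the path is already correct or when nothing is
    similar enough (similarity ratio below `cutoff`).
    """
    if requested_path in KNOWN_PATHS:
        return None  # exact hit, no redirect needed
    matches = get_close_matches(requested_path, KNOWN_PATHS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(redirect_target("/archives/2004/10/05gmail-skinning"))
print(redirect_target("/favicon.ico"))
```

The cutoff is a trade-off: set it too low and unrelated 404s get redirected to the wrong entry, which becomes more likely as the number of entries with similar keywords grows.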

Update on 6/10/2005: It turns out that the entry was indeed mentioned in the June 2005 issue of Popular Science. Yay for deduction.
