Spam Filtering

When I put up this site (and its associated domain) I had the opportunity to start over clean with regard to the web propagation of email addresses. My main address is all over the web and receives a ridiculous amount of spam, as will be shown shortly. The address I chose (shown in the sidebar) was brand new, so I could test the efficacy of various encoding methods. This report suggested using HTML entities instead of plain text to encode email addresses, and using this encoder I followed suit.
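For readers unfamiliar with the technique, here is a minimal sketch of entity encoding in Python. It is not the encoder linked above; the function name `encode_address` and the address `user@example.com` are placeholders invented for illustration. Each character is written as a decimal character reference, which a browser renders normally but which a naive crawler searching for a literal `@` will miss.

```python
import html

def encode_address(address: str) -> str:
    """Write every character of the address as a decimal HTML entity."""
    return "".join(f"&#{ord(ch)};" for ch in address)

# Placeholder address, not a real one.
encoded = encode_address("user@example.com")

# The encoded form can be used both in the href and as the link text.
link = f'<a href="mailto:{encoded}">{encoded}</a>'

# A browser (or html.unescape) turns the entities back into the address.
assert html.unescape(encoded) == "user@example.com"
print(link)
```

Note that any crawler that decodes entities before matching will still find the address, so this only raises the bar slightly.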

Today I received my first spam at this domain (incidentally a Nigerian variant). This could be the result of spammer crawlers getting smarter, or of my carelessness in leaving a comment with my email address in it. The former is rather unlikely: logic would have it that spammers should not bother to go after anyone who goes out of their way to make their address un-crawlable, there being little chance of said person wanting to buy anything the spammer has to offer. However, even though the second option seems more plausible in the long run, I'd be very surprised if the address was picked up less than two days after the page went up. Google found it, but I have the feeling that spammer spiders don't have the adaptive crawling frequency that Google has engineered. In any case, to be safe rather than sorry, I've switched to showing my address as an image. This still doesn't take care of the comment page link, so I may have to get a different address, but that remains to be seen.

[Chart of spam levels]

Also on the topic of spam, today was my weekly spam cleanup day. I rely on a combination of SpamAssassin (provided by my hosting provider) and my mail client's built-in filtering. Together they take care of nearly everything, the high accuracy no doubt being due to the large corpus that I have built up. Every week I go through my junk mail folder, check for false positives (SpamAssassin is set to a high enough threshold that I have only seen a couple thus far, but the mail client's filter has its weak moments), and then upload all of the certified spam messages to my server so I can train SpamAssassin's Bayesian filter on them. While I'm at it, I also keep track of the total amount of spam that I get and how well SA fared.
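The weekly bookkeeping can be sketched in a few lines of Python. The `WeeklyTally` class, its field names, and the numbers are all hypothetical, chosen only to show how a catch rate for SpamAssassin might be computed from the weekly counts; they are not the actual data behind the chart.

```python
from dataclasses import dataclass

@dataclass
class WeeklyTally:
    spam_total: int          # all spam received that week, however it was caught
    spam_caught_by_sa: int   # spam that SpamAssassin flagged
    false_positives: int     # legitimate mail found in the junk folder

    def catch_rate(self) -> float:
        """Fraction of the week's spam that SpamAssassin flagged."""
        if self.spam_total == 0:
            return 0.0
        return self.spam_caught_by_sa / self.spam_total

# Made-up numbers for illustration only.
week = WeeklyTally(spam_total=400, spam_caught_by_sa=380, false_positives=1)
print(f"SA caught {week.catch_rate():.1%} of this week's spam")
```

Keeping false positives as a separate count matters because a filter can reach a high catch rate simply by flagging too much.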

There have been a couple of key events, as the chart shows (I don't always meet my weekly deadline, which explains the uneven spacing between data points). At some point in mid-February my host upgraded SA to version 2.63 (I didn't take note of what they were running before), and accuracy went way up. I expect that as time goes on spammers will start to tune their messages against this newer release too, and accuracy should go down again, but I have yet to see it happening. The second event was a conscious effort on my part to filter truly bogus email accounts. Anything sent to the domain ends up in my mailbox, and over the years not only have I gotten things that were not meant for me, but I have also been stuck with other people's spam. After automatically deleting the most obvious offenders (before they reached SA) I saw another sharp drop. This coincided with AOL reporting decreased spam levels in mid-March, so I'm sure part of the decrease is due to that, but still, some credit must be given to my measures. The overall level is still ridiculous, and if so many copies of Iconographer with my email address weren't out there, I would probably drop that address, but as of now I have little choice.
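A pre-filter of this kind could look roughly like the following Python sketch. It is only an assumption about the shape of such a rule, not the setup actually used: the domain `example.com`, the set `BOGUS_LOCAL_PARTS`, and the function `should_discard` are hypothetical. The idea is to drop a message before it reaches SpamAssassin when every recipient at the catch-all domain is a known bogus account.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical list of accounts that never existed on the catch-all domain.
BOGUS_LOCAL_PARTS = {"sales", "webmaster", "info"}

def should_discard(raw_message: str, domain: str = "example.com") -> bool:
    """Return True if all recipients at our domain are known bogus accounts."""
    msg = message_from_string(raw_message)
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    suffix = "@" + domain
    ours = [
        addr.lower()
        for _, addr in getaddresses(headers)
        if addr.lower().endswith(suffix)
    ]
    # Keep the message if we cannot tell who it was for.
    if not ours:
        return False
    return all(a.rsplit("@", 1)[0] in BOGUS_LOCAL_PARTS for a in ours)

print(should_discard("To: sales@example.com\nSubject: hi\n\nbody"))
```

A message addressed to both a bogus account and a real one is kept, which errs on the side of not losing legitimate mail.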
