Despite the fact that the regular readership of this little weblog probably comprises fewer than a dozen people altogether, I get a rather astonishing amount of comment spam. For those of you unfamiliar with “comment spam,” the idea is that advertisers pay unscrupulous programmers (spammers) to try to post comments in response to blog postings, and those comments contain the text of their advertisements along with links to their web sites. These fake “comments” have little or nothing to do with the content of the blog posting itself; the point is simply to put advertising copy in front of readers’ eyeballs.
Comment spam is kind of annoying for the owner of the blog. Since many blogs are based on the same underlying blogging software, these spam postings are usually automatically generated by software agents called “spambots,” whose job it is to find web logs, identify their postings, and post these fake comments in reply. You could avoid the problem completely if you simply turned off the commenting feature, but then you couldn’t get comments from your human readers either, and a lively discussion is at least half the fun of blogging. If you want to permit real human posters to continue to write comments, without succumbing to spambots, you have to be a little more clever.
One solution is to have a human act as a moderator for all comments. In other words, somebody has to basically “take one for the team,” and read each comment that is posted to determine whether it is worthy of appearing on the site. It’s usually not too hard to distinguish on-topic human prose from spam comments, so this method is quite reliable. But it requires a lot of boring and tedious effort on the part of a human, for very little reward. Your prize, if you moderate perfectly, is a blog without any spam. Not exactly the most exciting possible outcome. Still, it’s quite effective, so many bloggers do at least some moderation of their comments, myself included.
Another solution is to try to get the computer to moderate for you. It’s hard for a computer to know a priori whether there’s a human on the other end of a comment posting, and it’s practically impossible for a computer to assess whether a comment is “on topic.” But if you can require the comment-writer to answer some question or take some action before posting that would be difficult for a computer program to accomplish, you can at least draw a line in the sand between the human readers and the software spambots. Such is the reasoning behind the unfortunately named “CAPTCHA,”* invented by researchers at CMU and IBM back in 2000. A typical CAPTCHA implementation generates an image containing text characters that have been distorted in such a way that a human should be able to work out what the characters are, but a machine would have a really hard time of it. A neat idea, but not so great for blind people who rely upon the computer to read the screen for them. It also turns out that software text recognition has gotten good enough that text sufficiently distorted to fool a program is often illegible to human readers too.
Others have tried using Bayesian text classification to distinguish spam comments from real ones. This still requires some effort from humans to provide a body of training data to calibrate the filter (e.g., “these comments are good ones, these other ones are spam”), but seems to work pretty well in practice. Still, such filters can often generate false positives (“good” comments that get mislabelled as spam) and false negatives (“spam” comments that get mislabelled as good). False positives are particularly annoying for normal users, and can stifle the free flow of discussion.
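To make the idea concrete, here is a minimal sketch of a naive Bayes comment filter in Python. It is only an illustration under simple assumptions: the class name `NaiveBayesFilter`, the word tokenizer, the add-one smoothing, and the zero threshold are all choices made for this example, not a description of any particular anti-spam product.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a comment into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesFilter:
    """Hypothetical two-class filter: 'spam' versus 'ham' (good comments)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one human-labelled comment as training data."""
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def spam_log_odds(self, text):
        """Return log(P(spam | text) / P(ham | text)) under the naive model."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        # One extra slot is reserved for words never seen in training.
        v = len(vocab) + 1
        spam_total = sum(self.word_counts["spam"].values())
        ham_total = sum(self.word_counts["ham"].values())
        # Smoothed prior from the number of labelled comments in each class.
        score = math.log((self.doc_counts["spam"] + 1) / (self.doc_counts["ham"] + 1))
        for word in tokenize(text):
            p_spam = (self.word_counts["spam"][word] + 1) / (spam_total + v)
            p_ham = (self.word_counts["ham"][word] + 1) / (ham_total + v)
            score += math.log(p_spam / p_ham)
        return score

    def is_spam(self, text, threshold=0.0):
        """Raising the threshold trades false positives for false negatives."""
        return self.spam_log_odds(text) > threshold


spam_filter = NaiveBayesFilter()
spam_filter.train("cheap pills buy now", "spam")
spam_filter.train("buy cheap watches online", "spam")
spam_filter.train("great post about caterpillars", "ham")
spam_filter.train("I disagree with your point about moderation", "ham")
```

The `threshold` parameter is where the false-positive problem shows up: setting it above zero makes the filter more reluctant to call a comment spam, at the cost of letting more spam through.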
Various other tricks have been tried that rely upon the assumption that spambots are written by money-grubbing morons who are unlikely to have included JavaScript support in their spam distribution tools. Since virtually every modern web browser used by humans does support JavaScript, you can gain some leverage over the spambots by requiring that the poster run some bit of complex JavaScript code and return the resulting value to the server before a comment will be accepted as legitimate. Since most of the JavaScript code embedded in a web page will be executed quietly by the browser, without intervention from the human who is driving, this is an appealing strategy for site designers. One solution, based on Adam Back’s Hashcash approach (which is based in turn upon the work of Cynthia Dwork and Moni Naor), requires the client to compute and return a one-way hash value from the contents of the current transaction, so you cannot simply precompute the value and post it many times, even to the same blog site.
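The proof-of-work idea can be sketched roughly as follows. This is an illustration in Python rather than the JavaScript a browser would actually run, and it is not the real Hashcash stamp format: Hashcash proper uses SHA-1 and its own header layout, whereas this sketch assumes SHA-256, a made-up `transaction:nonce` string, and hypothetical function names (`mint`, `verify`).

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def stamp_digest(transaction, nonce):
    """Hash the transaction contents together with a candidate nonce."""
    return hashlib.sha256(f"{transaction}:{nonce}".encode("utf-8")).digest()


def mint(transaction, difficulty_bits=12):
    """Client side: search for the first nonce that meets the difficulty."""
    for nonce in count():
        if leading_zero_bits(stamp_digest(transaction, nonce)) >= difficulty_bits:
            return nonce


def verify(transaction, nonce, difficulty_bits=12):
    """Server side: a single hash is enough to check the client's work."""
    return leading_zero_bits(stamp_digest(transaction, nonce)) >= difficulty_bits


# Hypothetical transaction string: the post being replied to plus the comment body.
transaction = "post=42|author=alice|body=Nice article"
nonce = mint(transaction)
```

The asymmetry is the point: the client needs on the order of 2^12 hashes on average at this difficulty, while the server needs one. Because the transaction contents are part of the hashed string, a nonce found for one comment is very unlikely to be valid for a different one.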
Until recently, I had adopted a fairly simple strategy for coping with comment spam: I moderate comments, and anybody whose comments have been accepted in the past is automatically accepted in the future. Everything else gets stuck in the moderation queue until I have time to browse through it. This was fine at first, since the number of regular commenters on my blog is small, and almost everything else is comment spam. But scanning the queue started taking too long: some days there were upwards of seventy-five new comments to moderate, and only twice in the past six months has a comment from a legitimate human user of the site been held for moderation.** Lately, the spammers have resorted to posting really long comments, too, which means that scanning the queue takes more effort. So I finally decided something had to be done.
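The moderate-once-then-trust policy described above amounts to something like the following sketch. The `ModerationQueue` class and its method names are hypothetical, and it assumes commenters can be identified by some stable key such as an e-mail address, which is an assumption of this example rather than a detail of any real blogging software.

```python
class ModerationQueue:
    """Hold comments from unknown authors; publish trusted authors at once."""

    def __init__(self):
        self.trusted = set()   # authors with at least one approved comment
        self.pending = []      # (author, text) pairs awaiting a human decision
        self.published = []    # (author, text) pairs visible on the site

    def submit(self, author, text):
        """Publish immediately if the author is trusted, otherwise hold."""
        if author in self.trusted:
            self.published.append((author, text))
            return "published"
        self.pending.append((author, text))
        return "held"

    def approve(self, index):
        """Moderator accepts a held comment; its author becomes trusted."""
        author, text = self.pending.pop(index)
        self.trusted.add(author)
        self.published.append((author, text))

    def reject(self, index):
        """Moderator discards a held comment; the author stays untrusted."""
        self.pending.pop(index)
```

The weakness is visible in the structure: every spam comment lands in `pending`, so the moderator's workload grows with the volume of spam rather than with the number of real readers.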
What I’ve done now, therefore, is to add a simple text-based challenge-and-response field to the comment posting area. The challenges and responses are very simple and quite easy for a human to answer, but a machine could not reasonably do so without being told the answers in advance. Since I made up all the questions and answers myself, these are not challenges that a spammer could simply download off a web site somewhere and install into his spambots. For example, you might be asked a question like: “How many spots are there on a five-spotted caterpillar?” The answer, of course, is “five.” Any human reading this could answer it, and it takes almost no extra time to do. Unlike a purely graphical CAPTCHA, this challenge-response mechanism could easily be used by blind or deaf users, provided they can read and understand English. But short of including a sophisticated natural-language processing system, a spambot will be unable to reliably answer this question correctly (and, let’s face it, if spammers had got that sophisticated a natural-language system, they could make much more money selling it as a product in its own right, than sponging off the leavings of the bottom 0.1% of the economic gene pool by spamming).
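A text challenge of this kind needs very little machinery. The sketch below is one way it might look in Python and makes no claim to match the actual plugin: the function names are hypothetical, the second question is invented for illustration, and accepting the digit “5” as well as the word “five” is an assumption of this example.

```python
import random
import string

# Each entry pairs a question with the set of answers that will be accepted.
CHALLENGES = [
    # This question is the example given in the text above.
    ("How many spots are there on a five-spotted caterpillar?", {"five", "5"}),
    # Hypothetical extra question, made up for this sketch.
    ("What colour is a red wheelbarrow?", {"red"}),
]


def pick_challenge(rng=random):
    """Choose a question at random; the index is kept server-side."""
    index = rng.randrange(len(CHALLENGES))
    return index, CHALLENGES[index][0]


def normalize(answer):
    """Ignore case, surrounding whitespace, and surrounding punctuation."""
    return answer.strip().strip(string.punctuation).strip().lower()


def check_answer(index, answer):
    """Accept the comment only if the reply matches a known answer."""
    return normalize(answer) in CHALLENGES[index][1]
```

Being forgiving in `normalize` matters more than it might seem: a human who types “ Five. ” has clearly passed the test, and rejecting that reply would recreate the false-positive problem the challenge was meant to avoid.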
If you are a regular reader of (and commenter upon) this blog, I hope that you will let me know how this new setup works for you. My hope is that the overhead of answering such a question will not be sufficient to drive you mad, and that the resulting system will cut down the size of the moderation queue to a more rational level.
Update 22-Nov-2006
So far, a pretty good sign for the success of this new plugin is that the moderation queue had only five messages in it this morning, as opposed to the typical forty-five or more. Since those five presumably got past the challenge question, this suggests to me that at least some of the spam drivers are using stables of human posters, as many of the east Asian gold-farming outfits do. I guess that’s a point in favour of the Akismet approach.