"…and not for five minutes will I be distracted from the wonder…"

Comment Spam and PHP’s similar_text()

Uncategorized — d-ashes on August 21, 2005 at 8:22 pm

If you’ve wandered by the site in the last month or so (not that
I’ve given you much reason to, admittedly) then you may have noticed
that I now have a small comment spam problem. I’m only getting between
7 and 10 a week, but it is annoying and will probably only increase. So
this morning I got up and started looking at ways that PHP can
automatically filter out at least some of the spam I’m receiving. The
most prevalent method I’ve found is to create an array of keywords to
search for; when a submitted comment contains any of those keywords, it
is identified as spam. I have two problems with this method:

  • It requires you to manage the list of keywords. Let’s face
    it, I’m lazy, and I don’t want to have to go digging through code for
    that array every time that I want to add a new banned word. I could
    store the words in a database table instead, but that would require me
    to manage that table either manually or by writing a section for it in
    the admin panel.
  • Based on the type of comment spam I’m getting (mostly gambling
    sites) and the company of friends that I keep, I would very likely
    block a legitimate post at some point in the future. I like to play
    poker and sometimes mention either taking the table or losing my ass
    in a post, and the person I lost my ass to usually takes the
    opportunity to put salt in the wound. What good is a spam filter if
    it’s filtering the wrong things (though this does beg for developing
    a comment politeness filter)?
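
To make the second problem concrete, here is a minimal sketch of the
keyword method. The word list, the containsBannedWord() name and both
sample comments are made up for illustration; none of them come from a
real filter.

```php
<?php
// Hypothetical keyword list; a real one would depend on the spam received.
$bannedWords = ['poker', 'casino', 'blackjack'];

/**
 * Returns true if the comment contains any banned keyword (case-insensitive).
 */
function containsBannedWord(string $comment, array $bannedWords): bool
{
    foreach ($bannedWords as $word) {
        if (stripos($comment, $word) !== false) {
            return true;
        }
    }
    return false;
}

$spam  = 'Play online poker and casino games now!';
$legit = 'I lost my ass at poker last night. Rematch on Friday?';

var_dump(containsBannedWord($spam, $bannedWords));   // bool(true)
var_dump(containsBannedWord($legit, $bannedWords));  // bool(true), a false positive
```

Both comments get flagged, which is exactly the false positive
described in the second point.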

So, with the idea that I didn’t want to have to constantly manage
the comment spam catcher and that I wanted it to possess a bit of
‘intelligence’, I opted to compare the text of each incoming comment
with all the messages already flagged as comment spam (it’s a
true/false field stored in the comments table of my database). If the
new comment is similar enough to previous spam, it is marked as spam
and not published; otherwise it is published as normal. Every comment
the filter catches therefore joins the bank of messages to search
against in the future, so the bank maintains itself.

Here’s the function I wrote and included in the ‘add comment’ script. It’s pretty simple stuff, really:

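// Uses the legacy mysql_* extension (removed in PHP 7.0); on current PHP,
// port these calls to mysqli or PDO.
function commentSpam($new_comment) {
	// Fetch only the text of comments already flagged as spam.
	$sql='SELECT comment FROM comments WHERE spam=1;';
	$result=mysql_query($sql) or die(mysql_error());
	// mysql_num_rows() replaces the deprecated alias mysql_numrows(); count once.
	$count=mysql_num_rows($result);
	for($i=0;$i<$count;$i++) {
		$spam_comment=mysql_result($result,$i,'comment');
		// similar_text() returns the number of matching characters.
		$compare=similar_text($new_comment,$spam_comment);
		// Flag as spam at a score of 100 or more.
		if ($compare>=100) {
			return true;
		}
	}
	return false;
}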

You pass the text of the incoming comment as a parameter and the
function looks up the text of all existing comments marked as spam.
Then, for each spam comment already in the database, it compares that
text to the new comment’s text with the similar_text() function. This
function returns the number of matching characters in the two strings,
which I’m treating as a ‘similarity score’. I ran the function on the
text of all of my comments and found that there was a very wide margin
between the non-spam comments and the spam comments I’d already
received. In my test, spam comments returned a score of 150 or more
(more similar), with most of the scores in the 300 range. Legitimate
comments usually returned a similarity score of no more than 30 (less
similar). For the time being I have the function flagging a comment as
spam if it returns a score of 100 or more, which is pretty strict and
may be made less strict (therefore a higher number) as the function is
tested in the real world.
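
For reference, here is a small sketch of what similar_text() hands
back. The return value is a raw count of matching characters, so it
grows with the length of the strings; the optional third argument
receives a length-normalised percentage instead. The
looksLikeKnownSpam() helper, its array-of-strings input and the 70
percent cut-off are assumptions for illustration, not values tested on
this site.

```php
<?php
// Raw score: the number of matching characters found in both strings.
$matches = similar_text('World', 'Word', $percent);
echo $matches, "\n";            // 4
echo round($percent, 2), "\n";  // 88.89

// The raw score depends on length: a string compared with itself
// scores its own length, so long comments score high more easily.
$short = 'Nice post!';
echo similar_text($short, $short), "\n";  // 10

/**
 * Hypothetical variant of the spam check that uses the percentage
 * rather than the raw count. $knownSpam is a plain array of strings.
 */
function looksLikeKnownSpam(string $comment, array $knownSpam, float $minPercent = 70.0): bool
{
    foreach ($knownSpam as $spam) {
        $percent = 0.0;
        similar_text($comment, $spam, $percent);
        if ($percent >= $minPercent) {
            return true;
        }
    }
    return false;
}
```

Whether a percentage separates spam from real comments as cleanly as
the raw scores did in the test above would need checking against real
data.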

As with any spam countermeasure, this method certainly has its limitations.

  • It can only filter based on the comment spam that you’ve
    already received. As more diverse spam comes your way, the filter
    will probably miss a few posts that you’ll have to mark as spam
    manually before it can catch similar ones. Really, though, if this
    thing will flag 3 out of every 5 spam comments automatically, I’ll
    be happy.
  • It still stores comment spam in your database. If you have a
    high-traffic site that is getting a large amount of comment spam
    daily, then this will continue to inflate the size of your database.
    A possible solution there is to write a function that goes through
    comments marked as spam and removes duplicate entries or entries
    with extra high similarity scores, since you’d only need one of
    those spam comments in the future to test against an incoming
    comment.
  • similar_text() is expensive. From what I’ve read, it is a heavy
    draw on the server’s processing power. Looping through each existing
    spam comment and running similar_text() for each one multiplies that
    cost by the number of stored spam comments. It ran quite quickly on
    my server going through about 20 existing spam records, but the
    server load will increase as the number of spam comments in the
    database increases. Some possible fixes are:

    • Put a LIMIT restriction in the SQL statement in the above
      function (and maybe also sort by the date submitted to get more
      recent spam attempts) so that it only fetches a certain number of
      existing spam records. This does limit the number of possible
      matches that the function can make to existing spam.
    • Nest the function in other logic that uses less taxing methods
      (like the keyword search mentioned above) to identify possible
      ‘problem comments’, and only run the function that uses
      similar_text() if the comment text meets those preliminary
      criteria. This means you wouldn’t necessarily have to run the
      taxing code for every comment received, but it does mean you have
      to be diligent in managing your keyword list.
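
Two of the ideas above, pruning near-duplicate spam and putting a cheap
check in front of similar_text(), can be sketched without touching the
database. Everything here is an assumption for illustration: the
function names, the plain arrays standing in for database rows, and the
90 and 70 percent thresholds. The LIMIT idea only appears as a comment
because the date column name is a guess.

```php
<?php
// LIMIT idea, assuming a hypothetical `date_submitted` column:
//   SELECT comment FROM comments WHERE spam=1
//   ORDER BY date_submitted DESC LIMIT 50;

/**
 * Keep one representative of each group of near-identical spam comments.
 */
function pruneSpamCorpus(array $spamComments, float $dupPercent = 90.0): array
{
    $kept = [];
    foreach ($spamComments as $candidate) {
        $isDuplicate = false;
        foreach ($kept as $existing) {
            $percent = 0.0;
            similar_text($candidate, $existing, $percent);
            if ($percent >= $dupPercent) {
                $isDuplicate = true;
                break;
            }
        }
        if (!$isDuplicate) {
            $kept[] = $candidate;
        }
    }
    return $kept;
}

/**
 * Cheap keyword check first; similar_text() only runs if it matches.
 */
function twoStageSpamCheck(string $comment, array $keywords, array $spamCorpus, float $minPercent = 70.0): bool
{
    $suspicious = false;
    foreach ($keywords as $word) {
        if (stripos($comment, $word) !== false) {
            $suspicious = true;
            break;
        }
    }
    if (!$suspicious) {
        return false; // skip the expensive comparison entirely
    }
    foreach ($spamCorpus as $spam) {
        $percent = 0.0;
        similar_text($comment, $spam, $percent);
        if ($percent >= $minPercent) {
            return true;
        }
    }
    return false;
}

$corpus = pruneSpamCorpus([
    'Win big at online casino poker tonight',
    'Win big at online casino poker tonight',
    'Cheap meds shipped overnight',
]);
echo count($corpus), "\n"; // 2: the exact duplicate is dropped
```

Pruning compares every stored comment against every kept one, so it is
itself costly and is better suited to an occasional clean-up job than
to the ‘add comment’ script. In the two-stage check a keyword hit only
triggers the comparison; it never flags a comment on its own, so a
friendly poker comment still has to resemble known spam to be blocked.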

Anyways, I started running the filter today and will post again in
the next couple of weeks on how good a job it is doing. I haven’t
seen this method mentioned at all when searching for PHP solutions to
fight comment spam, which is surprising to me, as it seems pretty
effective in theory and in tests. If any of you fellow geeks know a
good reason why this method isn’t very popular or if you have any
suggestions for improving the function, please let me know.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. | Ashes & Water