I hate spammers. And I hate people who help the spammers, whether it's providing hosting for their content or emails, providing or shipping their products, or Windows users who don't secure their machines and end up providing web or DNS hosting, or actually sending the spam, without even realizing it. Worst of all, I hate the IDIOTS who actually BUY things from the spammers, because they're the ones who keep the spammers in business- if nobody were buying, it wouldn't be worth their time to send spam.
One of the tricks used by spammers is to "harvest" email addresses off of web pages. For example, my spamcop address is at the bottom of this page, and as a result, that address gets anywhere from 80 to 200 messages per day- and out of that total, on a good day, FIVE messages might be legitimate. The rest of the messages are all spam.
Another thing I see a lot of is abuse of web sites with "guest book" or other forms which email data to other places. Spammers will submit bogus data, either to try and trick the form into sending their spam for them, or just to put the spam message into the form itself, hoping the human who ends up reading the message will see their stuff and maybe be interested.
In many cases, spammers will run both "harvesters" and form-spammers, but not from the same IP addresses. For example, they may use a zombie Windows machine on a cable modem in California to run a harvester, try any forms it finds from a machine in China, and control the whole thing from an apartment in Romania. And while it's easy to catch the IPs which are sending the spam, it's not so easy to catch the IPs which are harvesting- because they're just looking at your web pages like anybody else.
There is an anti-spam project out there called Project Honeypot. If you own any domains, or have a web site which can host a script, I strongly encourage you to read their web site and, if possible, contribute by hosting a honeypot page and/or donating one or more MX records (I'm hosting a honeypot page and have donated multiple MX records, since I own multiple domains.)
Project Honeypot consists of a network of web pages which contain one or more made-up email addresses, in the hope that the harvesters will find them and try to send spam to them. Of course, the email addresses themselves are not, and have never been, valid, and in fact did not exist until the honeypot generated them, so anything which is ever sent to them must be spam.
The way it works is this- the owner of a web site, like myself, installs a script on the site. When a harvester triggers the script (by visiting its URL) the script sends the harvester's IP address to Project Honeypot's server, which then generates a unique random email address, stores it in a database with the harvester's IP, and returns that email address and the URL for another honeypot to the server. The script then shows the harvester a page which contains a legal notice barring them from harvesting email addresses, along with links to the email address and URL returned from Project Honeypot's server. If the harvester follows links, it will then be directed to another honeypot script on some other server, get another bogus email address for their database, and be directed to yet another honeypot script (and so on, and so on, until the harvester decides to stop.)
The email addresses all use domains which are not otherwise used for any legitimate email, and which are handled by email servers controlled by Project Honeypot. When one of these email addresses receives spam, the IP of the original harvester is looked up in their database and added to a public list of known "harvester" IPs.
This list is known as the HTTP Blacklist, or "http:BL".
There are several ways to make use of the http:BL. The most common is probably the Apache mod_httpbl module, which can be configured to block listed IPs from your web sites. There are also plug-ins for several commonly used CMS packages, such as Drupal, WordPress, Joomla, and phpBB. These modules are listed on this page.
However, you may wish to have finer-grained control over what to do with harvesters. You may wish to show them a different version of the page, for example- one with different text (e.g. "Dear spammer, go away kthxdie!") or whose form submits to a non-functional URL so you never get bothered with their junk.
Below is an explanation of how to write this kind of check into your own scripts. The sample code is written in Perl, because that's the language I normally use for things like this; however, I will try and explain how it works clearly enough that you can write the equivalent code in any language you like.
Looking up IPs on the http:BL is similar to looking up IPs on other blacklists- it involves reversing the IP, adding a DNS suffix, and checking whether or not the resulting name exists.
Using the http:BL requires you to register with Project Honeypot and request an Access Key. This allows them to track how many requests are coming in from each user, as well as how many different users are seeing traffic from each harvester. The keys are 12 characters in length. On this page I will use "keykeykeykey" as a sample key (rather than sharing my key with the entire world.)
For example, a script which I wrote for a client was accessed early this morning from 201.229.208.2, which IS listed in the http:BL. Using this as an example, you can check whether a given IP is on the list by doing a DNS query like this:
$ nslookup keykeykeykey.2.208.229.201.dnsbl.httpbl.org
Server: 192.168.1.30
Address: 192.168.1.30#53
Non-authoritative answer:
Name: keykeykeykey.2.208.229.201.dnsbl.httpbl.org
Address: 127.1.55.7
As you can see, we reversed the octets (eight-bit parts) of the IP address, added our key and a "." to the beginning, and ".dnsbl.httpbl.org" to the end. It did return an answer, which means that it IS listed. The answer itself tells us several things:
The "1" tells us that it's been 1 day since the last time this IP was seen doing something questionable, such as harvesting email addresses from a honeypot page.
The "55" is a "threat score", a number ranging from 1 to 255, assigned by Project Honeypot. Their web site doesn't tell exactly how this number is calculated, but they do tell us that it's based on a number of factors, including how many honeypots they visit and how many of those visits actually turn into spam being sent. You may wish to use this number for your own purposes- for example, only block visitors who have a score higher than a certain number.
The "7" tells us that the IP is considered "suspicious", has been seen harvesting, and has engaged in "comment spamming" (filling out forms to try and send spam, trying to add spam links to a wiki, etc.) Note that if the last octet is zero, the IP is known to be a search engine spider, and instead of the third octet being a "threat score", it will identify which search engine owns the IP.
The full list of the possible values can be found in Project Honeypot's http:BL API documentation.
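As a rough sketch of how the fields described above could be pulled apart in code, the following decodes an answer such as "127.1.55.7" and applies a simple threshold. The names httpbl_decode() and httpbl_should_block() are hypothetical helpers made up for this illustration, and the threshold values are arbitrary. It assumes that the last octet is a bitmask in which 1 means "suspicious", 2 means "harvester", and 4 means "comment spammer" (which is how the "7" above breaks down), and that every valid answer starts with 127.

```perl
use strict ;
use warnings ;

# Assumed bit assignments for the last octet (type 0 = search engine).
my @TYPE_BITS = ( [ 1 , "suspicious" ] , [ 2 , "harvester" ] , [ 4 , "comment spammer" ] ) ;

# httpbl_decode() is a hypothetical helper: it takes the dotted-quad answer
# and returns a hash ref, or undef if it doesn't look like an http:BL answer.
sub httpbl_decode
{
    my ( $answer ) = @_ ;
    return undef unless defined $answer ;
    return undef unless $answer =~ /\A127\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\z/ ;

    # force numeric values, so that "&" below is a numeric (not string) AND
    my ( $days , $third , $type ) = ( $1 + 0 , $2 + 0 , $3 + 0 ) ;
    return undef if grep { $_ > 255 } ( $days , $third , $type ) ;

    # type 0: search engine, and the third octet identifies which one
    if ( $type == 0 )
    {
        return { search_engine => 1 , engine_id => $third , labels => [] } ;
    }

    my @labels = map { $_->[1] } grep { $type & $_->[0] } @TYPE_BITS ;
    return { search_engine => 0 , days => $days , threat => $third ,
             type => $type , labels => \@labels } ;
}

# httpbl_should_block() is a hypothetical policy: block only if the threat
# score is high enough and the last sighting was recent enough.
sub httpbl_should_block
{
    my ( $info , $min_threat , $max_days ) = @_ ;
    return 0 unless $info ;
    return 0 if $info->{search_engine} ;
    return ( $info->{threat} >= $min_threat && $info->{days} <= $max_days ) ? 1 : 0 ;
}

my $info = httpbl_decode ( "127.1.55.7" ) ;
print "days=$info->{days} threat=$info->{threat} types="
    . join ( "," , @{ $info->{labels} } ) . "\n" ;
```

With the sample answer from above, this prints "days=1 threat=55 types=suspicious,harvester,comment spammer". Whether a threshold like "threat of 25 or more, seen within 30 days" is sensible for your site is a judgment call, not something the API dictates.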
An IP which is not listed will produce an NXDOMAIN result from the DNS check (i.e. "name does not exist".) For example, if we look up 208.111.3.163, my web server's IP...
$ nslookup keykeykeykey.163.3.111.208.dnsbl.httpbl.org
Server: 192.168.1.30
Address: 192.168.1.30#53
** server can't find keykeykeykey.163.3.111.208.dnsbl.httpbl.org:
NXDOMAIN
Now that we see the basic method for checking an IP's status, the next step is to turn that into code. Again, I will be using Perl; however, it should be fairly simple for any competent programmer to write the equivalent code in any other language they are familiar with.
The core of the process is a DNS lookup. Perl supports the same gethostbyname() function available in C, and it returns the same binary structure that the C function returns- which means the IP addresses are returned as four bytes, rather than a "xx.xx.xx.xx" string which is easier to use. The Socket module contains the function inet_ntoa(), which converts the binary format to a usable string. Our script will need to include this module. These "use" statements are normally done at the beginning of the script.
use Socket ;
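To make the four-byte format concrete, here is a small illustration which doesn't need any DNS lookup at all. It uses inet_aton(), also from the Socket module, to pack a dotted-quad string into the same binary form that gethostbyname() returns for each address, and then converts it back with inet_ntoa(). The address used is just the sample answer from earlier on this page.

```perl
use strict ;
use warnings ;
use Socket ;

# inet_aton() packs a dotted-quad string into four raw bytes, the same
# form gethostbyname() uses for each address in its result list.
my $packed = inet_aton ( "127.1.55.7" ) ;

print "length: " . length ( $packed ) . "\n" ;
print "octets: " . join ( " " , unpack ( "C4" , $packed ) ) . "\n" ;

# inet_ntoa() turns the four bytes back into a printable string
print "string: " . inet_ntoa ( $packed ) . "\n" ;
```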
I normally take any items which might need to be configured by the user, or which might change in the future, and make them global variables at the beginning of the script. In this case, that means our http:BL key (which anybody using this code will need to customize) and the DNS zone name within which we will be searching (which probably won't change, but you never know what the future holds.)
my $httpbl_key = "keykeykeykey" ;
my $httpbl_zone = "dnsbl.httpbl.org" ;
With these pieces in place, the function to check whether a given IP address is on the list looks like this. It returns 1 if the IP is listed (and not a search engine) or 0 if not.
sub httpbl_check($)
{
    my $ip = ( shift || return 0 ) ;

    ########################################
    # build the name

    my $rev_ip = join ( "." , reverse split ( /\./ , $ip ) ) ;
    my $name   = "$httpbl_key.$rev_ip.$httpbl_zone" ;

    ########################################
    # query the name

    my @a = gethostbyname ( $name ) ;
    unless ( $#a > 3 )
    {
        print STDERR "httpbl allow $ip (empty result)\n" ;
        return 0 ;
    }
    @a = map { inet_ntoa($_) } @a[ 4 .. $#a ] ;

    ########################################
    # split into fields

    my ( undef , $days , $threat , $type ) = split ( /\./ , $a[0] ) ;

    ########################################
    # search engines (type=0) are okay

    unless ( $type & 7 )
    {
        print STDERR "httpbl allow $ip -> $a[0] days=$days"
            . " threat=$threat type=$type\n" ;
        return 0 ;
    }

    ########################################
    # others, not so much.

    print STDERR "httpbl deny $ip -> $a[0] days=$days"
        . " threat=$threat type=$type\n" ;
    return 1 ;
}
Obviously, how you choose to use the httpbl_check() function within your own code is up to you. Usually the procedure looks something like this:
########################################
# get the client's IP

my $ip = ( $ENV{"REMOTE_ADDR"} || "" ) ;

########################################
# if client's IP is listed on http:BL, don't send the message

if ( $ip )
{
    if ( httpbl_check ( $ip ) )
    {
        # the blank line after the Content-type header is required,
        # it marks the end of the CGI response headers
        print <<EOF ;
Content-type: text/plain

You are not allowed to use this form, because your IP address is known to be
involved in one or more spamming operations. Do the world a favour and get a
real job, spammer.
EOF
        exit 0 ;
    }
}

########################################
# the IP was not on the list
# do further checks and then send the message
Of course, the form handler script should NOT rely on this function as its only check. It should also check the form data to make sure this isn't a spammer using a previously unknown IP to try and send spam. The safest approach is to assume that the data you receive in the form is always being supplied by a spammer, hacker, or other attacker, and that you can't trust any of it until you have verified that it doesn't contain anything harmful. For example...
Do a full set of security checks on all of the fields you receive from the form. For example, make sure "name" fields aren't more than a certain length (I normally use 40 bytes) and that they only contain "normal" characters. Fields which should contain email addresses should be checked to ensure that they are actually email addresses, and not something else (or something more.)
In particular, any field which isn't a multi-line text field should never contain newline characters, and except in very specific circumstances, fields received from a web page should never contain control (i.e. "non-printable") characters.
Whenever possible, you should also structure your rules so that they check what the value IS, rather than what it IS NOT. For example, if a field is supposed to be an integer, you should test each character to make sure that it IS a digit, rather than checking to make sure it IS NOT a letter. (What if it's punctuation? Or a control character?)
One rule that I use a lot is that every byte of a supplied field must be in the "printable" range, which consists of ASCII codes from 0x20 to 0x7E. Note that if you expect to support Unicode text, the check is a bit more complicated. In Perl, you can usually use the "\p{IsCntrl}" regex pattern to detect control (i.e. non-printable) characters, like so:
use CGI qw ( :standard ) ;
my $value = ( param ( "fieldname" ) || "" ) ;
if ( $value =~ /\p{IsCntrl}/ )
{
die "The value contains control characters.\n" ;
}
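The rules above (length limits, printable characters only, and testing what a value IS rather than what it IS NOT) might be sketched as a couple of small checks like the ones below. The function names is_printable_field() and is_integer_field() are hypothetical, the 40-byte limit is just the number I mentioned earlier, and the ten-digit limit on integers is an arbitrary choice for the example. The checks are ASCII-only, so they would need adjusting for Unicode text.

```perl
use strict ;
use warnings ;

# Hypothetical helper: true only if the value is 1 to $max bytes long
# and every byte is printable ASCII (0x20 to 0x7E).
sub is_printable_field
{
    my ( $value , $max ) = @_ ;
    return 0 unless defined $value ;
    return 0 if length ( $value ) < 1 || length ( $value ) > $max ;

    # \A and \z anchor to the real start and end of the string;
    # "$" would quietly allow a trailing newline
    return ( $value =~ /\A[\x20-\x7E]+\z/ ) ? 1 : 0 ;
}

# Hypothetical helper: true only if the value IS one to ten ASCII digits.
sub is_integer_field
{
    my ( $value ) = @_ ;
    return 0 unless defined $value ;
    return ( $value =~ /\A[0-9]{1,10}\z/ ) ? 1 : 0 ;
}

for my $name ( "Alice Example" , "Alice\r\nBcc: victim\@domain.xyz" )
{
    my $verdict = is_printable_field ( $name , 40 ) ? "ok" : "rejected" ;
    print "name: $verdict\n" ;
}
```

The first name is accepted and the second is rejected, because it contains a carriage return and a newline.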
Of course, if there are other characters which might be dangerous in your application (such as single quotes or "--" in a SQL-based application) you might want to test for those as well. (However, keep reading for more specific advice on SQL-based apps.)
There is also one very simple rule you should follow, for any script which accepts data and generates email. NEVER, under any circumstances, use any form-supplied data in the headers of the email message. Many so-called "programmers" like to use form-supplied email addresses in a "Reply-To:" or "From:" header, so the recipient can "just hit reply" to reach the person who supposedly filled out the form. DON'T DO THIS. If you keep a log of the data supplied by the web site users, you WILL find spammers trying to submit entire message fragments as email addresses, like this:
sender@domain.xyz\r\nTo: victim@domain.xyz, victim@domain.xyz, victim@domain.xyz\r\nCc: victim@domain.xyz, victim@domain.xyz, victim@domain.xyz\r\nSubject: make money fast\r\n\r\nMake money fast, <a href='http://domain.xyz/'>click here</a>\r\n\r\n
If you use this entire value in a "From" header, you will end up building an email message which IS their spam, and sending it to whoever is listed in their "To:", "Cc:", and "Bcc:" headers. I've seen this happen too many times to not mention it here.
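One way to follow this rule is to build every header from fixed values and put the form-supplied address in the body of the message only. The sketch below is a minimal illustration of that idea, not a complete mailer: build_form_message() is a hypothetical helper, the two example.com addresses are placeholders, and the address pattern is a deliberately strict sanity check rather than full email address validation.

```perl
use strict ;
use warnings ;

# Fixed values chosen by the site owner (placeholders for illustration).
my $mail_from = 'webform@example.com' ;
my $mail_to   = 'owner@example.com' ;

# Hypothetical helper: returns the message text, or undef if the supplied
# address doesn't look like a single plain email address.
sub build_form_message
{
    my ( $email , $text ) = @_ ;
    $email = "" unless defined $email ;
    $text  = "" unless defined $text ;

    # one local part, one "@", one host name, nothing else, which
    # rules out spaces, commas, and CR/LF smuggled in with the address
    unless ( $email =~ /\A[A-Za-z0-9._%+-]{1,64}\@[A-Za-z0-9.-]{1,255}\z/ )
    {
        return undef ;
    }

    # strip control characters (other than newline) from the body text
    $text =~ s/[\x00-\x09\x0B-\x1F\x7F]//g ;

    # the headers contain no form data at all
    return "From: $mail_from\n"
         . "To: $mail_to\n"
         . "Subject: Web form submission\n"
         . "\n"
         . "Sender's claimed address: $email\n"
         . "\n"
         . $text . "\n" ;
}
```

The recipient can still copy the address out of the body to reply, and a value like the injection attempt shown above is rejected before any message is built.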
If your script interacts with a SQL database, make sure that any form values which correspond to text fields don't contain any single-quote characters, and that any form values which correspond to numeric fields don't contain anything other than digits (and a single ".", if the field supports non-integral numbers.)
Also, when building queries, if the language or module supplies a quoting function (such as the quote() method of a Perl DBI database handle, normally written as "$dbh->quote()"), USE IT rather than just manually adding single-quotes to the beginning and end of the data value.
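The two checks on form values described above (no single quotes in text, nothing but digits and at most one "." in numbers) could look something like this. Both function names are hypothetical, the text check is ASCII-only, and the numeric pattern assumes that values like ".5" or "5." should be rejected, which may or may not suit your application.

```perl
use strict ;
use warnings ;

# Hypothetical helper: digits, optionally followed by one "." and more digits.
sub is_sql_number
{
    my ( $value ) = @_ ;
    return 0 unless defined $value ;
    return ( $value =~ /\A[0-9]+(?:\.[0-9]+)?\z/ ) ? 1 : 0 ;
}

# Hypothetical helper: printable ASCII only, excluding the single quote (0x27).
sub is_sql_text
{
    my ( $value ) = @_ ;
    return 0 unless defined $value ;
    return ( $value =~ /\A[\x20-\x26\x28-\x7E]*\z/ ) ? 1 : 0 ;
}

for my $price ( "19.99" , "1.2.3" , "5; DROP TABLE orders" )
{
    my $verdict = is_sql_number ( $price ) ? "ok" : "rejected" ;
    print "$price: $verdict\n" ;
}
```

Only "19.99" passes. Checks like these go alongside the quoting function, not in place of it.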
When I'm writing code which involves the Internet, I always try to be as paranoid as I can about the people who will be using the application- especially if it's a web site. I constantly look for ways that the program can be broken, not only from a security standpoint, but also in terms of just bugs in general. I always keep two rules in mind:
If you ever believe that your code is attacker-proof, the world will prove you wrong by producing a better attacker.
If you ever believe that your code is idiot-proof, the world will prove you wrong by producing a better idiot.