by Michael Dinowitz
Before we try to understand spam, we should try to get some. Yes, I know this sounds like a silly thing to say as anyone with email gets spam, but what I'm going to describe is a way to get a particular type of spam: spam from page scrapers.
Page scraping is when a spammer has a web agent or the like visit your web site, download all of your web pages and look for anything that is in a mailto link or is structured like an email address. Once a page scraper harvests such an address, that address is sure to be sent spam in no time at all.
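To see what a scraper is actually looking for, here is a minimal Python sketch of the harvesting step. It is an illustration under my own assumptions, not the code of any real scraper: the pattern is a loose "looks like an email address" match rather than a full address parser, and the names `EMAIL_RE`, `harvest_addresses` and the sample page are hypothetical.

```python
import re

# Loose pattern for anything shaped like an email address.
# This is an assumption for illustration, not a full RFC 5322 parser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def harvest_addresses(html: str) -> set[str]:
    """Return every address-like string found in the raw page source."""
    return {match.lower() for match in EMAIL_RE.findall(html)}


# Hypothetical page: one mailto link, one address in visible text,
# and one address hidden inside an HTML comment.
page = """
<a href="mailto:sales@example.com">Contact sales</a>
<p>Write to support@example.com for help.</p>
<!-- trap-01@example.com -->
"""

print(sorted(harvest_addresses(page)))
```

Notice that the scraper works on the raw source, so it makes no difference whether an address is in a link, in visible text or in a comment. All three are picked up.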
If the last article described how to defeat page scraping, then how do we expect to catch page scrapers now? That's easy. We create a few email addresses, place them inside HTML comments, and put those comments on the site. A human visitor never sees a comment, but a scraper reading the raw page source does, so any mail to these addresses MUST be from page scrapers and will be our 'baseline' for spam. This will be important later, but for now it'll help us determine which domains and IPs send spam. How do we find this information? It's in the mail header, which we will examine in detail in the next issue.
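To make the comment trap concrete, here is a minimal Python sketch of one way to set it up. Everything in it is my own assumption: the `trap-` prefix, the `example.com` domain and the function names are hypothetical, and you would use a domain you actually receive mail for.

```python
import secrets

# Hypothetical domain; replace it with one whose mail you control.
TRAP_DOMAIN = "example.com"


def new_trap_address(domain: str = TRAP_DOMAIN) -> str:
    """Build a random address that is never given to a real person."""
    return f"trap-{secrets.token_hex(4)}@{domain}"


def as_html_comment(address: str) -> str:
    """Wrap the address in an HTML comment so browsers never display it."""
    return f"<!-- {address} -->"


def is_trap_mail(recipient: str, traps: set[str]) -> bool:
    """True when a message was addressed to one of the trap addresses."""
    return recipient.strip().lower() in traps


# Create a few trap addresses and print the comments to paste into pages.
traps = {new_trap_address() for _ in range(3)}
for address in sorted(traps):
    print(as_html_comment(address))
```

Using a different address on each page is a design choice rather than a requirement, but it would let you tell which page a scraper took the address from.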