Fight Image Spam With FuzzyOCR And SpamAssassin On Debian Lenny
Version 1.0

This tutorial describes how to scan emails for image spam with FuzzyOCR on a Debian Lenny server. FuzzyOCR is a plugin for SpamAssassin which is aimed at unsolicited bulk mail containing images as the main content carrier. Using different methods, it analyzes the content and properties of images to distinguish between normal mails (ham) and spam mails. FuzzyOCR tries to keep the system load low by scanning only mails that have not already been categorized as spam by SpamAssassin, thus avoiding unnecessary work.

I do not issue any guarantee that this will work for you!
1 Preliminary Note

In this article I will use Debian Lenny for the base system. I assume that SpamAssassin is already installed and working, with /etc/mail/spamassassin/ as its main configuration directory. If your directory is different (e.g. if you have ISPConfig 2 installed, the directory is /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/), this is no problem; I will point out what to change and where.

Please make sure that your SpamAssassin version works with FuzzyOCR. For example, the FuzzyOCR version I'm going to install here (fuzzyocr-3.5.1-devel.tar.gz) requires SpamAssassin 3.1.4 or newer.
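The version requirement can be checked from the command line. The following is only a sketch: the function names parse_sa_version and version_ge are made up for this example, and it assumes that the first line printed by spamassassin --version starts with "SpamAssassin version" followed by the version number.

```shell
#!/bin/sh
# parse_sa_version (hypothetical helper): read the output of
# "spamassassin --version" on stdin and print only the version number.
# Assumes the first line looks like "SpamAssassin version 3.2.5".
parse_sa_version() {
    sed -n 's/^SpamAssassin version \([0-9][0-9.]*\).*/\1/p'
}

# version_ge A B (hypothetical helper): succeed if dotted version A is
# greater than or equal to B. Fields are compared numerically, so
# 3.10.0 counts as newer than 3.9.9.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Example use (not executed here):
#   installed=$(spamassassin --version | parse_sa_version)
#   version_ge "$installed" 3.1.4 && echo "SpamAssassin $installed is new enough"
```

The comparison is done in awk rather than with sort -V so that it does not depend on a recent coreutils version.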
2 Install The Prerequisites For FuzzyOCR

FuzzyOCR has some prerequisites like ocrad and gocr that we can install like this:

aptitude install netpbm gifsicle libungif-bin gocr ocrad libstring-approx-perl libmldbm-sync-perl imagemagick tesseract-ocr
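After the installation it can be useful to confirm that the helper programs are really on the PATH. This is a small sketch; missing_tools is a hypothetical helper name, and the program names in the example call are my assumption about which binaries the packages above provide, so adjust the list to your system.

```shell
#!/bin/sh
# missing_tools NAME... (hypothetical helper): print every command
# name from the argument list that cannot be found on the PATH.
# Prints nothing if all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

# Example use (not executed here; the binary names are assumptions):
#   missing_tools gifsicle giffix gocr ocrad tesseract convert pamfile
```

If the call prints nothing, all listed programs were found.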
3 Install FuzzyOCR

Next we download and install the latest FuzzyOCR devel version from http://fuzzyocr.own-hero.net/wiki/Downloads. We download the devel version instead of the stable version because the FuzzyOCR developers say: "The current recommendation is the development version because the stable version lacks features and is very old."

Change to /usr/src/ and download fuzzyocr-3.5.1-devel.tar.gz into that directory:

cd /usr/src/

Then we unpack FuzzyOCR and move all FuzzyOcr* files and the FuzzyOcr directory (they are all in the FuzzyOcr-3.5.1/ directory) to /etc/mail/spamassassin:

tar xvfz fuzzyocr-3.5.1-devel.tar.gz
cd FuzzyOcr-3.5.1/
mv FuzzyOcr* /etc/mail/spamassassin/

If your SpamAssassin directory is different, e.g. /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/, then the last command should be replaced with

mv FuzzyOcr* /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/

Don't delete the /usr/src/FuzzyOcr-3.5.1/ directory yet; it contains a directory with sample image spam emails (samples/) that we need later on to test if FuzzyOCR is working as expected.

FuzzyOCR is now installed; next we need to configure it.
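Because the move step differs only in the destination directory, it can be wrapped in a small function that takes both directories as arguments. This is just a sketch; install_fuzzyocr is a hypothetical name and not part of FuzzyOCR.

```shell
#!/bin/sh
# install_fuzzyocr SRC_DIR DEST_DIR (hypothetical helper): move the
# FuzzyOcr* files and the FuzzyOcr directory from the unpacked source
# tree into the SpamAssassin configuration directory. Everything else
# in SRC_DIR (for example samples/) is left where it is.
install_fuzzyocr() {
    src_dir=$1
    dest_dir=$2
    [ -d "$src_dir" ] || { echo "no such source directory: $src_dir" >&2; return 1; }
    [ -d "$dest_dir" ] || { echo "no such SpamAssassin directory: $dest_dir" >&2; return 1; }
    mv "$src_dir"/FuzzyOcr* "$dest_dir"/
}

# Example use (not executed here):
#   install_fuzzyocr /usr/src/FuzzyOcr-3.5.1 /etc/mail/spamassassin
```

With ISPConfig 2, only the second argument changes to the ISPConfig SpamAssassin directory.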
4 Configure FuzzyOCR

FuzzyOCR's configuration file is /etc/mail/spamassassin/FuzzyOcr.cf. In that file almost everything is commented out. We open that file now and make some modifications:

vi /etc/mail/spamassassin/FuzzyOcr.cf

Put the following line into it to define the location of FuzzyOCR's spam words file:
/etc/mail/spamassassin/FuzzyOcr.words is a predefined word list that comes with FuzzyOCR. You can adjust it to your needs if you like. Next change
to
Finally add/enable the following lines:
With the last four lines you enable image hashing. This is what the FuzzyOCR developers say about image hashing: "The Image hashing database feature allows the plugin to store a vector of image features to a database, so it knows this image when it arrives a second time (and therefore does not need to scan it again). The special thing about this function is that it also recognizes the image again if it was changed slightly (which is done by spammers)."

If you use /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin instead of /etc/mail/spamassassin, FuzzyOCR's configuration file is /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/FuzzyOcr.cf instead of /etc/mail/spamassassin/FuzzyOcr.cf, so edit that one. In the configuration file you can then either replace all occurrences of /etc/mail/spamassassin with /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin, or leave the paths as shown before and make /etc/mail/spamassassin a symlink that points to /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin. For the symlink, create the parent directory first (mkdir /etc/mail/) and then link /etc/mail/spamassassin to the ISPConfig directory.

That's it for the FuzzyOCR configuration. Now let's see if it works as expected.
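For the symlink variant, the two steps (create the parent directory, then create the link) could be sketched like this. link_sa_dir is a hypothetical helper name; the function refuses to overwrite anything that already exists at the link location, which is a precaution I added and not something FuzzyOCR requires.

```shell
#!/bin/sh
# link_sa_dir REAL_DIR LINK_PATH (hypothetical helper): make LINK_PATH
# a symlink that points to REAL_DIR, creating the parent directory of
# LINK_PATH if needed. REAL_DIR should be given as an absolute path.
link_sa_dir() {
    real_dir=$1
    link_path=$2
    [ -d "$real_dir" ] || { echo "not a directory: $real_dir" >&2; return 1; }
    # Refuse to touch anything that already exists at the link location.
    if [ -e "$link_path" ] || [ -L "$link_path" ]; then
        echo "already exists: $link_path" >&2
        return 1
    fi
    mkdir -p "$(dirname "$link_path")" && ln -s "$real_dir" "$link_path"
}

# Example use (not executed here):
#   link_sa_dir /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin /etc/mail/spamassassin
```

After that, the /etc/mail/spamassassin paths in FuzzyOcr.cf resolve to the files in the ISPConfig directory.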
5 Test FuzzyOCR

I mentioned before that FuzzyOCR comes with sample image spam mails (in the samples/ directory):

ls -l /usr/src/FuzzyOcr-3.5.1/samples/

The output should look like this:

total 156

We can now feed each of these emails to SpamAssassin to see if FuzzyOCR is linked correctly into SpamAssassin. First find out where your spamassassin executable is. Normally it's in your PATH; you can check this by running

which spamassassin

If it shows a result, spamassassin is in your PATH, and you don't need to specify the full path to spamassassin to run it. If you don't know where spamassassin is, you can find out by running

updatedb

and then searching the locate database for spamassassin. If you use ISPConfig 2, spamassassin is here: /home/admispconfig/ispconfig/tools/spamassassin/usr/bin/spamassassin

Now that you know where spamassassin is, you can feed the sample image spam mails to spamassassin like this:

/path/to/spamassassin --debug FuzzyOcr < /usr/src/FuzzyOcr-3.5.1/samples/ocr-gif.eml > /dev/null

E.g.

/home/admispconfig/ispconfig/tools/spamassassin/usr/bin/spamassassin --debug FuzzyOcr < /usr/src/FuzzyOcr-3.5.1/samples/ocr-gif.eml > /dev/null

or, if spamassassin is in your PATH:

spamassassin --debug FuzzyOcr < /usr/src/FuzzyOcr-3.5.1/samples/ocr-gif.eml > /dev/null

You should now see a lot of output; the end should look like this:

[...]

As you see, /usr/src/FuzzyOcr-3.5.1/samples/ocr-gif.eml has been categorized as spam with a score of 15 points, so FuzzyOCR is working. Your SpamAssassin is now able to recognize image spam with the help of FuzzyOCR.
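To run all sample mails in one go instead of one at a time, a loop like the following might help. It is only a sketch: scan_samples is a hypothetical helper name, it assumes that all sample files end in .eml like ocr-gif.eml, and it relies on spamassassin's --exit-code option, which makes spamassassin exit with a non-zero status when the message is classified as spam. Any other failure also gives a non-zero status, so a "spam" line here is a hint and not proof.

```shell
#!/bin/sh
# scan_samples SAMPLE_DIR [SA_BIN] (hypothetical helper): feed every
# *.eml file in SAMPLE_DIR to spamassassin and print one line per file,
# "spam <name>" or "ham  <name>", based on the exit status.
# SA_BIN defaults to "spamassassin"; pass a full path if it is not in
# your PATH.
scan_samples() {
    sample_dir=$1
    sa_bin=${2:-spamassassin}
    for mail in "$sample_dir"/*.eml; do
        [ -f "$mail" ] || continue
        if "$sa_bin" --exit-code < "$mail" > /dev/null 2>&1; then
            echo "ham  $(basename "$mail")"
        else
            echo "spam $(basename "$mail")"
        fi
    done
}

# Example use (not executed here):
#   scan_samples /usr/src/FuzzyOcr-3.5.1/samples
#   scan_samples /usr/src/FuzzyOcr-3.5.1/samples /home/admispconfig/ispconfig/tools/spamassassin/usr/bin/spamassassin
```

For a mail that is unexpectedly reported as ham, the --debug FuzzyOcr call shown above gives the details.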
6 Links