Fight Image Spam With FuzzyOCR And SpamAssassin On Ubuntu 9.10
Version 1.0
Author: Falko Timme
This tutorial describes how to scan emails for image spam with FuzzyOCR on an Ubuntu 9.10 server. FuzzyOCR is a plugin for SpamAssassin which is aimed at unsolicited bulk mail containing images as the main content carrier. Using different methods, it analyzes the content and properties of images to distinguish between normal mails (ham) and spam mails. FuzzyOCR tries to keep the system load low by scanning only mails that have not already been categorized as spam by SpamAssassin, thus avoiding unnecessary work.
I do not issue any guarantee that this will work for you!
1 Preliminary Note
In this article I will use Ubuntu 9.10 for the base system.
I assume that SpamAssassin is already installed and working, with /etc/mail/spamassassin/ as its main configuration directory. If your directory is different (e.g. if you have ISPConfig 2 installed, the directory is /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/), that is fine. I will point out what needs to be changed, and where.
Please make sure that your SpamAssassin version works with FuzzyOCR. For example, the FuzzyOCR version I'm going to install here (fuzzyocr-3.5.1) requires SpamAssassin 3.1.4 or newer.
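If you want to check the version requirement from the shell, the following is a minimal sketch. The helper name version_ge is made up for this example, the comparison relies on GNU sort's -V option, and the grep pattern assumes that spamassassin --version prints a three-part number such as 3.2.5.

```shell
#!/bin/sh
# Sketch: compare an installed version against a required minimum.
# version_ge is a hypothetical helper; it assumes GNU sort with -V support.
version_ge() {
    # Succeeds if $1 >= $2 in version ordering:
    # the smaller of the two versions must be the required one.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Only run the check if spamassassin is actually on the PATH.
if command -v spamassassin >/dev/null 2>&1; then
    # Assumption: the first x.y.z number in the output is the SpamAssassin version.
    installed=$(spamassassin --version | grep -o '[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*' | head -n 1)
    if version_ge "$installed" "3.1.4"; then
        echo "SpamAssassin $installed is new enough"
    else
        echo "SpamAssassin $installed is too old for fuzzyocr-3.5.1"
    fi
fi
```

If spamassassin is not in your PATH, replace it with the full path to the executable.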
2 Install FuzzyOCR
FuzzyOCR can be installed as follows:
aptitude install fuzzyocr netpbm gifsicle libungif-bin gocr ocrad libstring-approx-perl libmldbm-sync-perl imagemagick tesseract-ocr
This will place the FuzzyOCR configuration files in the /etc/mail/spamassassin/ directory.
If your SpamAssassin directory is different, e.g. /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/, then you can copy the FuzzyOCR configuration files to that directory as follows:
cp /etc/mail/spamassassin/FuzzyOcr* /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/
FuzzyOCR is now installed; next we need to configure it.
3 Configure FuzzyOCR
FuzzyOCR's configuration file is /etc/mail/spamassassin/FuzzyOcr.cf. In that file almost everything is commented out. We open that file now and make some modifications:
vi /etc/mail/spamassassin/FuzzyOcr.cf
Put the following line into it to define the location of FuzzyOCR's spam words file:
[...]
focr_global_wordlist /etc/mail/spamassassin/FuzzyOcr.words
[...]
/etc/mail/spamassassin/FuzzyOcr.words is a predefined word list that comes with FuzzyOCR. You can adjust it to your needs if you like.
Next change
[...]
# Include additional scanner/preprocessor commands here:
#
focr_bin_helper pnmnorm, pnminvert, ppmtopgm
#not available in Debian: pamthreshold,pamtopnm
focr_bin_helper tesseract
[...]
to
[...]
# Include additional scanner/preprocessor commands here:
#
#focr_bin_helper pnmnorm, pnminvert, ppmtopgm
#not available in Debian: pamthreshold,pamtopnm
#focr_bin_helper tesseract
focr_bin_helper pnmnorm, pnminvert, convert, ppmtopgm, tesseract
[...]
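The helper programs named in focr_bin_helper have to be installed for FuzzyOCR to use them. The following sketch reports any that cannot be found. The function name missing_helpers is made up for this example, and the check uses your shell's PATH, which may differ from the search path FuzzyOCR is configured with in focr_path_bin.

```shell
#!/bin/sh
# Sketch: print the names of helper programs that are not on the current PATH.
# missing_helpers is a hypothetical name, not part of FuzzyOCR.
missing_helpers() {
    for bin in "$@"; do
        command -v "$bin" >/dev/null 2>&1 || printf '%s\n' "$bin"
    done
}

# The helpers listed in the focr_bin_helper line above.
missing=$(missing_helpers pnmnorm pnminvert convert ppmtopgm tesseract)
if [ -n "$missing" ]; then
    echo "Not found on PATH:"
    echo "$missing"
else
    echo "All helper programs found"
fi
```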
Finally add/enable the following lines:
[...]
# Search path for locating helper applications
focr_path_bin /usr/local/netpbm/bin:/usr/local/bin:/usr/bin
focr_preprocessor_file /etc/mail/spamassassin/FuzzyOcr.preps
focr_scanset_file /etc/mail/spamassassin/FuzzyOcr.scansets
focr_enable_image_hashing 2
focr_digest_db /etc/mail/spamassassin/FuzzyOcr.hashdb
focr_db_hash /etc/mail/spamassassin/FuzzyOcr.db
focr_db_safe /etc/mail/spamassassin/FuzzyOcr.safe.db
[...]
With the last four lines you enable image hashing. This is what the FuzzyOCR developers say about image hashing:
"The Image hashing database feature allows the plugin to store a vector of image features to a database, so it knows this image when it arrives a second time (and therefore does not need to scan it again). The special thing about this function is that it also recognizes the image again if it was changed slightly (which is done by spammers). "
If you use /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin instead of /etc/mail/spamassassin, FuzzyOCR's configuration file is /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin/FuzzyOcr.cf instead of /etc/mail/spamassassin/FuzzyOcr.cf, so edit that one. In the configuration file you must now make sure that you use the correct path (i.e. /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin).
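Adjusting the paths by hand is error-prone, so here is a minimal sketch that does the substitution with sed. The function name rewrite_conf_dir is made up for this example. It assumes the new directory contains no | or & characters, and it only touches paths that follow whitespace, so running it on an already converted file should change nothing.

```shell
#!/bin/sh
# Sketch: print a copy of a FuzzyOcr.cf with the default directory replaced.
# rewrite_conf_dir is a hypothetical helper, not part of FuzzyOCR.
rewrite_conf_dir() {
    # $1 = configuration file, $2 = new directory (no trailing slash)
    sed "s|\([[:space:]]\)/etc/mail/spamassassin/|\1$2/|g" "$1"
}

# Small demonstration on a temporary file instead of the real configuration.
tmp=$(mktemp)
printf 'focr_global_wordlist /etc/mail/spamassassin/FuzzyOcr.words\n' > "$tmp"
rewrite_conf_dir "$tmp" /home/admispconfig/ispconfig/tools/spamassassin/etc/mail/spamassassin
rm -f "$tmp"
```

The function writes to standard output, so you can review the result before you replace the real FuzzyOcr.cf with it.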
That's all for the FuzzyOCR configuration. Now let's see if it works as expected.
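Because almost everything in FuzzyOcr.cf is commented out by default, it can help to list the settings that are really active before you start testing. The following is a sketch only; the function name active_focr_settings is made up for this example, and the path should be adjusted if your configuration directory is different.

```shell
#!/bin/sh
# Sketch: list the FuzzyOCR settings that are active (not commented out).
# active_focr_settings is a hypothetical helper name.
active_focr_settings() {
    # Match lines that start with a focr_ option, allowing leading whitespace.
    grep -E '^[[:space:]]*focr_[a-z_]+' "$1"
}

conf=/etc/mail/spamassassin/FuzzyOcr.cf
if [ -r "$conf" ]; then
    active_focr_settings "$conf" || echo "no active focr_ settings found"
fi
```

SpamAssassin's own syntax check, spamassassin --lint, is a useful complement, because it reports configuration lines that SpamAssassin could not parse.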
4 Test FuzzyOCR
FuzzyOCR comes with sample image spam mails (in the /usr/share/doc/fuzzyocr/examples/ directory):
ls -l /usr/share/doc/fuzzyocr/examples/
The output should look like this:
total 156
-rw-r--r-- 1 root root 13633 2008-09-25 22:47 ocr-animated.eml
-rw-r--r-- 1 root root 16108 2008-09-25 22:47 ocr-gif.eml
-rw-r--r-- 1 root root 27506 2008-09-25 22:47 ocr-jpg.eml
-rw-r--r-- 1 root root 27842 2008-09-25 22:47 ocr-multi.eml
-rw-r--r-- 1 root root 24657 2008-09-25 22:47 ocr-obfuscated.eml
-rw-r--r-- 1 root root 18236 2008-09-25 22:47 ocr-png.eml
-rw-r--r-- 1 root root 16113 2008-09-25 22:47 ocr-wrongext.eml
-rw-r--r-- 1 root root 3576 2008-09-25 22:47 README
We can now feed each of these emails to SpamAssassin to see whether FuzzyOCR is hooked into SpamAssassin correctly. First find out where your spamassassin executable is. Normally it is in your PATH; you can check this by running
which spamassassin
If this prints a result, spamassassin is in your PATH, and you don't need to specify the full path to run it.
If you don't know where spamassassin is, you can find out by running
updatedb
locate spamassassin
If you use ISPConfig 2, spamassassin is here: /home/admispconfig/ispconfig/tools/spamassassin/usr/bin/spamassassin
Now that you know where spamassassin is, you can feed the sample image spam mails to spamassassin like this:
/path/to/spamassassin --debug FuzzyOcr < /usr/share/doc/fuzzyocr/examples/ocr-gif.eml > /dev/null
E.g.
/home/admispconfig/ispconfig/tools/spamassassin/usr/bin/spamassassin --debug FuzzyOcr < /usr/share/doc/fuzzyocr/examples/ocr-gif.eml > /dev/null
or, if spamassassin is in your PATH:
spamassassin --debug FuzzyOcr < /usr/share/doc/fuzzyocr/examples/ocr-gif.eml > /dev/null
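To run all sample mails in one go, you could wrap the command in a small loop. This is a sketch under a few assumptions: the function name fuzzyocr_verdict and the SA_BIN variable are made up for this example, and the grep pattern relies on the "Message is spam" line that FuzzyOCR writes to the debug output. What FuzzyOCR logs for a message it does not consider spam is not covered here, so the sketch simply reports that no spam verdict was logged.

```shell
#!/bin/sh
# Sketch: print FuzzyOCR's spam verdict for each sample mail.
# SA_BIN and fuzzyocr_verdict are hypothetical names; set SA_BIN to the
# full path if spamassassin is not in your PATH.
SA_BIN=${SA_BIN:-spamassassin}

fuzzyocr_verdict() {
    # Debug output goes to stderr: send stderr into the pipe, discard stdout.
    "$SA_BIN" --debug FuzzyOcr < "$1" 2>&1 >/dev/null \
        | grep 'FuzzyOcr: Message is spam' \
        || echo "no spam verdict logged"
}

for f in /usr/share/doc/fuzzyocr/examples/*.eml; do
    [ -f "$f" ] || continue                          # examples not installed
    command -v "$SA_BIN" >/dev/null 2>&1 || break    # spamassassin not found
    printf '%s: ' "${f##*/}"
    fuzzyocr_verdict "$f"
done
```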
You should now see a lot of output; the end should look like this:
[...]
[10025] dbg: FuzzyOcr:
[10025] dbg: FuzzyOcr: Friday Augurt 4, 4:01 pm ET
[10025] dbg: FuzzyOcr: LAS VEGAS, NEVADA--(MARKET WIRE)--Aug 4, 2006 -- auantum Energy, lnc. (OTC
[10025] dbg: FuzzyOcr: BB:aEGY.oB-_-
[10025] dbg: FuzzyOcr: auantum Energy, lnc. is pleased to announce that it has applied to have its shares listed for
[10025] dbg: FuzzyOcr: trading on the Frankfurt Stock Exchange. The company has retained the services ofBaltic
[10025] dbg: FuzzyOcr: lnvestment Group of Hamburg, Germany to assist with the application.
[10025] dbg: FuzzyOcr:
[10025] dbg: FuzzyOcr: _ qEGY,OB "
[10025] dbg: FuzzyOcr:
[10025] dbg: FuzzyOcr: <<=end
[10025] info: FuzzyOcr: Scanset "ocrad" found word "target" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "short term price target oo"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "service" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "trading on the frankfurt stock exchange the company has retained the services ofbaltic"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "stock" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "hot energy stocki"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "stock" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "trading on the frankfurt stock exchange the company has retained the services ofbaltic"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "price" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "current price o"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "price" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "short term price target oo"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "company" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "trading on the frankfurt stock exchange the company has retained the services ofbaltic"
[10025] info: FuzzyOcr: Scanset "ocrad" found word "recommendation" with fuzz of 0.0000
[10025] info: FuzzyOcr: line: "sboog bup recommendation"
[10025] dbg: FuzzyOcr: Enough OCR Hits without space stripping, skipping second matching pass...
[10025] info: FuzzyOcr: Scanset "ocrad" generates enough hits (8), skipping further scansets...
[10025] info: FuzzyOcr: Message is spam, score = 15.000
[10025] info: FuzzyOcr: Adding Hash to "/etc/mail/spamassassin/FuzzyOcr.db" with score "15.000"
[10025] dbg: FuzzyOcr: Digest: 538584:327:549:7::255:255:255:255:168580::0:0:0:0:9098::0:128:0:75:1086::0:0:128:15:395::128:0:128:53:213::0:0:255:29:115
[10025] info: FuzzyOcr: Words found:
[10025] info: FuzzyOcr: "target" in 1 lines
[10025] info: FuzzyOcr: "service" in 1 lines
[10025] info: FuzzyOcr: "stock" in 2 lines
[10025] info: FuzzyOcr: "price" in 2 lines
[10025] info: FuzzyOcr: "company" in 1 lines
[10025] info: FuzzyOcr: "recommendation" in 1 lines
[10025] info: FuzzyOcr: (12 word occurrences found)
[10025] dbg: FuzzyOcr: Remove DIR: /tmp/.spamassassin10025QnPTq8tmp
[10025] dbg: FuzzyOcr: FuzzyOcr ending successfully...
[10025] dbg: FuzzyOcr: Processed in 2.191381 sec.
As you can see, /usr/share/doc/fuzzyocr/examples/ocr-gif.eml has been categorized as spam with a score of 15 points, so FuzzyOCR is working.
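If you save the debug output to a file, you can tally which words triggered the hits. The following sketch counts the found word lines per word. The function name fuzzyocr_hits is made up for this example, and the pattern assumes the log format shown above.

```shell
#!/bin/sh
# Sketch: count how often each word was matched, based on the
# 'found word "..."' lines in FuzzyOCR's debug output (read from stdin).
# fuzzyocr_hits is a hypothetical helper name.
fuzzyocr_hits() {
    sed -n 's/.*found word "\([^"]*\)".*/\1/p' | sort | uniq -c | awk '{print $2, $1}'
}

# Demonstration with three lines in the format shown above.
fuzzyocr_hits <<'EOF'
[10025] info: FuzzyOcr: Scanset "ocrad" found word "stock" with fuzz of 0.0000
[10025] info: FuzzyOcr: Scanset "ocrad" found word "price" with fuzz of 0.0000
[10025] info: FuzzyOcr: Scanset "ocrad" found word "stock" with fuzz of 0.0000
EOF
```

For the three demonstration lines this prints "price 1" and "stock 2". Applied to the full output above, the per-word counts add up to the 8 hits reported for the ocrad scanset.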
Your SpamAssassin can now recognize image spam with the help of FuzzyOCR.
5 Links
- FuzzyOCR: http://www.fuzzyocr.net/
- SpamAssassin: http://spamassassin.apache.org/
- Ubuntu: http://www.ubuntu.com/