Optimizing DSPAM + MySQL 4.1

Want to support HowtoForge? Become a subscriber!
 
Submitted by laursen (Contact Author) (Forums) on Fri, 2006-03-31 21:56. :: Anti-Spam/Virus

Introduction

DSPAM is a scalable and open-source content-based spam filter designed for multi-user enterprise systems. It's great at filtering out spam but on busy mailservers the pruning of the MySQL databases takes way too long time ...

The default purge-4.1.sql script provided with DSPAM can be heavily optimized by adding indexes to the database and using the indexes properly when pruning.

Lets start by add some indexes on the dspam_token_data table. Adding INDEX'es and using the indexes correctly allows us to query the dabases extremely fast.

The default script and table structure provided with DSPAM causes full table scans since the data is either not indexed or the indexes are not properly used.

Adding indexes

To begin with we need to add some indexes to the tables. Indexes allows us to query you databases much faster since we don't have to do full table scans. (Our current DSPAM database is 8.5G in size and full table scans literally brings the entire mailserver to a stand still.)

Connect to the database server and issue the following commands:

mysql> alter table dspam_token_data add index(spam_hits);
mysql> alter table dspam_token_data add index(innocent_hits);
mysql> alter table dspam_token_data add index(last_hit);

This will add indexes to the spam_hits, innocent_hits and last_hit colums.

The dspam_signature_data table is already properly indexed - however the indexes are not properly used when cleaning out old data (more about this below).

The interesting parts of the script provided with DSPAM are as follows:

delete from dspam_token_data
where (innocent_hits*2) + spam_hits < 5
and @a-to_days(last_hit) > 60;

This query doesn't use the index on the last_hit column since we call the to_days function on the field and thereby loose the ability to use the index.

Also notice that the extra added indexes on the innocent_hits and spam_hits are used here. Change the query to:

delete from dspam_token_data
where (innocent_hits*2) + spam_hits < 5
and from_days(@a-60) > last_hit;

Next query:

delete from dspam_token_data
where innocent_hits = 1 and spam_hits = 0
and @a-to_days(last_hit) > 15;

Same problem - change this to:

delete from dspam_token_data
where innocent_hits = 1 and spam_hits = 0
and from_days(@a-15) > last_hit;

Next query:

delete from dspam_token_data
where innocent_hits = 0 and spam_hits = 1
and @a-to_days(last_hit) > 15;

Change this to:

delete from dspam_token_data
where innocent_hits = 0 and spam_hits = 1
and from_days(@a-15) > last_hit;

Next query:

delete from dspam_token_data
USING
dspam_token_data LEFT JOIN dspam_preferences
ON dspam_token_data.uid = dspam_preferences.uid
AND dspam_preferences.preference = 'trainingMode'
AND dspam_preferences.value in('TOE','TUM','NOTRAIN')
WHERE @a-to_days(dspam_token_data.last_hit) > 90
AND dspam_preferences.uid IS NULL;

Change this to:

delete from dspam_token_data
USING
dspam_token_data LEFT JOIN dspam_preferences
ON dspam_token_data.uid = dspam_preferences.uid
AND dspam_preferences.preference = 'trainingMode'
AND dspam_preferences.value in('TOE','TUM','NOTRAIN')
WHERE from_days(@a-90) > dspam_token_data.last_hit
AND dspam_preferences.uid IS NULL;

Next query:

delete from dspam_token_data
USING
dspam_token_data LEFT JOIN dspam_preferences
ON dspam_token_data.uid = dspam_preferences.uid
AND dspam_preferences.preference = 'trainingMode'
AND dspam_preferences.value = 'TUM'
WHERE @a-to_days(dspam_token_data.last_hit) > 90
AND innocent_hits + spam_hits < 50
AND dspam_preferences.uid IS NOT NULL;

Change this to:

delete from dspam_token_data
USING
dspam_token_data LEFT JOIN dspam_preferences
ON dspam_token_data.uid = dspam_preferences.uid
AND dspam_preferences.preference = 'trainingMode'
AND dspam_preferences.value = 'TUM'
WHERE from_days(@a-90) > dspam_token_data.last_hit
AND innocent_hits + spam_hits < 50
AND dspam_preferences.uid IS NOT NULL;

And finally:

delete from dspam_signature_data
where @a-14 > to_days(created_on);

should be changed to:

delete from dspam_signature_data
where from_days(@a-14) > created_on;

Testrun with the changed prune script

So does the changes help? Yes! Below are timings for the old unmodified script and the new modified script:

real 2m57.726s
user 0m0.010s
sys 0m0.000s

And for the new modified script (used on the same dataset):

real 0m1.794s
user 0m0.000s
sys 0m0.000s

The script used almost 3 minutes using the default DSPAM script and less than 2 seconds using the altered script and indexes.

Pros and cons

When adding indexes to tables you use far more disk space for your data. If you need the performance when pruning data and can afford to use the extra disk space then add the indexes and change your prune script as explained above. If you only have a small amount of data in your database and performance isn't an issue then stick with the default DSPAM script.

Links

Feedback

All feedback is appreciated - feel free to contact me via email: laursen[at]netgroup.dk


Please do not use the comment function to ask for help! If you need help, please use our forum.
Comments will be published after administrator approval.
Submitted by cici (not registered) on Thu, 2009-01-01 14:20.
Do you use MyISAM engine or InnoDB in your installation? Does it make any difference for index optimization?
Submitted by laradji (registered user) on Tue, 2007-07-17 16:25.

With dspam default install database schema  with not  all the corectly index ,my dspam database about 3G tale 1H30m to run the default maintenance script  ,imagine my system complety  frozen.

The minimum step is to do the correct dspam index like in this article .

Tks for your maintenance query . 

 

Submitted by Anonymous (not registered) on Sat, 2006-06-17 19:43.

Very useful!

Did you tried to use postgresql? Seems, postgresql queries isn't optimal too ;-)

Submitted by brody (registered user) on Fri, 2006-03-31 22:40.

Good Article, it holds its own. It has clear steps, explaining to the user what everything does and on top of that you give the reasons, why or why not a person should use this... good job, very easy read.

Submitted by Anonymous (not registered) on Tue, 2006-08-08 18:18.

I've seen a lot of people write quick and dirty how-to faqs in the past that are similar to this minus one factor that makes this the best I've seen.  I actually understand the purpose and content as a systems engineer without having to go learn mysql in depth to make this work.  It's brief and very helpful in decided if this is right on a given system.

Thanks!