Optimizing DSPAM + MySQL 4.1
Introduction
DSPAM is a scalable and open-source content-based spam filter designed for multi-user enterprise systems. It's great at filtering out spam but on busy mailservers the pruning of the MySQL databases takes way too long time ...
The default purge-4.1.sql script provided with DSPAM can be heavily optimized by adding indexes to the database and using the indexes properly when pruning.
Lets start by add some indexes on the dspam_token_data table. Adding INDEX'es and using the indexes correctly allows us to query the dabases extremely fast.
The default script and table structure provided with DSPAM causes full table scans since the data is either not indexed or the indexes are not properly used.
Adding indexes
To begin with we need to add some indexes to the tables. Indexes allows us to query you databases much faster since we don't have to do full table scans. (Our current DSPAM database is 8.5G in size and full table scans literally brings the entire mailserver to a stand still.)
Connect to the database server and issue the following commands:
mysql> alter table dspam_token_data add index(spam_hits);
mysql> alter table dspam_token_data add index(innocent_hits);
mysql> alter table dspam_token_data add index(last_hit);
This will add indexes to the spam_hits, innocent_hits and last_hit colums.
The dspam_signature_data table is already properly indexed - however the indexes are not properly used when cleaning out old data (more about this below).
The interesting parts of the script provided with DSPAM are as follows:
delete from dspam_token_data
where (innocent_hits*2) + spam_hits < 5
and @a-to_days(last_hit) > 60;
This query doesn't use the index on the last_hit column since we call the to_days function on the field and thereby loose the ability to use the index.
Also notice that the extra added indexes on the innocent_hits and spam_hits are used here. Change the query to:
delete from dspam_token_data
where (innocent_hits*2) + spam_hits < 5
and from_days(@a-60) > last_hit;
Next query:
delete from dspam_token_data
where innocent_hits = 1 and spam_hits = 0
and @a-to_days(last_hit) > 15;
Same problem - change this to:
delete from dspam_token_data
where innocent_hits = 1 and spam_hits = 0
and from_days(@a-15) > last_hit;
Next query:
delete from dspam_token_data
where innocent_hits = 0 and spam_hits = 1
and @a-to_days(last_hit) > 15;
Change this to:
delete from dspam_token_data
where innocent_hits = 0 and spam_hits = 1
and from_days(@a-15) > last_hit;
Next query:
delete from dspam_token_data
USING
dspam_token_data LEFT JOIN dspam_preferences
ON dspam_token_data.uid = dspam_preferences.uid
AND dspam_preferences.preference = 'trainingMode'
AND dspam_preferences.value in('TOE','TUM','NOTRAIN')
WHERE @a-to_days(dspam_token_data.last_hit) > 90
AND dspam_preferences.uid IS NULL;
Change this to:
delete from dspam_token_data
USING
dspam_token_data LEFT JOIN dspam_preferences
ON dspam_token_data.uid = dspam_preferences.uid
AND dspam_preferences.preference = 'trainingMode'
AND dspam_preferences.value in('TOE','TUM','NOTRAIN')
WHERE from_days(@a-90) > dspam_token_data.last_hit
AND dspam_preferences.uid IS NULL;
Next query:
delete from dspam_token_data
USING
dspam_token_data LEFT JOIN dspam_preferences
ON dspam_token_data.uid = dspam_preferences.uid
AND dspam_preferences.preference = 'trainingMode'
AND dspam_preferences.value = 'TUM'
WHERE @a-to_days(dspam_token_data.last_hit) > 90
AND innocent_hits + spam_hits < 50
AND dspam_preferences.uid IS NOT NULL;
Change this to:
delete from dspam_token_data
USING
dspam_token_data LEFT JOIN dspam_preferences
ON dspam_token_data.uid = dspam_preferences.uid
AND dspam_preferences.preference = 'trainingMode'
AND dspam_preferences.value = 'TUM'
WHERE from_days(@a-90) > dspam_token_data.last_hit
AND innocent_hits + spam_hits < 50
AND dspam_preferences.uid IS NOT NULL;
And finally:
delete from dspam_signature_data
where @a-14 > to_days(created_on);
should be changed to:
delete from dspam_signature_data
where from_days(@a-14) > created_on;
Testrun with the changed prune script
So does the changes help? Yes! Below are timings for the old unmodified script and the new modified script:
real 2m57.726s
user 0m0.010s
sys 0m0.000s
And for the new modified script (used on the same dataset):
real 0m1.794s
user 0m0.000s
sys 0m0.000s
The script used almost 3 minutes using the default DSPAM script and less than 2 seconds using the altered script and indexes.
Pros and cons
When adding indexes to tables you use far more disk space for your data. If you need the performance when pruning data and can afford to use the extra disk space then add the indexes and change your prune script as explained above. If you only have a small amount of data in your database and performance isn't an issue then stick with the default DSPAM script.
Links
Feedback
All feedback is appreciated - feel free to contact me via email: laursen[at]netgroup.dk