16 April 2009 18:54 -
Most people are familiar with some kind of naive Bayesian filtering to
classify mail as spam or non-spam. Or at least they use a spam filter that
relies on Bayesian filtering without knowing it. Nothing new so far. But
besides spam I had another criterion for classifying email: how important is
it to me? In my opinion there are two kinds of mail, apart from spam of
course.
The first is mail that needs no immediate action. For example LinkedIn
invites, the mail with the link to that funny YouTube movie, mail statistics
from servers, updates about who started following you on Twitter, etcetera.
The second is of course email that needs immediate action. For example the
mail from Nagios saying that your RAID 1 mirror is degraded, or the invitation
to that cool party. For me this is email I want to see while I am travelling
and that cannot wait until I am at my desk again.
Therefore I wanted to classify every incoming non-spam mail as either
important or unimportant, and I decided to try doing this with Bayesian
filtering. The idea was that Bayesian filtering would save me the overhead of
setting up complex rules.
Setting this up was straightforward. First I divided all regular email of the
last two months into two categories: important and unimportant. This gave me
the training sets for my filter. After training the filter, I configured it to
place all unimportant email in a separate 'later' mailbox folder instead of
the default Inbox.
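
To make the training and routing steps concrete, here is a minimal sketch of a
naive Bayes classifier with add-one smoothing. It is not the filter I actually
use; the class name NaiveBayes, the helper target_folder and the tokenizer are
made up for illustration. Only the two categories and the 'later' and Inbox
folders come from the setup described above.

```python
import math
import re
from collections import Counter


class NaiveBayes:
    """Tiny multinomial naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = {}         # label -> Counter of word frequencies
        self.doc_counts = Counter()   # label -> number of training mails

    @staticmethod
    def tokenize(text):
        # Hypothetical tokenizer: lowercase runs of letters, digits, apostrophes.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts.setdefault(label, Counter()).update(self.tokenize(text))

    def classify(self, text):
        # Vocabulary is every word seen in any category during training.
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        words = [w for w in self.tokenize(text) if w in vocab]

        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus log likelihood with add-one (Laplace) smoothing.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(vocab)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label  # None if the classifier has not been trained


def target_folder(classifier, text):
    # Only mail classified as unimportant is moved; everything else,
    # including mail an untrained classifier cannot label, stays in the Inbox.
    return "later" if classifier.classify(text) == "unimportant" else "Inbox"
```

Training amounts to calling train(text, "important") or
train(text, "unimportant") for every mail in the two sets, after which
target_folder picks the folder for each new mail.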
Now the only challenge was making sure classification mistakes get corrected.
I do this by archiving every mail in one of two archive mailboxes, in my case
'2009l' or '2009n'. Every night a scheduled job retrains the filter on these
archives, so a misclassified mail that I file in the right archive is
corrected at the next retraining.
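
The first half of such a nightly job, collecting labelled training examples
from the archives, could look roughly like the sketch below. It assumes the
archives are stored in mbox format, which may not match your mail setup, and
the function names message_text and load_training_set are made up for
illustration. The mapping from archive path to category is passed in by the
caller; the paths and labels in the usage note are placeholders.

```python
import mailbox


def message_text(msg):
    """Return the subject plus all text/plain parts of a message as one string."""
    parts = [str(msg.get("Subject", "") or "")]
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)  # bytes, or None for containers
        if not payload:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset name in the mail headers: fall back to UTF-8.
            parts.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(parts)


def load_training_set(archives):
    """Build (text, label) pairs from a {mbox_path: label} mapping."""
    examples = []
    for path, label in archives.items():
        # create=False: fail loudly instead of creating an empty archive.
        box = mailbox.mbox(path, create=False)
        try:
            for msg in box:
                examples.append((message_text(msg), label))
        finally:
            box.close()
    return examples
```

A call such as load_training_set({"/path/to/archive_a": "important",
"/path/to/archive_b": "unimportant"}) returns a list of examples that can be
fed to whatever trainer the filter provides. How the job is scheduled is left
open here.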
An interesting question after all this is: does it work? In short: yes, it
does. Around 90% of my incoming email is classified correctly, and I did not
miss any cool party as far as I know.