1 (edited by stevekez 2012-08-19 23:24:38)

Topic: IMAP based Bayesian spam

==== Required information ====
- iRedMail version: 0.8.1
- Store mail accounts in which backend (LDAP/MySQL/PGSQL): LDAP
- Linux/BSD distribution name and version: FreeBSD 9
====

I'm testing a spam learning approach that I thought was worth sharing because I think it's more flexible than the method currently in draft on the wiki.

The approach is similar to that in the wiki, insofar as the data is stored in SQL and amavis is configured in the same way:

http://www.iredmail.org/wiki/index.php? … yes.In.SQL

Where it differs is that I opted not to use markasjunk2, because most users use an IMAP mail client rather than webmail.

The solution was to create a special user account and share a pair of folders with all authorized users. The ACL is set so that users can see the shared folders and place mail into them, but cannot read their contents. This effectively makes them write-only black holes, so users cannot see each other's mail! I had to use telnet to set these up, but it was a one-off effort.
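The ACL step above can also be scripted instead of typed into telnet. What follows is only a minimal Python sketch, not the commands actually used: it assumes the server supports the IMAP ACL extension (RFC 4314), and the host, account, password, folder and user names are all hypothetical. The rights string "li" (lookup plus insert, with no "r" for read) is one way to get the write-only behaviour described; your server may need a different set.

```python
import imaplib

# Rights per RFC 4314: l = lookup (folder is visible), i = insert (append/copy into it).
# Leaving out r (read) is what makes the folder write-only for the grantee.
WRITE_ONLY_RIGHTS = "li"


def grant_write_only(conn, folders, users, rights=WRITE_ONLY_RIGHTS):
    """Issue SETACL for every (folder, user) pair; return the pairs that failed."""
    failed = []
    for folder in folders:
        for user in users:
            status, _ = conn.setacl(folder, user, rights)
            if status != "OK":
                failed.append((folder, user))
    return failed


def main():
    # Hypothetical host, credentials and names; substitute your own.
    conn = imaplib.IMAP4_SSL("mail.example.com")
    conn.login("learner@example.com", "secret")
    try:
        failed = grant_write_only(
            conn,
            ["Spam", "Ham"],
            ["alice@example.com", "bob@example.com"],
        )
        print("SETACL failed for:", failed)
    finally:
        conn.logout()
```

Calling main() logs in as the special account and grants the rights; grant_write_only() takes the connection as a parameter so it can be tried against a test mailbox first.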

I then modified Nick Burch's IMAP-SA-learn Perl script to log in as the special user and run sa-learn on the spam and ham folders. I had to update it to deal with newer command line arguments for sa-learn (e.g. --sync instead of --rebuild). I also tweaked it to move both spam and ham into archive folders once used by sa-learn.
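To give a rough idea of the learn-then-archive loop described above, here is a minimal Python sketch. It is not the modified IMAP-SA-learn Perl script; it only assumes that sa-learn is on the PATH and reads a single message from standard input, and the host, credentials and folder names are hypothetical.

```python
import imaplib
import subprocess


def run_sa_learn(kind, raw_message):
    """Feed one raw message to sa-learn on stdin; kind is "spam" or "ham"."""
    subprocess.run(["sa-learn", "--no-sync", "--" + kind],
                   input=raw_message, check=True)


def learn_folder(conn, folder, kind, archive, learn=run_sa_learn):
    """Learn each message in `folder`, then move it to `archive`.

    Returns the number of messages learned.
    """
    conn.select(folder)
    status, data = conn.search(None, "ALL")
    nums = data[0].split() if status == "OK" and data and data[0] else []
    learned = 0
    for num in nums:
        status, parts = conn.fetch(num, "(RFC822)")
        if status != "OK":
            continue
        learn(kind, parts[0][1])  # parts[0][1] holds the raw message bytes
        learned += 1
        # Copy to the archive, and only flag the original once the copy succeeded.
        status, _ = conn.copy(num, archive)
        if status == "OK":
            conn.store(num, "+FLAGS", "\\Deleted")
    conn.expunge()
    return learned


def main():
    # Hypothetical host, credentials and folder names; substitute your own.
    conn = imaplib.IMAP4_SSL("mail.example.com")
    conn.login("learner@example.com", "secret")
    try:
        learn_folder(conn, "Spam", "spam", "SpamArchive")
        learn_folder(conn, "Ham", "ham", "HamArchive")
    finally:
        conn.logout()
    # Flush the journal once at the end rather than after every message.
    subprocess.run(["sa-learn", "--sync"], check=True)
```

Calling main() from cron would process both folders in one pass. The learn parameter exists so the loop can be dry-run with a stand-in function before sa-learn touches the real Bayes data.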

I've yet to determine if this approach will prove successful in the long run. However, I can't see any reason why it would be any worse than the "markasjunk2 webmail" method. You could even create aliases to filter forwarded mail into the spam and ham folders (helping to deal with POP users, for example). However, I haven't done this because mail clients usually add new headers to forwarded mail and throw away others, which might confuse the learning process.

Questions and comments welcome. If it proves successful maybe it'll be a candidate for inclusion in an iRedMail release?

A quick question from me, while I'm here: in the Bayes database, the "bayes_token" table has its primary key on "id,token", and every row has the ID 1. Is it meant to be this way? It seems rather odd to me. I know this is an SA issue and not an iRedMail issue, but I thought somebody might know the answer.

Cheers,
Steve.

----

2

Re: IMAP based Bayesian spam

I've yet to find any other people running Bayesian learning with MySQL and IMAP folders. Did this work out for you?

I used to do this on my old server but now that we're databasing everything it's not quite as easy as pushing files around.

If you're still around, further examples of your scripts would be beneficial (or anyone else's, for that matter).