D&C Lug - Home Page
Devon & Cornwall Linux Users' Group


Re: [LUG] backup MX



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Simon Waters wrote:

> > I am writing a router and
> > transport config at the moment so there won't be bounces from backup MX
> > for addresses turned down by the primary.
>
> But we need such bounces for when people mistype addresses, and your
> primary mail server is down or unreachable.
>
> The RFCs exist for a reason ;)

Yup, but sod it, in the end this is for my personal domain, which serves me, 
my fiancee and some housemates. I'll make the messages queue frozen so that I 
know they're there at least. If I was running some sort of commercial or 
large scale thing I wouldn't do it like that... Thank god I'm not running 
some big thing with all the headaches involved.

> Been pondering what "nearest neighbours" are - but the big question for
> spam filtering is can you reasonably expect spam to look different from
> genuine email.

Indeed you can... There hasn't been much research at all into nearest 
neighbour methods in spam classification, but it does work. SpamKann, my 
implementation, works something like this:

* Give the training program a selection of your spam and ham mail.
* Training looks at the words in all of that mail, selecting the 150 most 
important spam words and 50 most important ham words, discounting words 
common to both.
* Training turns each mail into a 200-dimensional feature vector.
* The filter program is invoked from the MTA or mail client and breaks the 
incoming mail into a feature vector according to the words found in training.
* The filter program finds the approximate 5 nearest training mail vectors to 
the incoming mail and classifies it based on their distance from the incoming 
mail in the 200-dimensional feature space.
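
As a rough illustration of the steps above, here is a minimal Python sketch. 
It is not SpamKann's code. The word-importance score (difference in document 
frequency between spam and ham), the inverse-distance vote, and all function 
names are assumptions made for the example, and the neighbour search here is 
exact brute force rather than approximate.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption; a real filter would do more).
    return text.lower().split()


def select_words(spam_docs, ham_docs, n_spam=150, n_ham=50):
    """Pick up to n_spam spam-leaning and n_ham ham-leaning words."""
    spam_df = Counter(w for d in spam_docs for w in set(tokenize(d)))
    ham_df = Counter(w for d in ham_docs for w in set(tokenize(d)))
    # Hypothetical importance score: fraction of spam mails containing the
    # word minus fraction of ham mails containing it. Words equally common
    # to both score 0 and are never selected.
    score = {w: spam_df[w] / len(spam_docs) - ham_df[w] / len(ham_docs)
             for w in set(spam_df) | set(ham_df)}
    spam_words = sorted((w for w in score if score[w] > 0),
                        key=lambda w: (-score[w], w))[:n_spam]
    ham_words = sorted((w for w in score if score[w] < 0),
                       key=lambda w: (score[w], w))[:n_ham]
    return spam_words + ham_words


def to_vector(text, vocab):
    # One dimension per selected word, holding that word's count in the mail.
    counts = Counter(tokenize(text))
    return [float(counts[w]) for w in vocab]


def classify(text, vocab, training, k=5):
    """training is a list of (vector, label) pairs; label is 'spam' or 'ham'."""
    v = to_vector(text, vocab)
    neighbours = sorted(training, key=lambda item: math.dist(v, item[0]))[:k]
    votes = Counter()
    for vec, label in neighbours:
        # Closer neighbours get a larger say (assumed weighting).
        votes[label] += 1.0 / (1.0 + math.dist(v, vec))
    return max(votes, key=votes.get)


spam = ["cheap pills buy now", "buy cheap watches now", "win money now"]
ham = ["meeting at the pub tonight", "exim config for the list",
       "see you at the meeting"]
vocab = select_words(spam, ham)
training = ([(to_vector(d, vocab), "spam") for d in spam]
            + [(to_vector(d, vocab), "ham") for d in ham])
print(classify("buy cheap pills", vocab, training, k=3))
```

With a toy corpus this small the vocabulary has far fewer than 200 words, so 
the vectors are correspondingly shorter.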

I can get about 80% overall accuracy with 0.1% false positives on a test 
corpus of 15,000 emails. It remains to be seen whether I can get a nice high 
percentage mark for the project and go on to do a PhD in bioinformatics stuff 
at Exeter... need to try and land myself a first.
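
For reference, the two figures quoted above can be computed from a labelled 
test corpus roughly as follows. This is a sketch with an assumed definition: 
the false positive rate is taken to be the share of ham mails classified as 
spam, which may not be exactly how the 0.1% figure was measured. The function 
name is made up for the example.

```python
def accuracy_and_fp_rate(actual, predicted):
    """Return (overall accuracy, false positive rate) for 'spam'/'ham' labels."""
    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    ham_total = sum(a == "ham" for a in actual)
    # A false positive is a genuine (ham) mail that was classified as spam.
    false_pos = sum(a == "ham" and p == "spam" for a, p in pairs)
    accuracy = correct / len(pairs)
    fp_rate = false_pos / ham_total if ham_total else 0.0
    return accuracy, fp_rate


acc, fpr = accuracy_and_fp_rate(
    ["spam", "spam", "ham", "ham"],
    ["spam", "ham", "ham", "spam"],
)
print(acc, fpr)
```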

The problem with the system is that there are lots of parameters to play with. 
It's not as simple as Bayesian methods but it can be a lot faster than other 
stuff out there. I get about 500 classifications per second through it. Some 
quite clever stuff is done with tree representations of the nearest neighbour 
search space, and also getting approximate nearest neighbours rather than 
exact ones to save time.
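
To make the tree and approximation idea concrete, here is a small Python 
sketch using a k-d tree. The choice of a k-d tree, the eps parameter, and the 
function names are assumptions for illustration; the message does not say 
which tree structure SpamKann uses, and this version returns only a single 
neighbour rather than five.

```python
import math


def build(points, depth=0):
    """Build a k-d tree from a list of (vector, label) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid + 1:], depth + 1),
    }


def nearest(node, query, eps=0.0, best=None):
    """Return (distance, (vector, label)) for a nearest neighbour of query.

    With eps=0 the search is exact. With eps>0 more branches are skipped,
    and the result is at most (1 + eps) times further away than the true
    nearest neighbour.
    """
    if node is None:
        return best
    d = math.dist(query, node["point"][0])
    if best is None or d < best[0]:
        best = (d, node["point"])
    axis = node["axis"]
    diff = query[axis] - node["point"][0][axis]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, eps, best)
    # Visit the far side only if it could hold a sufficiently closer point.
    if abs(diff) < best[0] / (1.0 + eps):
        best = nearest(far, query, eps, best)
    return best


pts = [([0.0, 0.0], "ham"), ([5.0, 5.0], "spam"),
       ([1.0, 1.0], "ham"), ([6.0, 4.0], "spam")]
tree = build(pts)
dist, (vec, label) = nearest(tree, [5.5, 4.6], eps=0.5)
print(label, vec)
```

Raising eps trades accuracy for speed: fewer branches are explored, at the 
cost of sometimes returning a neighbour that is not quite the closest.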

> The problem I have is huge spam volumes, some of which is very similar to
> genuine email, thus the majority of what the Bayesian and similar
> filters let through is still spam.

Yeah, this is a big problem in spam classification. You have to be able to 
learn very subtle differences, without losing the ability to handle massive 
differences between blatant spam and ham mail. It's all about identifying the 
right features to classify on.

- -- 
Dave Trudgian - Cornish Dave
- ----------------------------
[w] www.trudgian.net
[e] dave@xxxxxxxxxxxx
[j] trudgiad@xxxxxxxxxxxxxxx

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFAfG+Ut+PdOLWW6O4RAlucAJwKxW5Ae8sYlPOZ18AivvhoakvirQCcCWv/
la+B3AYB6JxRnlIqWD30QEw=
=7yGG
-----END PGP SIGNATURE-----


