-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Simon Waters wrote:
> > I am writing a router and transport config at the moment so there won't be bounces from backup MX for addresses turned down by the primary.
>
> But we need such bounces for when people mistype addresses, and your primary mail server is down or unreachable. The RFCs exist for a reason ;)
Yup, but sod it, in the end this is for my personal domain, which serves me, my fiancee and some housemates. I'll make the messages queue frozen so that I know they're there at least. If I was running some sort of commercial or large scale thing I wouldn't do it like that... Thank god I'm not running some big thing with all the headaches involved.
> Been pondering what "nearest neighbours" are - but the big question for spam filtering is whether you can reasonably expect spam to look different from genuine email.
Indeed you can... There hasn't been much research at all into nearest neighbour methods in spam classification but it does work. SpamKann, my implementation, works something like this:

* Give the training program a selection of your spam and ham mail.
* Training looks at the words in all of that mail, selecting the 150 most important spam words and 50 most important ham words, discounting words common to both.
* Training turns all the mails into 200-dimensional feature vectors.
* The filter program is invoked from the MTA or mail client and breaks the mail into a feature vector according to the words found in training.
* The filter program finds the approximate 5 closest training mail vectors to the incoming mail and classifies based on their distance from the incoming mail in the 200-dimensional feature space.

I can get about 80% overall accuracy with 0.1% false positives on a test corpus of 15,000 emails. It remains to be seen whether I can get a nice high percentage mark for the project and go on to do a PhD in bioinformatics stuff at Exeter... need to try and land myself a first.

The problem with the system is that there are lots of parameters to play with. It's not as simple as Bayesian methods but it can be a lot faster than other stuff out there. I get about 500 classifications per second through it. Some quite clever stuff is done with tree representations of the nearest neighbour search space, and also with getting approximate nearest neighbours rather than exact ones to save time.
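To make the steps above a bit more concrete, here is a rough Python sketch of the general idea. It is not SpamKann's actual code. The function names, the word "importance" score (difference in the fraction of spam and ham mails containing a word), the 0/1 features, Euclidean distance and the distance-weighted vote are all assumptions made for illustration. It also does an exact brute-force search over the training vectors instead of the tree-based approximate search described above, so it won't be anywhere near as fast.

```python
from collections import Counter
import math


def tokenize(text):
    # Crude whitespace tokenizer; a real filter would parse headers, MIME, etc.
    return set(text.lower().split())


def select_words(spam, ham, n_spam=150, n_ham=50):
    """Pick the most spam-like and most ham-like words (assumed scoring)."""
    spam_df = Counter(w for m in spam for w in tokenize(m))
    ham_df = Counter(w for m in ham for w in tokenize(m))
    vocab = set(spam_df) | set(ham_df)
    # Score > 0 leans spam, < 0 leans ham; words equally common in both score 0
    # and are never selected.
    score = {w: spam_df[w] / len(spam) - ham_df[w] / len(ham) for w in vocab}
    ranked = sorted(vocab, key=lambda w: (score[w], w))
    ham_words = [w for w in ranked[:n_ham] if score[w] < 0]
    spam_words = [w for w in reversed(ranked) if score[w] > 0][:n_spam]
    return spam_words + ham_words


def vectorize(text, words):
    # One dimension per selected word: 1.0 if present in the mail, else 0.0.
    present = tokenize(text)
    return [1.0 if w in present else 0.0 for w in words]


def train(spam, ham, **kw):
    """Return the selected words and the labelled training vectors."""
    words = select_words(spam, ham, **kw)
    training = [(vectorize(m, words), "spam") for m in spam]
    training += [(vectorize(m, words), "ham") for m in ham]
    return words, training


def classify(text, words, training, k=5):
    """Label a mail by a distance-weighted vote of its k nearest neighbours."""
    v = vectorize(text, words)
    # Exact search: sort every training vector by Euclidean distance.
    nearest = sorted(training, key=lambda t: math.dist(v, t[0]))[:k]
    votes = Counter()
    for vec, label in nearest:
        votes[label] += 1.0 / (1.0 + math.dist(v, vec))  # closer counts more
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    spam = ["cheap pills buy now", "buy cheap watches now", "win money now"]
    ham = ["meeting at noon tomorrow", "lunch tomorrow at the pub",
           "lug meeting agenda"]
    words, training = train(spam, ham)
    print(classify("buy cheap pills", words, training))
```

With a real corpus you would feed in whole mails rather than one-liners, and the 150/50 word counts and k=5 are exactly the sort of parameters that need tuning.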
> The problem I have is huge spam volumes, some of which is very similar to genuine email, thus the majority of what the Bayesian and similar filters let through is still spam.
Yeah, this is a big problem in spam classification. You have to be able to learn very subtle differences, without losing the ability to handle massive differences between blatant spam and ham mail. It's all about identifying the right features to classify on.

- -- 
Dave Trudgian - Cornish Dave
- ----------------------------
[w] www.trudgian.net
[e] dave@xxxxxxxxxxxx
[j] trudgiad@xxxxxxxxxxxxxxx
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFAfG+Ut+PdOLWW6O4RAlucAJwKxW5Ae8sYlPOZ18AivvhoakvirQCcCWv/
la+B3AYB6JxRnlIqWD30QEw=
=7yGG
-----END PGP SIGNATURE-----

--
The Mailing List for the Devon & Cornwall LUG
Mail majordomo@xxxxxxxxxxxx with "unsubscribe list" in the
message body to unsubscribe.