
Re: Spam Filtering - Was Re: [LUG] backup MX



Dave Trudgian wrote:
> Simon Waters wrote:


>>> Wetware rules!

>> Nah, we are only good at spotting the obvious spams, and the computers do
>> that much quicker anyway.
>>
>> I suspect we only do better because the spam filters tend to ignore
>> what they don't understand, whereas we count some of it against the message.


> There is a commonly held view that an automatic spam filter shouldn't
> produce any false positives. This generally prevents people implementing
> spam filters that generalise aggressively. Such filters are much better at
> classifying spam, but could misclassify some ham.

Misclassification isn't as big a problem if, as a result, you issue a 5xx
error. I think the problem is we've tended to "post-filter" after the MTA
rather than before.

Although if you have backup MXs, or otherwise break the point-to-point
model, you could misclassify the bounce of a misclassified genuine
message.
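
To make that concrete, here is a minimal sketch of the two approaches.
score_message(), THRESHOLD and send_bounce() are hypothetical stand-ins,
not any particular MTA's API:

# Sketch only: score_message() and THRESHOLD stand in for whatever
# classifier and cutoff you actually use.

THRESHOLD = 5.0

def score_message(message: bytes) -> float:
    """Placeholder spam score; a real filter goes here."""
    return 0.0

def smtp_time_decision(message: bytes) -> str:
    """Decide during the SMTP transaction, before accepting the mail.

    Returning 5xx here means the *sending* MTA still holds the message
    and generates any bounce, so a misclassified genuine mail is
    reported back to its real author rather than silently lost.
    """
    if score_message(message) >= THRESHOLD:
        return "550 5.7.1 Message rejected as spam"
    return "250 OK"

def post_accept_decision(message: bytes, envelope_sender: str) -> None:
    """Filter after the MTA has already said 250 OK.

    Now a misclassification forces us to emit a bounce to the envelope
    sender, which is routinely forged in spam; and with a backup MX in
    the path that bounce may itself be misclassified on the way back.
    """
    if score_message(message) >= THRESHOLD:
        send_bounce(envelope_sender, "Your message was classified as spam")

def send_bounce(address: str, reason: str) -> None:
    """Placeholder for generating a delivery status notification."""
    ...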

> A recent MIT spam conference discussed this. Those present seemed to think
> that we need to change our ideas about what is acceptable performance from
> spam filters. I'm sceptical whether people will start looking at false
> positives as acceptable in order to get a spam filter that generalises
> well. We shall see how things develop.

Having gone with TMDA, I have false positives, but they are largely
machine-generated emails - not all false positives are equal.
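
TMDA works by challenge-response: mail from unknown senders is held until
the sender confirms. A simplified sketch of the idea (not TMDA's actual
code; the names and storage here are invented) also shows why the false
positives skew towards machine-generated mail:

import uuid

whitelist = {"friend@example.org"}
pending = {}  # confirmation token -> (sender, held message)

def handle_incoming(sender: str, message: str) -> str:
    if sender in whitelist:
        return "deliver"
    # Unknown sender: hold the message and send a confirmation request.
    token = uuid.uuid4().hex
    pending[token] = (sender, message)
    send_challenge(sender, token)
    # Machine-generated mail (receipts, list traffic, monitoring
    # alerts) never answers the challenge, which is why most false
    # positives are exactly that kind of mail.
    return "held"

def handle_confirmation(token: str) -> str:
    # A human replied with the token: release the mail and whitelist
    # the sender for next time.
    if token not in pending:
        return "unknown token"
    sender, message = pending.pop(token)
    whitelist.add(sender)
    return "deliver: " + message

def send_challenge(sender: str, token: str) -> None:
    """Placeholder: mail the sender asking them to echo the token back."""
    ...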

>> For example, quite a lot of the spam getting past SpamAssassin
>> deliberately misspells all the obvious keywords - well, I spot "Vaigra"
>> and hit delete.
>>
>> Since quick and effective spell-checking tools exist, I dare say this is
>> a class of spam we could kill if anyone cared enough to code it.


> Perfectly feasible stuff - I looked at doing this, in fact. The trouble is
> getting the balance right between catching a few more spam mails and
> taking longer to work out which features to classify on.

Err on the side of better classification; computational efficiency is
good, but it needs to be effective first.
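
For concreteness, here is a sketch of that kind of check using Python's
standard difflib; the keyword list and similarity cutoff are invented for
illustration:

import difflib

# Illustrative trigger words; a real filter would use a much larger
# list and tune the cutoff against real mail.
KEYWORDS = ["viagra", "mortgage", "pharmacy", "unsubscribe"]

def obfuscated_keywords(text: str, cutoff: float = 0.8) -> list[str]:
    """Return words that are near-misses of known spam keywords.

    difflib.get_close_matches does fuzzy matching on a similarity
    ratio, so "Vaigra" scores close to "viagra" even though a literal
    keyword match fails.
    """
    hits = []
    for word in text.lower().split():
        if word in KEYWORDS:
            continue  # exact matches are caught by the normal rules
        if difflib.get_close_matches(word, KEYWORDS, n=1, cutoff=cutoff):
            hits.append(word)
    return hits

print(obfuscated_keywords("Cheap Vaigra from our pharmacey"))
# ['vaigra', 'pharmacey']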

> A quote from Paul Graham's "A Plan for Spam":
>
> "The Achilles heel of the spammers is their message. They can circumvent
> any other barrier you set up. They have so far, at least. But they have to
> deliver their message, whatever it is. If we can write software that
> recognises their messages, there is no way they can get around that."

Too simple - the message is too easy to hide from the machines.
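
For reference, the filtering Graham proposes is naive Bayesian
classification over tokens. A toy version of his combining rule (the
per-token probabilities below are invented; Graham derives them from
counts in real spam and ham corpora) shows how it works, and also how a
misspelled token slips past it:

token_spam_prob = {
    "viagra": 0.99,
    "free": 0.80,
    "meeting": 0.05,
    "linux": 0.02,
}

UNSEEN = 0.4  # Graham's default for tokens never seen before

def combined_probability(tokens: list[str]) -> float:
    """Combine per-token probabilities as in "A Plan for Spam":

        P = (p1 * ... * pn) / (p1 * ... * pn + (1-p1) * ... * (1-pn))
    """
    prod = 1.0
    inv_prod = 1.0
    for t in tokens:
        p = token_spam_prob.get(t, UNSEEN)
        prod *= p
        inv_prod *= 1.0 - p
    return prod / (prod + inv_prod)

print(round(combined_probability(["free", "viagra"]), 3))   # ~0.997: spam
print(round(combined_probability(["linux", "meeting"]), 3)) # ~0.001: ham
# "vaigra" has never been seen, so it only contributes the neutral 0.4:
print(round(combined_probability(["free", "vaigra"]), 3))   # ~0.727: borderline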

> If you can come up with something that filters well on content then the
> message doesn't get through.

This isn't my experience.


