Devon & Cornwall Linux Users' Group


Spam Filtering - Was Re: [LUG] backup MX



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Simon Waters wrote:

Wetware rules!

Nah we are only good at spotting the obvious spams, the computers do
that much quicker anyway.

I suspect we are only doing better as the spam filters tend to ignore
what they don't understand, whereas we count some of it against the mail.

There is a commonly held view that an automatic spam filter shouldn't produce
any false positives. This generally stops people implementing spam filters
that generalise a large amount. Such filters are much better at classifying
spam, but could misclassify some ham.

There is also a common view that an automatic spam filter should perform as a 
human does when identifying spam since we are good at it. It's been proposed 
that this idea doesn't agree with the first... How many times have you 
accidentally deleted a good mail in a flurry of bashing "del"? 

A recent MIT spam conference discussed this. Those present seemed to think
that we need to change our ideas about what is acceptable performance from
spam filters. I'm sceptical about whether people will start looking at false
positives as acceptable in order to get a spam filter that generalises well.
We shall see how things develop.

For example, quite a lot of the spam getting past SpamAssassin
deliberately misspells all the obvious keywords - well, I spot "Vaigra"
and hit delete.

Since quick and effective spell checking tools exist, I dare say this is
a class of spam we could kill if anyone cared enough to code it.

Perfectly feasible stuff; I looked at doing this, in fact. The trouble is
getting the balance right between catching a few more spam mails and taking
longer to work out what features to classify on.
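
To illustrate the misspelling idea, here is a minimal Python sketch that uses
fuzzy matching from the standard library's difflib rather than a full spell
checker. The keyword list, the 0.8 similarity cutoff and the function name
find_disguised_keywords are all assumptions made for the example; none of it
is taken from an existing filter.

```python
import difflib
import re

# Hypothetical keyword list; a real filter would need a far larger one.
SPAM_KEYWORDS = ["viagra", "mortgage", "refinance"]


def find_disguised_keywords(text, keywords=SPAM_KEYWORDS, cutoff=0.8):
    """Return (word, keyword) pairs where word is a near-miss spelling of keyword."""
    hits = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in keywords:
            continue  # exact spellings are left to ordinary keyword rules
        # cutoff is an assumed similarity threshold between 0 and 1
        match = difflib.get_close_matches(word, keywords, n=1, cutoff=cutoff)
        if match:
            hits.append((word, match[0]))
    return hits


print(find_disguised_keywords("Buy Vaigra now"))
```

Note that every word in every mail is compared against every keyword, which
is exactly the sort of extra feature extraction time that has to be weighed
against the few more spams caught.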

SpamAssassin, with its growing ruleset, is already a monster when it comes to
feature extraction times. A recent paper put it at 1784s per 1000 messages
(about 1.8s per message) on an Athlon XP 1800. It doesn't score much better
than methods like mine that consider only 200 words and how often they occur
in each mail. There is a huge difference in speed, however.
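
As a rough sketch of what word-count feature extraction over a fixed
vocabulary can look like: the code below turns a mail into one count per
vocabulary word. How the 200 words are chosen is not described here, so
picking the most frequent words in a training corpus is purely an assumption,
as are the function names and the tokenising regex.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z']+")  # assumed tokeniser: runs of letters and apostrophes


def select_vocabulary(messages, size=200):
    """Pick the `size` most frequent words in a corpus (assumed selection rule)."""
    counts = Counter()
    for text in messages:
        counts.update(TOKEN.findall(text.lower()))
    return [word for word, _ in counts.most_common(size)]


def extract_features(text, vocabulary):
    """Return one occurrence count per vocabulary word, in vocabulary order."""
    counts = Counter(TOKEN.findall(text.lower()))
    return [counts[word] for word in vocabulary]


vocabulary = ["free", "money", "meeting"]
print(extract_features("Free money! FREE offer", vocabulary))
```

Extraction here is a single pass over the words of each mail plus one lookup
per vocabulary word, which is why this kind of approach is so much cheaper
than evaluating a large ruleset.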

So to solve the spam problem, first, solve the AI Problem.

Nope, you can pretty much solve the spam problem today by checking the
sender is known to you. It's crude, but even OE gives you a button to do it.

A quote from Paul Graham's "A Plan for Spam":

"The Achilles heel of the spammers is their message. They can circumvent any 
other barrier you set up. They have so far, at least. But they have to 
deliver their message, whatever it is. If we can write software that 
recognises their messages, there is no way they can get around that."

I tend to agree with this. Yes, methods such as signing mail can stop things
dead, but they aren't practical for all. Spamming is a business at the end of
the day. If you can come up with something that filters well on content then
the message doesn't get through. If the message doesn't get through then the
spammer makes no profit.

An interesting paper comparing spam filtering techniques can be found at:

http://nexp.cs.pdx.edu/twiki-psam/pub/PSAM/PsamDocumentation/spam.pdf

It was part of a USENIX conference so it's quite readable.

- -- 
Dave Trudgian - Cornish Dave
- ----------------------------
[w] www.trudgian.net
[e] dave@xxxxxxxxxxxx
[j] trudgiad@xxxxxxxxxxxxxxx

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFAfTBct+PdOLWW6O4RAhe6AJ9OFFVkFzO16VtNe8pujZ5bA8EneACfUvWQ
5yw9c8zA0rWQBfeTCYhI23I=
=ptDz
-----END PGP SIGNATURE-----




