SLUG Mailing List Archives

[chat] Re: [SLUG] Meaning of Nonsense in s p a m


Hi Mark,

I read the Paul Graham article, and quite a few others, some months ago.  If
I recall correctly, it said that testing showed single words were enough to
identify spam for now.  (This was simpler than I expected, too: no grammar
analysis, no word order, etc.)  It seemed more important to weight particular
words correctly, producing an overall score that indicates quite sharply
whether an email is spam or not.
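
As a rough illustration of that per-word weighting, here is a minimal sketch
loosely modeled on the scheme in the Paul Graham article.  It is simplified
(Graham's version also double-counts good-mail words and ignores rare ones),
and the function names tokenize, word_probs and spam_score are made up for
this example.  The 0.4 default for unseen words, the 0.01/0.99 clamps and the
"15 most interesting words" rule are taken from that article as I remember it.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase words only; no grammar or word-order analysis.
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def word_probs(spam_msgs, ham_msgs):
    # Estimate, for each word, the probability that a message containing it
    # is spam. Assumes both training lists are non-empty.
    spam_counts = Counter(t for m in spam_msgs for t in tokenize(m))
    ham_counts = Counter(t for m in ham_msgs for t in tokenize(m))
    probs = {}
    for w in set(spam_counts) | set(ham_counts):
        s = min(1.0, spam_counts[w] / len(spam_msgs))
        h = min(1.0, ham_counts[w] / len(ham_msgs))
        # Clamp so that no single word is ever treated as absolute proof.
        probs[w] = min(0.99, max(0.01, s / (s + h)))
    return probs

def spam_score(text, probs, unknown=0.4, top_n=15):
    # Look at each distinct word once; unseen words lean slightly innocent.
    ps = [probs.get(w, unknown) for w in set(tokenize(text))]
    # Keep only the words furthest from neutral (0.5).
    ps.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod_spam = prod_ham = 1.0
    for p in ps[:top_n]:
        prod_spam *= p
        prod_ham *= 1.0 - p
    # Combined probability; 0.5 if the message had no words at all.
    return prod_spam / (prod_spam + prod_ham)
```

Because the per-word probabilities are multiplied together, a handful of
strongly weighted words pushes the result very close to 0 or 1, which is why
the overall score separates spam from non-spam so sharply.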

In a short message, it is clearly not possible to say much without using
the words that indicate spam.  At worst, all the spammers can say is "visit
our web site"; anything more and they have to use a spam word.

In the last few months, as you say, there have been two developments: weird
HTML used to break up words and disguise the spam, and lots of meaningless
phrases.  Clearly the spam filter must remove all HTML and process just the
text, and now it looks like non-dictionary words will have to be discarded
too.  The problem of people deliberately misspelling words, putting a one
for an L or dots between all the letters and so on, will now escalate.
Another tactic will be more embedded GIF and JPG files in emails, to bypass
filters by having no text in the body of the message.
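
The clean-up step described above could be sketched roughly as follows.  This
is only an illustration under stated assumptions: the normalize function and
the SUBSTITUTIONS table are invented for the example, the regex-based tag
stripping is crude compared with a real HTML parser, and the digit-to-letter
mapping covers only the "one for an L" style of trick and will mangle
legitimate tokens that mix letters and digits.

```python
import re
from html import unescape

# Hypothetical substitution table: digits commonly swapped in for letters.
SUBSTITUTIONS = str.maketrans({"1": "l", "0": "o"})

def normalize(body):
    # Drop tags and comments, including empty ones used to split a word.
    text = re.sub(r"<[^>]*>", "", body)
    # Decode entities such as &amp; that remain in the text.
    text = unescape(text)
    # Rejoin single letters separated by dots, dashes, underscores or stars,
    # e.g. "v.i.a.g.r.a" becomes "viagra".
    text = re.sub(
        r"\b(?:\w[.\-_*]){2,}\w\b",
        lambda m: re.sub(r"[.\-_*]", "", m.group()),
        text,
    )
    words = []
    for word in text.split():
        # Only undo digit tricks in tokens that contain letters, so that
        # plain numbers such as "2003" are left alone.
        if any(c.isalpha() for c in word):
            word = word.translate(SUBSTITUTIONS)
        words.append(word.lower())
    return " ".join(words)
```

The output of normalize would then be fed to the word-scoring filter in place
of the raw message body.  Messages that carry their text only inside an
image are untouched by this approach.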

Another tactic will be to put the spam message at the start of the email and
follow it up with a whole lot of good scientific or positive text at the far
end, to try and pass the filter with the good words.  

All my non-programmer friends are utterly fed up with spam.  Actually, to
the point where their use of email is decreasing.  Gee, where is that Senator
Alston when we need him, he could just make it illegal and it would go away.  

Actually, my prediction is that self-training filters will become
commonplace in the next year, and we will have to get used to losing about
one percent of good emails wrongly classified as spam.

IMHO, none of the subscription filters will last the distance; who wants to
pay every month for an update that is instantly out of date?  Spammers can
change addresses faster than the subscription filters can publish the
results.  A free self-training filter is the way to go, but programs such
as bogofilter have copped a lot of flak, as there is no ongoing money
to be made down that path.  I look forward to spam filters being on every
magazine cover CD within the next 12 months, and sending spam will become
uneconomic.  In the meantime, the subscription filters and the scramble to
make money will give the filtering industry a bad name.  There is a lengthy
magazine article (Salon mag??) which says spam has a very low response rate,
say only 40 per million.  IMHO, if we can halve this rate, spam will
disappear as uneconomic.  

Incidentally, I complained to a big web site in the USA about a week ago.
Immediately a robot sent me an email.  I had to reply to register myself as
a real sender, and I had to type a password by copying it from a GIF
image!!!  Then I got a reply saying I had been added to the "white list",
and my email would be read.  That's another way of handling spam by
rejecting it, and let's hope no one is given a patent on such an
"innovative" idea, as it's utterly obvious and as old as the hills, like
"Who goes there?".
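
For what it's worth, the "white list" handshake described above can be
sketched in a few lines.  This is a guess at the general shape, not how that
web site actually works: the ChallengeGate class and its method names are
hypothetical, and the token is returned as plain text here, whereas the site
sent it as an image so that a robot could not read it back.

```python
import secrets

class ChallengeGate:
    """Hold mail from unknown senders until they answer a challenge."""

    def __init__(self):
        self.whitelist = set()
        self.pending = {}  # sender -> (token, list of held messages)

    def receive(self, sender, message):
        sender = sender.lower()
        if sender in self.whitelist:
            return ("deliver", message)
        # First contact: issue one token and hold everything until confirmed.
        if sender not in self.pending:
            self.pending[sender] = (secrets.token_hex(4), [])
        token, held = self.pending[sender]
        held.append(message)
        return ("challenge", token)

    def confirm(self, sender, token):
        sender = sender.lower()
        entry = self.pending.get(sender)
        if entry is None or entry[0] != token:
            return []  # wrong or unknown token: nothing is released
        del self.pending[sender]
        self.whitelist.add(sender)
        return entry[1]  # held messages can now be delivered
```

A sender who never answers the challenge simply stays in pending, so bulk
mail from forged or unattended addresses is never delivered.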

Brian
======================================================================
At 03:03 AM 14/06/03 -0700, you wrote:
>It's got to do with fooling Bayesian statistical filters, I think. See
>this article for a description:
>http://www.paulgraham.com/spam.html
>
> I guess what normally happens is that if a message is identified as
>spam, its keywords are added to the filter's database of
>spam-keywords. These messages make sure that a bunch of nonsense is
>added to the database at the same time, eventually clogging it up with
>nonsense.
>
>That's just a guess - I'm no expert on this stuff. Surely it's easy
>enough to filter out the nonsense, and _then_ run the message through
>the filter. 
>
>If it was me, I'd be looking for _collocations_ of keywords (e.g.
>'prescription' only if it co-occurs within a given span with 'viagra'
>or 'valium'...). But like I said, this isn't my field.
>
>Are there any filters that attempt to scan for grammatical patterns,
>like for example imperatives "Visit our exciting web-site!" or clauses
>without a finite verb "Free viagra content here!" ? Or do they all just
>work at the vocabulary level?
>
>mark
>