On Sat, 2002-11-09 at 11:52, Richard Ibbotson wrote:
> The most common feature throughout the spam that I receive is that
> there's a lot of HTML formatted mail. Is there any way that I can
> write a regex to cut this out ?

What tool do you intend to use?

    /<\/?[Hh][Tt][Mm][Ll][^>]*>/

would be an obvious (mostly correct) test, but it depends on what tool
you're using.

I personally use SpamAssassin, and find I don't get any false positives
(apart from spams people forward to me in the hopes I might find them
humorous :). It weights mails based on the presence of certain
criteria - like HTML mail - and if the total passes a threshold you set,
the mail gets marked as spam:

SPAM: -------------------- Start SpamAssassin results
SPAM: This mail is probably spam. The original message has been altered
SPAM: so you can recognise or block similar unwanted mail in future.
SPAM: See http://spamassassin.org/tag/ for more details.
SPAM:
SPAM: Content analysis details: (12 hits, 5 required)
SPAM: Hit! (2.4 points) 'Message-Id' was added by a relay (2)
SPAM: Hit! (1.3 points) 'Received:' has 'may be forged' warning
SPAM: Hit! (0.6 points) From: does not include a real name
SPAM: Hit! (1.5 points) BODY: Asks you to click below
SPAM: Hit! (3.0 points) URI: Uses a dotted-decimal IP address in URL
SPAM: Hit! (0.0 points) BODY: Includes a URL link to send an email
SPAM: Hit! (3.2 points) HTML-only mail, with no text version
SPAM:
SPAM: -------------------- End of SpamAssassin results

It then adds "*****SPAM******" to the subject line, and I can filter it
out in Evolution. You can mark spam in any way you like though, even
invisibly with a mail header:

X-Spam-Status: Yes, hits=12.0 required=5.0
    tests=MSG_ID_ADDED_BY_MTA_2,MAY_BE_FORGED,NO_REAL_NAME,CLICK_BELOW,
    NORMAL_HTTP_TO_IP,MAILTO_LINK,CTYPE_JUST_HTML
    version=2.20

Which you can also filter out.

Cheers,
Alex.
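
For illustration only, here is a rough Python sketch of the two checks described above: the regex test for an HTML tag, and filtering on the X-Spam-Status header that SpamAssassin adds. It is not part of SpamAssassin or Evolution. The function names looks_like_html and is_marked_spam are made up for this example, and it assumes the header value starts with "Yes" for flagged mail, as in the sample above.

```python
import re
from email import message_from_string

# Same pattern as the regex in the message: an opening or closing <html>
# tag in any letter case, optionally followed by attributes.
HTML_TAG = re.compile(r"</?[Hh][Tt][Mm][Ll][^>]*>")


def looks_like_html(body: str) -> bool:
    """Return True if the text contains an <html> or </html> tag."""
    return HTML_TAG.search(body) is not None


def is_marked_spam(raw_message: str) -> bool:
    """Return True if the X-Spam-Status header value starts with 'Yes'.

    Assumes the header format shown in the message above; a mail with no
    such header is treated as not marked.
    """
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    return status.strip().lower().startswith("yes")
```

The regex alone is a blunt test, since legitimate HTML mail matches it too. That is why a weighted score with a threshold, as SpamAssassin uses, tends to give fewer false positives.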