Feature Extraction for Spam Classification

File Type: PDF
Item Type: Masters (Taught), Master of Science (M.Sc.)
Date: 2004-09
Abstract:
E-mail has emerged as one of the primary means of communication in the world today. Its rapid adoption has left it ripe for misuse and abuse, which has come in the guise of Unsolicited Commercial E-mail (UCE), otherwise known as spam.
For a time spam was considered only a nuisance, but, due mainly to the sheer amount being sent, it has become a major problem. The volume of spam has reached epidemic proportions, with estimates that up to 80% of all e-mail sent is spam.
Spam filtering offers a way to curb the problem. Identifying and removing spam from the e-mail delivery system allows end-users to regain a useful means of communication. Much of the research in spam filtering has centred on more sophisticated classifiers. This thesis begins to investigate the impact of applying more sophistication to the lower layers of the filtering process, namely the extraction of information from e-mail.
Several types of obfuscation are discussed; these are becoming ever more prevalent in spam in an attempt to confuse and circumvent current filtering processes. The results obtained show that removing certain types of obfuscation improves the classification process.
The main theory under investigation was the impact of pair tokens on the classification process. It is quite reasonable to think that pairs of tokens will offer more value than single tokens alone. For example, "enlarge your" seems to carry more information than either token on its own. However, the results obtained show conclusively that pair tokens offer no value and in fact increase error across three independent data sets.
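To make the idea of pair tokens concrete, the following is a minimal sketch of how single tokens and pair tokens might be extracted from a piece of e-mail text. The abstract does not describe the tokenisation rules or how pairs were formed, so this sketch assumes simple lowercase word splitting and pairs built from adjacent tokens only. The function names (single_tokens, pair_tokens, extract_features) are hypothetical and are not taken from the thesis.

```python
import re


def single_tokens(text):
    """Lowercase the text and split it into simple word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def pair_tokens(tokens):
    """Join each token with its immediate successor to form a pair token."""
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


def extract_features(text, use_pairs=True):
    """Return single tokens, optionally followed by adjacent pair tokens."""
    tokens = single_tokens(text)
    if use_pairs:
        return tokens + pair_tokens(tokens)
    return tokens


if __name__ == "__main__":
    # Illustrative input only; not drawn from the thesis data sets.
    print(extract_features("Enlarge your inbox today"))
```

Toggling use_pairs makes it possible to compare a classifier trained on single tokens alone against one that also sees pair tokens, which is the kind of comparison the abstract reports on.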
Author: Davy, Michael
Advisor: Cunningham, Padraig
Type of material: Masters (Taught), Master of Science (M.Sc.)
Availability: Full text available
Keywords: Computer Science