Title: SPAM FILTERING USING COMPRESSION
Author(s): M. Farmer, G. Richard, F. Faure, M. Lopusniac
ISBN: 978-972-8924-49-2
Editors: Sandeep Krishnamurthy and Pedro Isaías
Year: 2007
Edition: Single
Keywords: Spam, Kolmogorov Complexity, Compression, Clustering, K-Nearest Neighbors
Type: Full Paper
First Page: 67
Last Page: 74
Language: English
Paper Abstract:
One of the most undesirable side effects of e-commerce technology is the rise of spamming as an e-marketing
technique. Spam emails (unsolicited commercial emails) are a burden for everybody who has an electronic mailbox.
Detecting and filtering spam is therefore a challenging task, and many approaches have been developed to identify a spam
message before it is delivered to the end user's mailbox. In this paper, we focus on a relatively new approach whose
foundations rely on the work of A. Kolmogorov. The main idea is to give a formal meaning to the notion of information
content and to provide a measure of this content. Using such a quantitative approach, it becomes possible to define a
distance, which is a major tool for classification purposes. To validate our approach, we proceed in two steps. First, we
apply the classical compression distance to a mix of spam and legitimate emails to check whether they can be properly
clustered without any supervision. This proved to be the case, revealing a kind of underlying structure in the spam emails.
In a second step, we implement a k-nearest neighbors algorithm, which achieves an accuracy of 85%. Coupled with other
anti-spam techniques, compression-based methods could be a great help in meeting the spam filtering challenge.
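
The abstract does not state which compressor or which exact distance formula was used. As a minimal sketch, assuming the normalized compression distance (NCD) with zlib as a stand-in compressor, the two steps described above (a compression-based distance, then a k-nearest neighbors vote) might look like the following. The function names, the choice of zlib, and the default value of k are illustrative assumptions, not the paper's implementation.

```python
import zlib
from collections import Counter


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, used as a computable
    stand-in for Kolmogorov complexity (an assumption of this sketch)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    Small values suggest the two inputs share a lot of structure."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_classify(message: bytes, labeled, k: int = 3) -> str:
    """Label a message by majority vote among its k nearest
    labeled examples under NCD.

    `labeled` is a list of (message_bytes, label) pairs; this layout
    is hypothetical and chosen only for the example."""
    nearest = sorted(labeled, key=lambda item: ncd(message, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

A call such as `knn_classify(new_email_bytes, training_pairs, k=3)` would return the majority label among the three closest training emails. Real compressors only approximate information content, so results depend on the compressor chosen and on message length.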