Title: DOCUMENTS CLUSTERING FOR NOISE REMOVAL
Author(s): Abdulmohsen Algarni, Nasser Tairan
ISBN: 978-989-8704-10-8
Editors: Ajith P. Abraham, Antonio Palma dos Reis and Jörg Roth
Year: 2014
Edition: Single
Keywords: Text mining, Clustering, Pre-processing, Dimension reduction
Type: Short Paper
First Page: 185
Last Page: 189
Language: English
Paper Abstract:
Term-based approaches can extract many features from text documents, but most of these features include noise. Many popular text mining strategies have been adapted to reduce noisy information in the extracted features, yet some noise features remain. These noise features are extracted from the same training documents as the good features. The main problem, therefore, is that some training documents contain a large amount of noise data. Reducing the noise data in the training documents would help to reduce noise in the extracted features. Moreover, we believe that removing some training documents (those that contain more noise data than useful data) can help to improve the effectiveness of a classifier. Clustering can help to reduce the effect of noise data. The main problem of clustering is defined as finding groups of similar objects in the data. In this paper, we introduce a methodology that uses a clustering algorithm to group training data before it is used. We also test our theory that not all training documents are useful in training the classifier.
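
The abstract does not say which clustering algorithm or which removal rule the paper uses, so the following is only a minimal sketch of the general idea: group the training documents first, then discard those that look like noise. Everything in it is an assumption made for illustration: the bag-of-words vectors, the k-means-style loop with cosine similarity, the hand-picked seed documents, and the rule that documents in very small clusters are treated as noisy. The names tf_vector, cluster, filter_noisy, seeds and min_cluster_size are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term counts for one document (lower-cased, whitespace split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Mean term vector of a non-empty group of documents."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})


def cluster(vectors, seeds, rounds=10):
    """k-means-style grouping; seeds are indices of the initial centre documents."""
    centers = [vectors[i] for i in seeds]
    assign = []
    for _ in range(rounds):
        # Assign each document to its most similar centre (ties go to the first).
        new_assign = [
            max(range(len(centers)), key=lambda k: cosine(v, centers[k]))
            for v in vectors
        ]
        if new_assign == assign:
            break  # assignments are stable
        assign = new_assign
        # Recompute each centre from its current members.
        for k in range(len(centers)):
            members = [v for v, a in zip(vectors, assign) if a == k]
            if members:
                centers[k] = centroid(members)
    return assign


def filter_noisy(docs, seeds, min_cluster_size=2):
    """Keep only documents whose cluster has at least min_cluster_size members.

    Assumption for illustration: a document that ends up in a tiny cluster
    is treated as noise and removed from the training set.
    """
    vectors = [tf_vector(d) for d in docs]
    assign = cluster(vectors, seeds)
    sizes = Counter(assign)
    return [d for d, a in zip(docs, assign) if sizes[a] >= min_cluster_size]


docs = [
    "stock market prices rise",
    "market prices fall on stock news",
    "stock traders watch market",
    "buy cheap watches online now",
]
kept = filter_noisy(docs, seeds=[0, 3])
# kept holds the three finance documents; the advertisement is dropped.
```

In this toy run the advertisement shares no terms with the other documents, forms a cluster of size one, and is filtered out before any classifier would be trained. A real experiment would need a proper term weighting scheme and a removal criterion justified by the data.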