Title:
|
DIMENSIONALITY REDUCTION WITH PARALLEL CORPORA |
Author(s):
|
Yingju Xia, Haoyu |
ISBN:
|
978-972-8924-40-9 |
Editors:
|
Jörg Roth, Jairo Gutiérrez and Ajith P. Abraham (series editors: Piet Kommers, Pedro Isaías and Nian-Shing Chen) |
Year:
|
2007 |
Edition:
|
Single |
Keywords:
|
document representation, dimensionality reduction, parallel corpora |
Type:
|
Short Paper |
First Page:
|
113 |
Last Page:
|
118 |
Language:
|
English |
Paper Abstract:
|
Query and document representation is a key problem in Web document text mining, text classification, and information
retrieval. The vector space model (VSM) is widely used in this domain, but it suffers from high
dimensionality: the vectors built from documents are high-dimensional and contain considerable noise. In this
paper, we present a novel method that reduces dimensionality using parallel corpora. We introduce a new metric,
called frequency distance, to measure translation consistency constraints. We first calculate the frequency distance in
parallel corpora. We then use the frequency distance to derive a wordlist, which is used for building vectors of
queries and documents. When this wordlist is applied to Web document text mining, text classification, or information
retrieval, the system builds vectors using only the words in the wordlist. As a result, the system needs no extra computation
for dimensionality reduction, which greatly improves its speed. The experimental results also show that this
dimensionality reduction method improves retrieval performance significantly. |
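The final step described in the abstract, building vectors from only the words in a fixed wordlist, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_vector`, the whitespace tokenization, the raw term-frequency weighting, and the sample wordlist are all assumptions made for illustration. The wordlist is taken as given, because the abstract does not define how frequency distance is computed or thresholded.

```python
def build_vector(text, wordlist):
    """Return a term-frequency vector restricted to the words in `wordlist`.

    Hypothetical helper: tokenization (lowercase + whitespace split) and
    raw counts are illustrative choices, not taken from the paper.
    """
    # Map each wordlist entry to a fixed vector position.
    index = {word: i for i, word in enumerate(wordlist)}
    vector = [0] * len(wordlist)
    for token in text.lower().split():
        position = index.get(token)
        if position is not None:  # words outside the wordlist are ignored
            vector[position] += 1
    return vector


# Hypothetical wordlist; in the paper it is derived from parallel corpora.
wordlist = ["retrieval", "vector", "corpus"]
doc = "Vector space retrieval uses a vector per document"
print(build_vector(doc, wordlist))  # [1, 2, 0]
```

Because the vector length is fixed by the wordlist, no separate reduction step runs when a query or document is processed, which matches the speed argument made in the abstract.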