Digital Library

Title:      DIMENSIONALITY REDUCTION WITH PARALLEL CORPORA
Author(s):      Yingju Xia, Haoyu
ISBN:      978-972-8924-40-9
Editors:      Jörg Roth, Jairo Gutiérrez and Ajith P. Abraham (series editors: Piet Kommers, Pedro Isaías and Nian-Shing Chen)
Year:      2007
Edition:      Single
Keywords:      document representation, dimensionality reduction, parallel corpora
Type:      Short Paper
First Page:      113
Last Page:      118
Language:      English
Paper Abstract:      Query and document representation is a key problem in Web document text mining, text classification and information retrieval. The vector space model (VSM) is widely used in this domain, but it suffers from high dimensionality: the vectors built from documents are high-dimensional and contain considerable noise. In this paper, we present a novel method that reduces dimensionality using parallel corpora. We introduce a new metric, called frequency distance, to measure translation consistency constraints. We first calculate the frequency distance over parallel corpora, then use it to derive a wordlist for building query and document vectors. When the wordlist is applied to Web document text mining, text classification or information retrieval, the system builds vectors using only the words in the wordlist. As a result, the system needs no extra computation for dimensionality reduction, which greatly improves its speed. The experimental results also show that this dimensionality reduction method significantly improves retrieval performance.
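The wordlist-restricted vector building described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define how frequency distance is computed, so the wordlist below is a hypothetical stand-in for one derived from parallel corpora, and the function name `build_vector`, the whitespace tokenization and the raw term-frequency weighting are all assumptions made for illustration.

```python
from collections import Counter


def build_vector(text, wordlist):
    """Return a term-frequency vector with one dimension per wordlist entry.

    Words outside the wordlist are ignored, so the vector length is fixed
    by the wordlist rather than by the full vocabulary.
    """
    # Assumption: simple lowercase whitespace tokenization.
    counts = Counter(text.lower().split())
    return [counts[word] for word in wordlist]


# Hypothetical wordlist; in the paper it would be derived from the
# frequency distance measured on parallel corpora.
wordlist = ["retrieval", "corpus", "vector"]

doc = "Vector space retrieval maps each document to a vector"
vec = build_vector(doc, wordlist)
print(vec)
```

Because the wordlist is fixed in advance, every query and document maps to a vector of the same small dimension, and no separate reduction step is applied after the vectors are built.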
   
