Digital Library

Title:      MEASURING EFFECTIVENESS OF TEXT-DECORATED HTML TAGS IN WEB DOCUMENT CLUSTERING
Author(s):      Mark P. Sinka, David W. Corne
ISBN:      972-99353-0-0
Editors:      Pedro Isaías and Nitya Karmakar
Year:      2004
Edition:      1
Keywords:      Web Document Analysis, HTML Text Decoration, Web Document Categorisation.
Type:      Full Paper
First Page:      707
Last Page:      714
Language:      English
Paper Abstract:      Web document analysis, and its associated research, underpins much of what is referred to as web intelligence and the envisaged ‘semantic web’. A key issue in this field is how to encode a web document from the raft of potential document “features” without losing salient information. Current research almost always uses word-based feature vectors such as term frequency of specific words (TF) and/or variants such as normalised term frequency and TF*IDF. We explore the question of whether existing word-based term vectors can be usefully augmented by using text-decorated words delimited by the “ ” HTML tag. We measure the effectiveness of a feature vector by encoding documents from a benchmark set in terms of this feature vector, and then measuring the accuracy of an unsupervised clustering task using this encoding. A thorough investigation is performed, using a variety of parameter values, to explore whether any increase in accuracy is achieved over vectors constructed just from the plain document text. Tests on the BankSearch dataset showed 9 different parameter combinations (using the text-decorating tag words) that had improved accuracy over the vectors obtained via the plain document text.
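
The idea of augmenting a plain term-frequency vector with words found inside a text-decorating tag can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tag name is not legible in this record, so it is passed in as a parameter, and the function name augmented_tf, the weight parameter, and the additive weighting scheme are all assumptions made for illustration only.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TagWordCollector(HTMLParser):
    """Count words in all text, and separately in text inside one chosen tag.

    Simplification: script and style contents are not filtered out.
    """

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()
        self.depth = 0                # nesting depth inside the chosen tag
        self.plain = Counter()        # word counts over all text
        self.decorated = Counter()    # word counts over text inside the tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        words = re.findall(r"[a-z]+", data.lower())
        self.plain.update(words)
        if self.depth > 0:
            self.decorated.update(words)


def augmented_tf(html, tag, weight=2.0):
    """Normalised TF in which each decorated occurrence counts `weight` times.

    The weighting scheme is hypothetical; weight=1.0 gives plain normalised TF.
    """
    parser = TagWordCollector(tag)
    parser.feed(html)
    parser.close()
    total = sum(parser.plain.values())
    if total == 0:
        return {}
    return {
        word: (count + (weight - 1.0) * parser.decorated[word]) / total
        for word, count in parser.plain.items()
    }
```

For example, augmented_tf("<p>web <b>mining</b> web</p>", "b", weight=3.0) raises the score of "mining" relative to the plain vector while leaving "web" unchanged; here "b" is only a stand-in tag. The resulting vectors could then be passed to any unsupervised clustering routine and scored against known category labels, as the abstract describes.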

   
