Title:
|
MEASURING EFFECTIVENESS OF TEXT-DECORATED HTML TAGS IN WEB DOCUMENT CLUSTERING |
Author(s):
|
Mark P. Sinka, David W. Corne |
ISBN:
|
972-99353-0-0 |
Editors:
|
Pedro Isaías and Nitya Karmakar |
Year:
|
2004 |
Edition:
|
1 |
Keywords:
|
Web Document Analysis, HTML Text Decoration, Web Document Categorisation. |
Type:
|
Full Paper |
First Page:
|
707 |
Last Page:
|
714 |
Language:
|
English |
Paper Abstract:
|
Web document analysis, and its associated research, underpins much of what is referred to as web intelligence and the envisaged semantic web. A key issue in this field is how to encode a web document from the raft of potential document features without losing salient information. Current research almost always uses word-based feature vectors such as the term frequency of specific words (TF) and/or variants such as normalised term frequency and TF*IDF. We explore the question of whether existing word-based term vectors can be usefully augmented by using text-decorated words delimited by the HTML tag. We measure the effectiveness of a feature vector by encoding documents from a benchmark set in terms of this feature vector, and then measuring the accuracy of an unsupervised clustering task using this encoding. A thorough investigation is performed, using a variety of parameter values, to explore whether any increase in accuracy is achieved over vectors constructed just from the plain document text. Tests on the BankSearch dataset showed 9 different parameter combinations (using the text-decorating tag words) that gave improved accuracy over the vectors obtained from the plain document text. |
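As an illustration of the kind of encoding the abstract describes, the following is a minimal sketch of a term-frequency vector augmented with words found inside a text-decorating tag. It is not the authors' method: the record does not name the tag or the weighting scheme, so the `tag` and `weight` parameters, the additive weighting, and the names `DecoratedTextParser` and `augmented_tf` are all assumptions made for this example.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class DecoratedTextParser(HTMLParser):
    """Counts all words in the text, and separately the words inside one chosen tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()       # hypothetical: the record does not name the tag
        self.depth = 0               # nesting depth inside the chosen tag
        self.plain = Counter()       # term frequency over all text
        self.decorated = Counter()   # term frequency over decorated text only

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        words = re.findall(r"[a-z]+", data.lower())
        self.plain.update(words)
        if self.depth > 0:
            self.decorated.update(words)


def augmented_tf(html, tag="b", weight=2.0):
    """Return a plain TF vector with an extra `weight` per decorated occurrence.

    The additive weighting is an assumption for illustration only.
    """
    parser = DecoratedTextParser(tag)
    parser.feed(html)
    parser.close()
    vector = {word: float(count) for word, count in parser.plain.items()}
    for word, count in parser.decorated.items():
        vector[word] += weight * count
    return vector


example = augmented_tf("<p>cheap <b>loans</b> and loans</p>", tag="b", weight=2.0)
```

Setting `weight=0.0` gives the plain-text baseline vector, so the same function can produce both encodings for comparison. Vectors like these could then be passed to any unsupervised clustering routine and scored against the benchmark's known categories; varying `weight` corresponds loosely to the parameter exploration mentioned in the abstract.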