Title:
|
DISCOVERING DATA DEPENDENCIES IN WEB CONTENT MINING |
Author(s):
|
José C. Cortizo, Ignacio Giráldez |
ISBN:
|
972-99353-0-0 |
Editors:
|
Pedro Isaías and Nitya Karmakar |
Year:
|
2004 |
Edition:
|
2 |
Keywords:
|
Web Content Mining; Linear Regression; Bayesian Classifier; Attribute Dependencies. |
Type:
|
Short Paper |
First Page:
|
881 |
Last Page:
|
884 |
Language:
|
English |
Paper Abstract:
|
Web content mining makes it possible to use the data presented in web pages to discover interesting and useful patterns. Our web mining tool, FBL (Filtered Bayesian Learning), works in two stages: first it analyzes the data present in a web page, and then, using information about the data dependencies it has found, it performs the mining phase based on Bayesian learning. The Naïve Bayes classifier assumes that the attribute values are conditionally independent given the class. As a result, it performs very well in some data domains but poorly when attributes are dependent. In this paper, we try to identify those dependencies using linear regression on the attribute values, and then eliminate the attributes that are a linear combination of one or two others. We have tested this system on six web domains (extracting the data by parsing the HTML), where we have added a synthetic attribute that is a linear combination of two of the original ones. The system detects those synthetic attributes perfectly, as well as some naturally dependent attributes, yielding a more accurate classifier. |
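The filtering stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' FBL implementation: the function names, the R² threshold of 0.999, and the back-to-front scan order are all assumptions made for the illustration. Each attribute is regressed (by ordinary least squares with an intercept) on every single attribute and every pair of the attributes still retained, and it is dropped when the fit is almost exact.

```python
from itertools import combinations

import numpy as np


def r_squared(y, predictors):
    """R^2 of an ordinary least-squares fit of y on the predictors plus an intercept."""
    ss_tot = np.sum((y - y.mean()) ** 2)
    if ss_tot == 0.0:
        # Assumption for this sketch: constant attributes are left alone.
        return 0.0
    A = np.column_stack([np.ones(len(y)), predictors])
    coef = np.linalg.lstsq(A, y, rcond=None)[0]
    ss_res = np.sum((y - A @ coef) ** 2)
    return 1.0 - ss_res / ss_tot


def find_redundant_attributes(X, threshold=0.999):
    """Return sorted indices of columns that are (nearly) a linear
    combination of one or two of the columns still retained.

    The threshold and the scan order are hypothetical choices,
    not values taken from the paper.
    """
    remaining = list(range(X.shape[1]))
    dropped = []
    # Scan from the last column so that appended attributes are tested first.
    for j in reversed(range(X.shape[1])):
        others = [k for k in remaining if k != j]
        subsets = list(combinations(others, 1)) + list(combinations(others, 2))
        if any(r_squared(X[:, j], X[:, list(s)]) >= threshold for s in subsets):
            remaining.remove(j)  # a dropped column is never used as a predictor
            dropped.append(j)
    return sorted(dropped)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = rng.normal(size=(200, 3))
    # Synthetic attribute: a linear combination of two original attributes.
    synthetic = 2.0 * original[:, 0] - 3.0 * original[:, 1]
    data = np.column_stack([original, synthetic])
    print(find_redundant_attributes(data))
```

Removing a dropped column from the predictor pool matters because a linear dependency is symmetric: if one attribute is a combination of two others, each of those two can also be written in terms of the remaining pair, and a naive scan would discard all three. The columns that survive the filter would then be passed to the Bayesian learner.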