Digital Library

Title:      DISCOVERING DATA DEPENDENCIES IN WEB CONTENT MINING
Author(s):      José C. Cortizo, Ignacio Giráldez
ISBN:      972-99353-0-0
Editors:      Pedro Isaías and Nitya Karmakar
Year:      2004
Edition:      2
Keywords:      Web Content Mining; Linear Regression; Bayesian Classifier; Attribute Dependencies.
Type:      Short Paper
First Page:      881
Last Page:      884
Language:      English
Paper Abstract:      Web content mining makes it possible to use the data presented in web pages to discover interesting and useful patterns. Our web mining tool, FBL (Filtered Bayesian Learning), works in two stages: first it analyzes the data present in a web page, and then, using information about the data dependencies it found, it performs the mining phase based on Bayesian learning. The Naïve Bayes classifier assumes that the attribute values are conditionally independent given the class. This assumption lets it perform very well in some data domains, but it performs poorly when attributes are dependent. In this paper, we try to identify those dependencies using linear regression on the attribute values, and then eliminate the attributes that are a linear combination of one or two others. We tested this system on six web domains (extracting the data by parsing the HTML), to each of which we added a synthetic attribute that is a linear combination of two of the original ones. The system detects those synthetic attributes perfectly, as well as some “natural” dependent attributes, and obtains a more accurate classifier.
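The filtering stage described in the abstract could be sketched roughly as follows. This is an illustrative reconstruction only, not the FBL implementation: the function names (`r_squared`, `filter_dependent`), the R² threshold of 0.999, the left-to-right scan order, and the sample data are all assumptions made for this example. An attribute is dropped when a least-squares fit on one or two previously kept attributes explains it almost exactly.

```python
from itertools import combinations

import numpy as np


def r_squared(X, y):
    """R^2 of a least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    centered = y - y.mean()
    ss_tot = float(centered @ centered)
    if ss_tot == 0.0:
        return 1.0  # assumption: a constant column counts as fully explained
    return 1.0 - float(residual @ residual) / ss_tot


def filter_dependent(data, threshold=0.999):
    """Return (kept column indices, {dropped index: parent indices}).

    Hypothetical helper: scans columns left to right and drops a column if it
    is (nearly) a linear combination of one or two columns kept so far.
    """
    data = np.asarray(data, dtype=float)
    kept, dropped = [], {}
    for j in range(data.shape[1]):
        parents = None
        for size in (1, 2):
            for combo in combinations(kept, size):
                if r_squared(data[:, list(combo)], data[:, j]) >= threshold:
                    parents = combo
                    break
            if parents is not None:
                break
        if parents is None:
            kept.append(j)
        else:
            dropped[j] = parents
    return kept, dropped


# Made-up data: three independent attributes plus one synthetic attribute
# that is a linear combination of the first and third.
rng = np.random.default_rng(0)
a, b, c = (rng.normal(size=200) for _ in range(3))
synthetic = 2.0 * a - 3.0 * c + 1.0
kept, dropped = filter_dependent(np.column_stack([a, b, c, synthetic]))
print(kept, dropped)  # [0, 1, 2] {3: (0, 2)}
```

The surviving columns in `kept` would then be passed to a Naïve Bayes learner. Scanning left to right is one simple way to avoid discarding every member of a dependent group, since the same linear relation can be rewritten with any of its attributes as the target.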
   
