DISCOVERING DATA DEPENDENCIES IN WEB CONTENT MINING

José C. Cortizo; Ignacio Giráldez

IADIS International Conference WWW/Internet - ICWI

IADIS International Conference WWW/Internet 2004

Document Info

Title:	DISCOVERING DATA DEPENDENCIES IN WEB CONTENT MINING
Author(s):	José C. Cortizo, Ignacio Giráldez
ISBN:	972-99353-0-0
Editors:	Pedro Isaías and Nitya Karmakar
Year:	2004
Edition:	2
Keywords:	Web Content Mining; Linear Regression; Bayesian Classifier; Attribute Dependencies.
Type:	Short Paper
First Page:	881
Last Page:	884
Language:	English
Paper Abstract:	Web content mining makes it possible to use data presented in web pages to discover interesting and useful patterns. Our web mining tool, FBL (Filtered Bayesian Learning), works in two stages: it first analyzes the data present in a web page and then, using information about the data dependencies it found, performs the mining phase based on Bayesian learning. The Naïve Bayes classifier assumes that attribute values are conditionally independent given the class. As a result, it performs very well in some data domains but poorly when attributes are dependent. In this paper, we try to identify those dependencies using linear regression on the attribute values, and then eliminate the attributes that are a linear combination of one or two others. We tested the system on six web domains (extracting the data by parsing the HTML), to each of which we added a synthetic attribute that is a linear combination of two of the original ones. The system detects all of those synthetic attributes, as well as some naturally dependent attributes, and yields a more accurate classifier.
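
The filtering stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' FBL implementation: the function names `r_squared` and `find_linear_dependents`, the R² threshold of 0.999, and the choice to scan from the last column backward are all assumptions made for illustration. The sketch regresses each attribute on every single other attribute and every pair of other attributes (ordinary least squares with an intercept) and flags the attribute for removal when the fit is nearly exact.

```python
import itertools
import numpy as np


def r_squared(predictors, target):
    """Coefficient of determination of a least-squares fit with an intercept."""
    A = np.column_stack([predictors, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    ss_res = np.sum((target - A @ coef) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    # A constant attribute has no variance to explain; treat it as not dependent.
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0


def find_linear_dependents(X, threshold=0.999):
    """Return sorted indices of columns that are (nearly) a linear
    combination of one or two other columns.

    The threshold is a hypothetical value, not one taken from the paper.
    Columns are scanned from last to first (an assumption), so that in a
    dependent group the later-added attribute is the one dropped.
    """
    dropped = []
    n_attrs = X.shape[1]
    for target in reversed(range(n_attrs)):
        # Never use an attribute that has already been dropped as a predictor.
        others = [j for j in range(n_attrs) if j != target and j not in dropped]
        combos = itertools.chain(
            itertools.combinations(others, 1),
            itertools.combinations(others, 2),
        )
        if any(r_squared(X[:, list(c)], X[:, target]) >= threshold for c in combos):
            dropped.append(target)
    return sorted(dropped)


# Illustrative data only: three independent attributes plus one synthetic
# attribute built as a linear combination of the first two.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
synthetic = 2.0 * base[:, 0] - 3.0 * base[:, 1]
X = np.column_stack([base, synthetic])

to_drop = find_linear_dependents(X)
X_filtered = np.delete(X, to_drop, axis=1)  # attributes passed on to the classifier
print(to_drop, X_filtered.shape)
```

The filtered matrix would then be handed to a Naïve Bayes learner. Which member of a dependent group is dropped is a design choice; any one of the attributes involved could be removed to break the dependency.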
