Digital Library

Title:      ISOLATING CONTENT AND METADATA FROM WEBLOGS USING CLASSIFICATION AND RULE-BASED APPROACHES
Author(s):      Eric J. Marshall, Eric B. Bell
ISBN:      978-972-8939-40-3
Editors:      Piet Kommers, Nik Bessis and Pedro Isaías
Year:      2011
Edition:      Single
Keywords:      Weblog, Extraction, Classification
Type:      Short Paper
First Page:      187
Last Page:      191
Language:      English
Paper Abstract:      The emergence and increasing prevalence of social media, such as internet forums, weblogs (blogs), and wikis, have created a new opportunity to measure public opinion, attitude, and social structures. A major challenge in leveraging this information is isolating the content and metadata in weblogs, as there is no standard, universally supported, machine-readable format for presenting this information. We present two algorithms for isolating this information. The first uses web block classification, where each node in the Document Object Model (DOM) for a page is classified according to one of several pre-defined attributes from a common blog schema. The second uses a set of heuristics to select web blocks. These algorithms perform at a level suitable for initial use, validating this approach for isolating content and metadata from blogs. The resultant data serves as a starting point for analytical work on the content and substance of collections of weblog pages.
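
Illustrative sketch (not from the paper): the abstract does not list the authors' heuristics or their blog schema, so the following is only a minimal guess at what a rule-based selection of web blocks could look like. The field names (title, date, content), the date pattern, the 80-character cutoff, and the helper names BlockCollector and isolate are all assumptions made for illustration. It uses only Python's standard html.parser module and covers only the second, heuristic approach, not the classifier.

```python
import re
from html.parser import HTMLParser

# Assumed date format ("March 3, 2011"); real blogs vary widely.
DATE_RE = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2}, \d{4}\b"
)
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
HEADING_TAGS = {"h1", "h2", "h3"}
BODY_TAGS = {"p", "div", "article"}


class BlockCollector(HTMLParser):
    """Collect (tag, text) pairs, attributing text to its nearest open element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open elements as [tag, text_parts]
        self.blocks = []  # closed elements that held text, as (tag, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append([tag, []])

    def handle_endtag(self, tag):
        # Ignore stray closing tags that match nothing currently open.
        if tag not in [open_tag for open_tag, _ in self.stack]:
            return
        # Pop up to the matching element, tolerating unclosed children.
        while self.stack:
            open_tag, parts = self.stack.pop()
            text = " ".join(" ".join(parts).split())
            if text:
                self.blocks.append((open_tag, text))
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.stack and data.strip():
            self.stack[-1][1].append(data)


def isolate(html):
    """Apply three hypothetical rules to pick title, date, and content blocks."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    result = {"title": None, "date": None, "content": None}
    for tag, text in parser.blocks:
        match = DATE_RE.search(text)
        if result["title"] is None and tag in HEADING_TAGS:
            # Rule 1: the first heading block is taken as the post title.
            result["title"] = text
        elif result["date"] is None and match and len(text) < 80:
            # Rule 2: the first short block containing a date is the post date.
            result["date"] = match.group(0)
        elif tag in BODY_TAGS and len(text) > len(result["content"] or ""):
            # Rule 3: the longest body-like block is taken as the content.
            result["content"] = text
    return result
```

Calling isolate() on a page's HTML string returns a dictionary with the three assumed fields, each set to None when no block satisfies its rule. A classification-based variant would replace the fixed rules with a trained model that scores each block against the schema attributes.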
   
