Title:
|
ISOLATING CONTENT AND METADATA FROM WEBLOGS USING CLASSIFICATION AND RULE-BASED APPROACHES |
Author(s):
|
Eric J. Marshall, Eric B. Bell |
ISBN:
|
978-972-8939-40-3 |
Editors:
|
Piet Kommers, Nik Bessis and Pedro Isaías |
Year:
|
2011 |
Edition:
|
Single |
Keywords:
|
Weblog, Extraction, Classification |
Type:
|
Short Paper |
First Page:
|
187 |
Last Page:
|
191 |
Language:
|
English |
Paper Abstract:
|
The emergence and increasing prevalence of social media, such as internet forums, weblogs (blogs), and wikis, have created a new opportunity to measure public opinion, attitudes, and social structures. A major challenge in leveraging this information is isolating the content and metadata in weblogs, because there is no standard, universally supported, machine-readable format for presenting this information. We present two algorithms for isolating this information. The first uses web block classification, in which each node in the Document Object Model (DOM) of a page is classified as one of several pre-defined attributes from a common blog schema. The second uses a set of heuristics to select web blocks. Both algorithms perform at a level suitable for initial use, which validates this approach to isolating content and metadata from blogs. The resulting data serves as a starting point for analytical work on the content and substance of collections of weblog pages. |
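The rule-based idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual heuristics or blog schema, so everything here is an assumption made for illustration: the labels (`title`, `author`, `content`), the thresholds, the tag list, and the names `BlockCollector`, `label_block`, and `extract`. It uses only Python's standard `html.parser` module and treats each block-level element as one web block.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit "web blocks"; the paper may segment differently.
BLOCK_TAGS = {"div", "p", "h1", "h2", "h3", "span", "article", "section"}


class BlockCollector(HTMLParser):
    """Collects one record per block element: tag, class, own text, link text size."""

    def __init__(self):
        super().__init__()
        self.stack = []    # blocks that are currently open
        self.blocks = []   # blocks that have been closed
        self.in_link = 0   # depth of open <a> elements

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1
        if tag in BLOCK_TAGS:
            cls = dict(attrs).get("class") or ""
            self.stack.append({"tag": tag, "cls": cls, "text": [], "link_chars": 0})

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        if tag in BLOCK_TAGS and self.stack and self.stack[-1]["tag"] == tag:
            block = self.stack.pop()
            block["text"] = " ".join(block["text"])
            self.blocks.append(block)

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        # Text is credited to the innermost open block only.
        self.stack[-1]["text"].append(text)
        if self.in_link:
            self.stack[-1]["link_chars"] += len(text)


def label_block(block):
    """Hypothetical rules mapping a block to a schema attribute."""
    text, cls = block["text"], block["cls"].lower()
    if not text:
        return "other"
    if block["tag"] in ("h1", "h2"):
        return "title"
    if "author" in cls or text.lower().startswith("posted by"):
        return "author"
    link_density = block["link_chars"] / len(text)
    # Assumed thresholds: long text with few links is treated as post content.
    if len(text) >= 80 and link_density < 0.3:
        return "content"
    return "other"


def extract(html):
    """Returns the first block found for each label, keyed by label."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    result = {}
    for block in parser.blocks:
        label = label_block(block)
        if label != "other":
            result.setdefault(label, block["text"])
    return result
```

Calling `extract(page_html)` on a blog page returns a dictionary with whichever of the assumed labels were found. The classification approach described in the abstract would replace `label_block` with a trained classifier over per-block features; that variant is not sketched here because the abstract does not specify the features or the model.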