Title: AUTOMATED BLOG CLASSIFICATION: A CROSS DOMAIN APPROACH
Author(s): Elisabeth Lex, Christin Seifert, Michael Granitzer, Andreas Juffinger
ISBN: 978-972-8924-93-5
Editors: Pedro Isaías, Bebo White and Miguel Baptista Nunes
Year: 2009
Edition: 1
Keywords: Blogs, Classification, Cross-Domain
Type: Full Paper
First Page: 598
Last Page: 605
Language: English
Paper Abstract:
The automated classification of blogs is highly important for the relatively new field of blog analysis. To classify blogs
into topics or other categories, supervised text classification algorithms are usually applied. However, supervised text
classifiers need a sufficiently large amount of labeled data to learn a good model. Especially for blogs, data labeled with
terms that capture current topics is not available, and data labeled in the past is usually not applicable due to
topic drift. Besides, tagged blogs collected from the web exhibit a vocabulary that is rather heterogeneous, diverse and
not commonly agreed upon. In our work, we focus on news-related blogs dealing with current events. Our goal is to
classify blog posts into given, common newspaper categories. As a baseline, we have high-quality labeled data from a
German news corpus. Our approach is to exploit the labeled data from the news corpus and use this knowledge to
perform cross-domain classification on the unlabeled blogs. We need a solution with high performance, because both our
corpora are dynamic and our classifier model needs to be up-to-date. In this work, we evaluated a number of text
classification algorithms with different parameter settings in terms of accuracy and complexity. Qualitative and
quantitative analysis revealed that a recently proposed centroid-based algorithm, the Class-Feature-Centroid classifier
(CFC), serves best for our setting because it achieves accuracy comparable to that of state-of-the-art text classifiers and
outperforms all other algorithms regarding complexity and memory consumption.