A CLASSIFICATION-BASED APPROACH FOR BIBLIOGRAPHIC METADATA DEDUPLICATION

Eduardo N. Borges; Karin Becker; Carlos A. Heuser; Renata Galante

IADIS International Conference WWW/Internet - ICWI

IADIS International Conference WWW/Internet 2011

Document Info

Title:	A CLASSIFICATION-BASED APPROACH FOR BIBLIOGRAPHIC METADATA DEDUPLICATION
Author(s):	Eduardo N. Borges, Karin Becker, Carlos A. Heuser, Renata Galante
ISBN:	978-989-8533-01-2
Editors:	Bebo White, Pedro Isaías and Flávia Maria Santoro
Year:	2011
Edition:	Single
Keywords:	Deduplication, bibliographic metadata, classification, machine learning
Type:	Full Paper
First Page:	221
Last Page:	228
Language:	English
Paper Abstract:	Digital libraries of scientific articles describe them using a set of metadata, including bibliographic references. These references can be represented in several formats and styles. Considerable content variations can occur in some metadata fields, such as title, author names and publication venue. Besides, it is quite common to find references that omit some metadata fields, such as page numbers. Duplicate entries affect the quality of digital library services, so they need to be appropriately identified and treated. This paper presents a comparative analysis of different data classification algorithms used to identify duplicated bibliographic metadata records. We have investigated the discovered patterns by comparing the rules and the decision tree with the heuristics adopted in a previous work. Our experiments show that combining the previously proposed specific-purpose similarity functions with classification algorithms yields an improvement of up to 12% over the experiments using our original approach.
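
The general idea in the abstract, scoring a pair of records field by field and letting a classifier decide whether the pair is a duplicate, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity functions or the classifiers, so the sketch assumes a generic string similarity from Python's `difflib` as a stand-in and a one-level decision tree (a decision stump) as the simplest possible classifier. The field names, helper functions, and sample records are hypothetical and made up for illustration.

```python
from difflib import SequenceMatcher

FIELDS = ("title", "authors", "venue")  # assumed field names, not from the paper


def similarity(a, b):
    """Generic string similarity in [0, 1]; a stand-in for the paper's
    specific-purpose similarity functions, which are not given here."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def features(r1, r2):
    """One similarity score per metadata field for a pair of records."""
    return [similarity(r1[f], r2[f]) for f in FIELDS]


def train_stump(X, y):
    """Learn a one-level decision tree: choose the (feature, threshold)
    pair with the fewest errors on the labeled training pairs."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            errors = sum((row[j] >= t) != label for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, j, t)
    return best[1], best[2]


def is_duplicate(stump, r1, r2):
    """Classify a pair as duplicate if the chosen feature reaches the threshold."""
    j, t = stump
    return features(r1, r2)[j] >= t


# Made-up records: a and b describe the same paper in different styles.
a = {"title": "A Classification-Based Approach for Bibliographic Metadata Deduplication",
     "authors": "E. N. Borges, K. Becker, C. A. Heuser, R. Galante",
     "venue": "ICWI 2011"}
b = {"title": "A classification based approach for bibliographic metadata deduplication.",
     "authors": "Borges, Eduardo N.; Becker, Karin; Heuser, Carlos A.; Galante, Renata",
     "venue": "IADIS International Conference WWW/Internet"}
c = {"title": "Indexing methods for spatial data",
     "authors": "J. Doe",
     "venue": "Hypothetical Workshop"}

pairs = [(a, b), (a, c), (b, c)]
labels = [True, False, False]  # hand-labeled: only (a, b) is a duplicate pair
stump = train_stump([features(p, q) for p, q in pairs], labels)
```

With these three labeled pairs the stump splits on title similarity. A realistic setup would use many labeled pairs, field-specific similarity functions, and a full rule or decision tree learner, which is the kind of comparison the paper reports.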
