Title:
|
SPELLING BASED RANKED CLUSTERING ALGORITHM
TO CLEAN AND NORMALIZE EARLY MODERN
EUROPEAN BOOK TITLES |
Author(s):
|
Evan Bryer, Theppatorn Rhujittawiwat, John R. Rose and Colin F. Wilder |
ISBN:
|
978-989-8704-32-0 |
Editors:
|
Yingcai Xiao, Ajith Abraham and Guo Chao Peng |
Year:
|
2021 |
Edition:
|
Single |
Keywords:
|
Pre-Processing, Clustering, Cleaning, Data Mining, Spellchecking |
Type:
|
Full |
First Page:
|
105 |
Last Page:
|
115 |
Language:
|
English |
Paper Abstract:
|
The goal of this paper is to modify an existing clustering algorithm, using the Hunspell spell checker, to
specialize it for cleaning early modern European book title data. Duplicate and corrupted data are a constant
concern for data analysis, and clustering has been identified as a robust tool for normalizing and cleaning data such as
ours. In particular, our data comprises over 5 million books published in European languages between 1500 and 1800 in
the Machine-Readable Cataloging (MARC) data format from 17,983 libraries in 123 countries. However, as each library
catalogued its records individually, many duplicate and inaccurate records exist in the data set. Additionally, each
language evolved over the 300-year period we are studying, and as such the spellings of many words changed.
Without cleaning and normalizing this data, it would be difficult to find coherent trends, because a query may miss
much of the relevant data. In previous research, we found that Prediction by Partial Matching provides the
largest increase in baseline accuracy when applied to dirty data similar in structure to our data set. However, there are many
cases in which the correct book title may not be the most common, either because a cluster contains only two values or
because the dirty title appears in more records. In these cases, a language-agnostic clustering algorithm would normalize
to the incorrect title and lower the overall accuracy of the data set. By integrating the Hunspell spell checker into the
clustering algorithm and using it to rank clusters by the number of words not found in its dictionary, we can drastically
reduce how often this occurs. Indeed, this ranking algorithm increased the overall accuracy of the clustered data by as
much as 25% over the unmodified Prediction by Partial Matching algorithm. |
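The ranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the small `KNOWN_WORDS` set is a hypothetical stand-in for a Hunspell dictionary lookup, the function names are invented for illustration, and the titles are made-up examples rather than records from the data set. The sketch picks the variant in a cluster with the fewest unknown words and uses frequency only to break ties.

```python
import re
from collections import Counter

# Hypothetical stand-in for a Hunspell dictionary; a real pipeline would
# ask the spell checker whether each word is known.
KNOWN_WORDS = {"a", "treatise", "of", "human", "nature"}


def unknown_word_count(title, known_words=KNOWN_WORDS):
    """Count the words in a title that are missing from the dictionary."""
    words = re.findall(r"[^\W\d_]+", title.lower())
    return sum(1 for word in words if word not in known_words)


def pick_canonical(cluster, known_words=KNOWN_WORDS):
    """Choose the variant with the fewest unknown words.

    Ties are broken in favour of the more frequent variant.
    """
    counts = Counter(cluster)
    return min(
        counts,
        key=lambda title: (unknown_word_count(title, known_words), -counts[title]),
    )


# Illustrative cluster in which the misspelled variant is the most common.
cluster = [
    "A treatise of humane nature",
    "A treatise of humane nature",
    "A treatise of human nature",
]
canonical = pick_canonical(cluster)
print(canonical)
```

A purely frequency-based choice would keep the more common "humane" variant here, which is the failure case the abstract describes; ranking by unknown words selects the less frequent but correctly spelled title instead.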