Title:
|
USING WELL DEFINED TOKENS IN SIMILARITY FUNCTION FOR RECORD MATCHING IN DATA CLEANING TECHNIQUES |
Author(s):
|
Rawshan Basha |
ISBN:
|
972-99353-6-X |
Editors:
|
Nuno Guimarães and Pedro Isaías |
Year:
|
2005 |
Edition:
|
2 |
Keywords:
|
Data Cleaning, Elimination of Duplicates, Data Linkage. |
Type:
|
Short Paper |
First Page:
|
190 |
Last Page:
|
194 |
Language:
|
English |
Paper Abstract:
|
The integration of information is an important area of research in databases. Duplicate elimination, the problem of detecting database records that are approximate (but not exact) duplicates describing the same real-world entity, is an important data cleaning problem. To ensure high data quality, a data warehouse must cleanse its data by detecting and eliminating redundant records. During the data cleaning process, records that differ syntactically but are semantically equivalent are identified using data mining techniques. This paper uses a similarity function over well-defined tokens for record matching in a domain-independent algorithm that detects and removes duplicate records, making real data ready for mining techniques. Existing data cleaning techniques rely heavily on full or partial domain knowledge. |
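To make the idea of token-based record matching concrete, here is a minimal sketch. It is not the paper's algorithm: the abstract does not specify how tokens are formed or which similarity function is used, so this sketch assumes plain alphanumeric tokens, Jaccard similarity, and a hypothetical threshold of 0.6. The names `tokens`, `jaccard`, and `find_duplicates` are illustrative only.

```python
import re


def tokens(record):
    """Turn a record (tuple of fields) into a set of lower-cased tokens.

    Assumption: tokens are alphanumeric runs; the paper's "well-defined
    tokens" may be constructed differently.
    """
    text = " ".join(str(field) for field in record)
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t}


def jaccard(a, b):
    """Jaccard similarity of two records' token sets (assumed measure)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty records are treated as identical
    return len(ta & tb) / len(ta | tb)


def find_duplicates(records, threshold=0.6):
    """Return index pairs whose similarity meets the hypothetical threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Made-up sample data: the first two rows differ syntactically only.
records = [
    ("John A. Smith", "12 Oak Street", "Boston"),
    ("Smith, John A", "12 Oak Street", "Boston"),
    ("Mary Jones", "4 Elm Road", "Denver"),
]
print(find_duplicates(records))
```

Because the comparison uses only token sets and no field-specific rules, the sketch needs no domain knowledge, which is the property the abstract emphasizes. The pairwise loop is quadratic in the number of records and is meant only to illustrate the idea.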