Title:
|
BIASED SAMPLING TECHNIQUES FOR CLUSTERING DATA OF DIFFERENT SIZES AND DENSITIES |
Author(s):
|
Kittisak Kerdprasop, Nittaya Kerdprasop |
ISBN:
|
978-972-8939-23-6 |
Editors:
|
António Palma dos Reis and Ajith P. Abraham |
Year:
|
2010 |
Edition:
|
Single |
Keywords:
|
Density biased sampling, Reservoir sampling, Zipf distribution, Sliding window algorithm, Data clustering, Erlang. |
Type:
|
Full Paper |
First Page:
|
51 |
Last Page:
|
58 |
Language:
|
English |
Cover:
|
|
Full Contents:
|
click to dowload
|
Paper Abstract:
|
Clustering is a task of partitioning data into subgroups based on similarity or closest distance. A standard k-means algorithm groups data by firstly assigning all data points to the closest clusters, then determining the new cluster mean of each cluster based on the average value of its members. The algorithm repeats these two steps until it converges; that is until there is no change in cluster means and cluster assignment among data points. Performance of the algorithm depends on the number of data points, number of data clusters, and number of iterations. To speed up the clustering process, we investigate the density-biased reservoir sampling algorithms and then devise an efficient data reduction technique. Instead of simply randomly selecting data for clustering, we evaluate data density using a sliding window model and selectively draw samples with rejection criteria from the dense area. The proposed algorithm has been implemented with a functional programming paradigm using the Erlang language. The declarative style of Erlang facilitates a rapid prototyping and a concise coding, as shown in the paper. Our experimental results reveal the effectiveness of drawing samples of high density from the Zipf distributed data. The shift of cluster means is minimal, whereas the decrease in memory usage is significant. |
|
|
|
|