Title: AN ADAPTIVE SAMPLING ALGORITHM TO IMPROVE THE PERFORMANCE OF CLASSIFICATION MODELS
Author(s): Soroosh Ghorbani, Michel C. Desmarais
ISBN: 978-989-8704-10-8
Editors: Ajith P. Abraham, Antonio Palma dos Reis and Jörg Roth
Year: 2014
Edition: Single
Keywords: Adaptive Sampling, Entropy, Classification, Prediction Performance
Type: Full Paper
First Page: 21
Last Page: 28
Language: English
Paper Abstract:
Given a fixed number of observations to train a model for a classification task, a Selective Sampling design helps decide how to allocate more or fewer observations among the variables during the data-gathering phase, such that some variables will have a greater ratio of missing values than others. Previous work has shown that selective sampling based on feature entropy can improve the performance of some classification models. We further explore this heuristic to guide the sampling process on the fly, a process we call Adaptive Sampling. We focus on three classification models, Naive Bayes (NB), Logistic Regression (LR), and Tree Augmented Naive Bayes (TAN), train them on binary-attribute datasets, and use a 0/1 loss function to assess their respective performance. We define three sampling schemes: (1) uniform (random samples) as a baseline, (2) low entropy (greater sampling rate for low-entropy items), and (3) high entropy (greater sampling rate for high-entropy items). We then propose an Adaptive Sampling algorithm that uses a small seed dataset to extract the initial entropies and randomly samples feature observations according to each of the three schemes. The performance of each combination of scheme and model is assessed on 11 datasets. Results from 100-fold cross-validation show that Adaptive Sampling under scheme 3 improves the performance of the TAN model on all but one of the datasets, with an average RMSE reduction of 12-14%. For the Naive Bayes classifier, however, scheme 2 improves classification by 6-12% (with one dataset exception). Finally, for Logistic Regression, no clear pattern emerges.
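To make the three schemes concrete, the following is a minimal Python sketch of entropy-guided sampling as described in the abstract; it is not the authors' code. The function names, the seed fraction, and the multinomial allocation of the observation budget are assumptions made for illustration only.

```python
# Illustrative sketch of entropy-guided adaptive sampling (not the paper's code).
# Assumptions: X is a binary feature matrix (rows = observations), a small seed
# fraction is fully observed to estimate per-feature entropies, and a fixed
# budget of further feature observations is allocated across the features.
import numpy as np

def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli(p) feature."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def adaptive_sample(X, budget, scheme="high", seed_frac=0.1, rng=None):
    """Return a boolean mask of observed (row, feature) cells.

    scheme: "uniform" (baseline), "low" (favor low-entropy features),
            or "high" (favor high-entropy features).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    seed_n = max(1, int(seed_frac * n))
    # Estimate each feature's entropy from the small seed sample.
    H = binary_entropy(X[:seed_n].mean(axis=0))
    if scheme == "uniform":
        weights = np.ones(d)
    elif scheme == "high":
        weights = H
    else:  # "low": invert entropies so low-entropy features get more samples
        weights = H.max() - H + 1e-6
    probs = weights / weights.sum()
    mask = np.zeros((n, d), dtype=bool)
    mask[:seed_n, :] = True  # the seed rows are fully observed
    # Allocate the observation budget across features, then sample rows.
    alloc = rng.multinomial(budget, probs)
    for j, k in enumerate(alloc):
        rows = rng.choice(np.arange(seed_n, n),
                          size=min(int(k), n - seed_n), replace=False)
        mask[rows, j] = True
    return mask
```

Under this sketch, cells left False in the mask become the missing values that the NB, LR, or TAN model must be trained around; a fully adaptive variant would re-estimate the entropies as new observations arrive rather than only once from the seed.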
|
|
|
|