Title: AN ADAPTIVE SAMPLING ALGORITHM TO IMPROVE THE PERFORMANCE OF CLASSIFICATION MODELS
Author(s): Soroosh Ghorbani, Michel C. Desmarais
ISBN: 978-989-8704-10-8
Editors: Ajith P. Abraham, Antonio Palma dos Reis and Jörg Roth
Year: 2014
Edition: Single
Keywords: Adaptive Sampling, Entropy, Classification, Prediction Performance
Type: Full Paper
First Page: 21
Last Page: 28
Language: English
Paper Abstract:
Given a fixed number of observations to train a model for a classification task, a Selective Sampling design helps decide how to allocate more or fewer observations among the variables during the data-gathering phase, such that some variables will have a greater ratio of missing values than others. Previous work has shown that selective sampling based on feature entropy can improve the performance of some classification models. We further explore this heuristic to guide the sampling process on the fly, a process we call Adaptive Sampling. We focus on three classification models, Naive Bayes (NB), Logistic Regression (LR), and Tree Augmented Naive Bayes (TAN), train them on binary-attribute datasets, and use a 0/1 loss function to assess their respective performance. We define three sampling schemes: (1) uniform (random samples) as a baseline, (2) low entropy (greater sampling rate for low-entropy items), and (3) high entropy (greater sampling rate for high-entropy items). We then propose an Adaptive Sampling algorithm that uses a small seed dataset to extract the initial entropies and randomly samples feature observations according to each of the three schemes. The performance of each combination of scheme and model is assessed on 11 datasets. Results from 100-fold cross-validation show that Adaptive Sampling under scheme 3 improves the performance of the TAN model on all but one of the datasets, with an average RMSE reduction of 12-14%. For the Naive Bayes classifier, however, scheme 2 improves classification by 6-12% (with one dataset exception). Finally, for Logistic Regression, no clear pattern emerges.
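To make the three schemes concrete, the following is a minimal Python sketch of entropy-guided sampling as described in the abstract; it is not the authors' code. The function names, the seed fraction, and the multinomial allocation of the observation budget are assumptions made for illustration only.

```python
# Illustrative sketch of entropy-guided adaptive sampling (not the paper's code).
# Assumptions: X is a binary feature matrix (rows = observations), a small seed
# fraction is fully observed to estimate per-feature entropies, and a fixed
# budget of further feature observations is allocated across the features.
import numpy as np

def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli(p) feature."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def adaptive_sample(X, budget, scheme="high", seed_frac=0.1, rng=None):
    """Return a boolean mask of observed (row, feature) cells.

    scheme: "uniform" (baseline), "low" (favor low-entropy features),
            or "high" (favor high-entropy features).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    seed_n = max(1, int(seed_frac * n))
    # Estimate each feature's entropy from the small seed sample.
    H = binary_entropy(X[:seed_n].mean(axis=0))
    if scheme == "uniform":
        weights = np.ones(d)
    elif scheme == "high":
        weights = H
    else:  # "low": invert entropies so low-entropy features get more samples
        weights = H.max() - H + 1e-6
    probs = weights / weights.sum()
    mask = np.zeros((n, d), dtype=bool)
    mask[:seed_n, :] = True  # the seed rows are fully observed
    # Allocate the observation budget across features, then sample rows.
    alloc = rng.multinomial(budget, probs)
    for j, k in enumerate(alloc):
        rows = rng.choice(np.arange(seed_n, n),
                          size=min(int(k), n - seed_n), replace=False)
        mask[rows, j] = True
    return mask
```

Under this sketch, cells left False in the mask become the missing values that the NB, LR, or TAN model must be trained around; a fully adaptive variant would re-estimate the entropies as new observations arrive rather than only once from the seed.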
|
|
|
|