Active Learning is a Strong Baseline for Data Subset Selection
Abstract
Data subset selection refers to the process of finding a small subset of training data such that the predictive performance of a classifier trained on it is close to that of a classifier trained on the full training data. A variety of sophisticated algorithms have been proposed specifically for data subset selection. A closely related problem is the active learning problem developed for semi-supervised learning. The key step of active learning is to identify an important subset of unlabeled data by making use of the currently available labeled data. This paper starts with a simple observation — one can apply any off-the-shelf active learning algorithm in the context of data subset selection. The idea is very simple — we pick a small random subset of data and pretend as if this random subset is the only labeled data, and the rest is not labeled. By pretending so, one can simply apply any off-the-shelf active learning algorithm. After each step of sample selection, we can reveal the label of the selected samples (as if we label the chosen samples in the original active learning scenario) and continue running the algorithm until one reaches the desired subset size. We observe that surprisingly, this active learning-based algorithm outperforms all the current data subset selection algorithms on the benchmark tasks. We also perform a simple controlled experiment to understand better why this approach works well. As a result, we find that it is crucial to find a balance between easy-to-classify and hard-to-classify examples when selecting a subset.