Training modern deep neural networks is becoming increasingly expensive, and current models often require millions of datapoints. This project aims to find ways of reducing the necessary dataset size, facilitating faster training and cheaper data collection and labelling. It is related to the coreset problem, in which a small summary of a dataset is constructed; the summary is a coreset if a model trained on it achieves performance similar to that of a model trained on the much larger original dataset. It is also related to active learning, in which the learner chooses a small number of datapoints for labelling in order to get the most out of training.

The standard training procedure in deep learning uses a training set Dtrn to fit a model, a validation set Dval to compare different models, and a test set Dtest to evaluate the performance of the best model. The most time-consuming part is the training stage, which commonly uses around 70-80% of the data. We instead propose finding a small training set D∗trn. Our hypothesis is that training on D∗trn can achieve performance similar to training on Dtrn. This hypothesis is backed up by theoretical work in computational geometry as well as practical algorithms for several other machine learning methods [3, 1, 2].

We identify three approaches to obtaining D∗trn:
1. Find the subset D∗trn ⊆ Dtrn which minimises validation loss.
2. Augment a subset D∗trn ⊆ Dtrn.
3. Learn a function to generate the set D∗trn.

The resulting small training set D∗trn is then evaluated by training a model on it and evaluating that model on the test set. Evaluating the dataset using multiple architectures and learning rules would increase the robustness of this evaluation. Note that approaches 1 and 2 can be combined: first obtain D∗trn by 1, then augment it by 2.
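As a rough illustration of approach 1 and the evaluation protocol, the following is a minimal sketch (not the project's actual method) of greedy forward selection: a small subset of Dtrn is grown one point at a time, always adding the candidate that most reduces validation loss, and the resulting subset is scored on a held-out test set. The function name `select_coreset`, the use of logistic regression as the model, and the toy data are all illustrative assumptions.

```python
# Hypothetical sketch of approach 1: greedily grow D*_trn subset of D_trn
# by repeatedly adding the point that most reduces validation loss.
# Logistic regression stands in for the (much more expensive) neural network.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss


def select_coreset(X_trn, y_trn, X_val, y_val, size):
    classes = np.unique(y_trn)
    # Seed with one example per class so the model can always be fitted.
    chosen = [int(np.flatnonzero(y_trn == c)[0]) for c in classes]
    remaining = [i for i in range(len(X_trn)) if i not in chosen]
    while len(chosen) < size and remaining:
        best_i, best_loss = None, np.inf
        for i in remaining:
            idx = chosen + [i]
            model = LogisticRegression().fit(X_trn[idx], y_trn[idx])
            loss = log_loss(y_val, model.predict_proba(X_val), labels=classes)
            if loss < best_loss:
                best_i, best_loss = i, loss
        chosen.append(best_i)
        remaining.remove(best_i)
    return chosen


# Toy demonstration on synthetic data: select 10 of 150 training points,
# then evaluate a model trained only on that subset against the test split.
X, y = make_classification(n_samples=250, n_features=5, random_state=0)
X_trn, y_trn = X[:150], y[:150]
X_val, y_val = X[150:200], y[150:200]
X_test, y_test = X[200:], y[200:]

chosen = select_coreset(X_trn, y_trn, X_val, y_val, size=10)
small_model = LogisticRegression().fit(X_trn[chosen], y_trn[chosen])
acc = small_model.score(X_test, y_test)
```

For neural networks the inner retraining loop is far too expensive to run per candidate, which is what motivates the smarter selection, augmentation, and generation approaches above; the sketch only conveys the objective being optimised.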
Supervisors: Tim Hospedales &