In its simplest form, knowledge distillation consists of training a deep, cumbersome architecture (the teacher) and using its outputs as training labels for a shallower architecture (the student). Remarkably, these soft labels convey more useful information during training and yield better performance than training on the one-hot targets alone. While this approach helps condense a deep model into a more compact one for deployment, training the teacher remains expensive. My work bypasses the need to train a teacher: the goal is to obtain labels that are good for training, rather than labels that have high accuracy.
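As a minimal sketch of the standard distillation setup described above (not my method), the student can be trained against the teacher's temperature-softened output distribution instead of the one-hot labels. The function names, logit values, and temperature here are illustrative assumptions, written with NumPy:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T yields a softer distribution,
    # exposing the teacher's relative confidence across wrong classes.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Cross-entropy between the teacher's soft targets and the
    # student's temperature-scaled predictions.
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

# Hypothetical logits for a single 3-class example.
teacher = np.array([[4.0, 1.0, 0.5]])
student = np.array([[2.0, 1.5, 0.2]])
loss = distillation_loss(student, teacher, T=2.0)
```

In practice this loss is usually mixed with the ordinary cross-entropy on the true labels, but the soft-target term is the part that carries the extra "dark knowledge" from the teacher.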