This project is in the area of neural machine translation (MT), which uses deep learning to build systems that translate text or speech from one language to another. I will focus on infinite-context neural MT models, i.e. models based on neural networks (NNs) such as recurrent NNs, including LSTMs. Their advantage is that, in order to predict the next word of the translated sentence, they use a neural state vector that summarizes the full history (the part of the sentence already translated). As a result, they can capture long-distance dependencies between words.
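The role of the state vector can be illustrated with a minimal sketch. The toy decoder below (an assumption for illustration, not the project's actual model) updates a fixed-size state from each word of the prefix, so the state summarizes the whole history regardless of its length:

```python
import math
import random

class TinyRNNDecoder:
    """Toy one-layer RNN: state = tanh(W * state + embed(word))."""
    def __init__(self, vocab, dim=4, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Random embeddings and recurrence weights, just for illustration.
        self.embed = {w: [rng.uniform(-1, 1) for _ in range(dim)] for w in vocab}
        self.W = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

    def step(self, state, word):
        """Fold one more word of the prefix into the fixed-size state."""
        x = self.embed[word]
        new_state = []
        for i in range(self.dim):
            s = sum(self.W[i][j] * state[j] for j in range(self.dim)) + x[i]
            new_state.append(math.tanh(s))
        return new_state

dec = TinyRNNDecoder(["the", "cat", "sat"])
state = [0.0] * 4
for w in ["the", "cat", "sat"]:
    state = dec.step(state, w)
# `state` is a fixed-size summary of the entire translated prefix.
```

However long the prefix grows, the state stays the same size, which is what lets such models condition the next word on unbounded context.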
One of the main issues is that NNs are limited by their speed, both during training and in subsequent applications. For instance, they are too slow to translate the monolingual data needed for back-translation (i.e. creating synthetic parallel data), which has been shown to benefit the training of MT systems. There are 1.8 trillion words of English we would like to exploit, yet training the NN already takes weeks on a training set of a few billion words. We are also interested in building more sophisticated systems, for example combined left-to-right and right-to-left translation models.
A further issue is that using NNs during decoding is not trivial. If we want to consider a large number of possible translations, scoring each of them separately would slow down decoding considerably. Even if the resulting translation were of higher quality, a system that takes much longer to run is not very practical.
Classical statistical MT offers procedures, such as beam search and cube pruning, that allow us to consider a large number of possible translations without scoring every one of them. These rely on recombination: grouping multiple partial translations together and scoring them simultaneously. This causes no loss of accuracy, since the translations within one group are guaranteed to have equal future scores (a consequence of the models' finite context). With these MT models, one can therefore consider an exponential number of translations while spending time on scoring only a linear number of them.
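Recombination in a finite-context model can be sketched in a few lines. Since future scores depend only on the last n-1 words, hypotheses sharing that suffix can be merged, keeping just the best-scoring one per suffix (a simplified sketch; real decoders also keep backpointers to recover alternatives):

```python
def recombine(hypotheses, context=2):
    """hypotheses: list of (words_tuple, score).
    Keep only the best-scoring hypothesis per length-`context` suffix,
    since future scores depend on nothing else in a finite-context model."""
    best = {}
    for words, score in hypotheses:
        key = words[-context:]  # the only part future scores depend on
        if key not in best or score > best[key][1]:
            best[key] = (words, score)
    return list(best.values())

hyps = [
    (("the", "big", "cat"), -1.2),
    (("a", "big", "cat"), -1.0),      # same last-two-word suffix as above
    (("the", "small", "dog"), -2.5),
]
merged = recombine(hyps, context=2)
# The two hypotheses ending in ("big", "cat") merge; the better one survives.
```

The first two hypotheses share a suffix, so only two merged hypotheses remain to be extended, yet both original continuations are implicitly covered.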
Unfortunately, in the infinite-context neural models, each partial translation is represented by a high-dimensional, continuous-valued state vector. In practice, these vectors are never equal for two different partial translations, so we cannot group them without introducing approximation error. The techniques mentioned above therefore do not save any scoring time, and we only consider a linear number of translations.
In my PhD, I will design and implement a coarse-to-fine decoding algorithm. It will start from a very coarse pass that yields a single group of candidate translations, and then iteratively split the groups with the highest error (or uncertainty) until it has produced the required number of translations. This will serve as the decoder inside an encoder-decoder MT model. The aim is to identify the clusters containing promising translations and split only those, without spending time on the many low-quality translations produced along the way.
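The refinement loop itself can be sketched as follows. The uncertainty measure and splitting rule below are placeholders chosen for illustration (score spread and a median split), not the final design:

```python
def uncertainty(cluster):
    """Toy uncertainty: spread of scores within the cluster."""
    scores = [score for _, score in cluster]
    return max(scores) - min(scores)

def split(cluster):
    """Toy refinement: divide the cluster around its median score."""
    ordered = sorted(cluster, key=lambda h: h[1])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

def coarse_to_fine(candidates, n_final):
    """Start from one coarse group; repeatedly split the most uncertain
    cluster until n_final clusters (candidate translations) are produced."""
    clusters = [candidates]
    while len(clusters) < n_final:
        worst = max(clusters, key=uncertainty)
        if len(worst) < 2:
            break  # nothing left to refine
        clusters.remove(worst)
        clusters.extend(split(worst))
    return clusters

cands = [("hyp%d" % i, s) for i, s in enumerate([-0.1, -0.5, -2.0, -2.2, -5.0, -5.1])]
groups = coarse_to_fine(cands, n_final=3)
```

The point of the loop is that effort is spent only where uncertainty is high: clusters that already look uniformly poor are never split, so their members are never scored individually.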
Following the analysis from my Master's thesis, I will first implement the algorithm with clustering based on word history, which is very fast to compute. If this algorithm works well (i.e. finds a good approximate translation in a short time), we will use it to build more sophisticated MT systems. For example, we will construct combined left-to-right and right-to-left translation models: to predict the next word, we could make a fast (coarse) pass through the left-to-right model to approximately complete the sentence, and then use this completion to obtain a score directly from the right-to-left model as well.
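The combined idea can be sketched with toy stand-ins for the two models (the interfaces and the dummy scoring below are assumptions for illustration only): a cheap greedy left-to-right pass completes the prefix, and the completion is then scored by the right-to-left model without any second search.

```python
def greedy_complete(prefix, l2r_next_word, max_len=10):
    """Cheap coarse pass: extend `prefix` by always taking the
    left-to-right model's single most likely next word."""
    words = list(prefix)
    while len(words) < max_len:
        nxt = l2r_next_word(tuple(words))
        if nxt == "</s>":
            break
        words.append(nxt)
    return tuple(words)

def combined_score(prefix, l2r_next_word, r2l_score):
    """Score an approximate completion of `prefix` with the
    right-to-left model, which reads the sentence reversed."""
    completion = greedy_complete(prefix, l2r_next_word)
    return r2l_score(tuple(reversed(completion)))

# Toy stand-ins for the two models:
continuations = {("the",): "cat", ("the", "cat"): "sat", ("the", "cat", "sat"): "</s>"}
l2r = lambda ctx: continuations.get(ctx, "</s>")
r2l = lambda rev_words: -0.1 * len(rev_words)  # dummy right-to-left score

score = combined_score(("the",), l2r, r2l)
```

The greedy completion is only an approximation of the best full sentence, but it gives the right-to-left model something concrete to score, at the cost of a single coarse pass per prefix.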