Learning Morphological Representations for Multilingual NLP


Clara Vania

Languages of the world exhibit different characteristics, which pose interesting challenges for the development of language technologies. We explore probabilistic modeling of morphology (the study of the internal structure of words), which is important for many languages. Languages with rich morphology encode grammatical functions such as subject, verb, or object by changing the surface form of words. As a result, these languages tend to have a very large number of word forms for a given base form.

Traditional approaches in natural language processing (NLP) mostly model each form as a separate word. However, these approaches are not optimal: they cannot model the relationships between morphological variants of the same word, and they introduce a data sparsity problem. Recently, representation models based on neural networks were proposed to model morphology. These models build the representation of each word from its characters, making them compact (fewer parameters) and able to handle unseen words. What do these character-level models learn about morphology? Can these models replace the need for costly human morphological annotations? This project aims to answer these questions by evaluating different character-based representation models and analyzing their performance on multilingual NLP tasks such as language modeling and dependency parsing. Our study shows that character-level models are effective across many languages compared to traditional word-level approaches. However, their performance still does not match the predictive accuracy of models with access to explicit morphological information, suggesting that prior knowledge about morphology is still important for neural network-based models.


Supervisors: Adam Lopez & Sharon Goldwater