Generative Models and Discrete Structures for Machine Learning under Noise


Simao Eduardo

Most data science pipelines start with raw data, which is often dirty. Error detection and subsequent error repair are important steps in the data cleaning process. Moreover, data quality rules have to be enforced and updated constantly for the data to retain its value to the business.

Thus there is a need to infer, apply and monitor data quality rules (constraints, ontologies, or relations). Both error detection and repair often rely on these rules, which are either inferred automatically or specified by a data analyst or knowledge expert. The rules are often part of the schema (i.e. blueprint) of relational databases (KB) or knowledge graphs (KG). It is unrealistic to expect a human to perform rule discovery for data quality, or even error detection, particularly without any tools, given the intricacy and size of today’s datasets.

On the other hand, deep generative models (DGM) have become quite successful in unsupervised learning tasks. However, much work remains to be done to extend them so that they model discrete structures and rules and are robust to noise. In fact, several datasets are represented by graph structures, or by underlying symbolic information such as first-order logic clauses.

One very successful DGM approach is the Variational Autoencoder, a latent variable model. However, the power of deep generative models is still not being fully utilized to tackle these datasets. We aim to change this by proposing novel methods that tackle these types of datasets, usually under noisy conditions.