Machine Learning for Data Science: Reliability & Automation


Simao Eduardo

The wide adoption of machine learning for predictive modelling, in both science and industry, has led to its application in less-than-ideal situations, often without a principled approach. At the same time, several steps in the data science pipeline that could be automated are, in practice, still done manually.
For instance, in applied data science it is common to find dirty datasets being used for prediction; models applied to data outside their scope, where they are no longer representative (e.g. under concept drift); and situations where data wrangling or data repair could be (semi-)automated.
With this in mind, we think that machine learning should address the entire data analytics process (1), tackling it either holistically or one step at a time, whilst considering the reliability (2) of the predictions produced.
The larger data science pipeline (1) comprises several steps, including data wrangling, data cleaning, exploratory visualisation, data integration, model criticism and revision, and presentation of results to domain experts. In all of these steps machine learning could provide more automation, with or without user interaction (e.g. active learning).
In terms of reliability (2), machine learning techniques are often deployed in the presence of novel and potentially adversarial (worst-case) input distributions. Most traditional machine learning techniques are sensitive to these worst-case perturbations. There is also the question of whether the novelty should be learnt by the system: is it signal or noise? These aspects matter when human monitoring becomes infeasible, or when machine learning is used in critical applications, e.g. self-driving cars, stock markets, medical diagnosis.