‘Feature learning for large multi-category sparse datasets’

PhD Studentship in Feature learning for large multi-category sparse datasets.

Studentship suitable for EU/UK students only.

We seek strong candidates for a 4-year, joint Masters and PhD research studentship in the EPSRC Centre for Doctoral Training (CDT) in Data Science at the University of Edinburgh, on the topic Feature learning for large multi-category sparse datasets.  This studentship is co-funded by Sainsbury's Bank and jointly supervised by Dr. Amos Storkey, Dr. Nigel Goddard and a member of the Sainsbury’s Bank data science team.

This project will focus on automatic feature determination for data domains with large multi-category sparse datasets. 

Candidates ideally should have a strong mathematical and programming background; research interests in the development and application of machine learning methods to real-world problems; and the drive and ambition to become a future leader in this exciting area.

Project Description:

In machine learning and computer vision, the standard process of learning to classify, detect and predict traditionally involved hand-engineering salient features for data or images, and then using those features in standard machine learning algorithms for detection, prediction and calculation of risk. A large industry was built up around developing better hand-coded features.

More recently, deep convolutional neural network approaches have subsumed this effort: feature learning for images is now an automatic part of the end-to-end optimization of the full pipeline. Equivalent methods are appropriate for images, game boards, spatial data, sequence data etc., and these are the data domains that have seen substantial application of deep learning methods. However, there has been a notable absence of deep learning applications in the broader data domain. Real consumer datasets have characteristics that are quite different from images: fields with very different meanings, highly sparse data, contingent data (A can only happen if B does), relational structure and much higher noise/variation.

Application of modern automatic feature learning to these domains is in its infancy but is of critical importance in most application areas. In particular, it is vital for the future development of methods for customer data.

This project will focus on automatic feature determination for data domains with large multi-category sparse datasets.  Because such data is not amenable to convolutional methods, a different structural approach needs to be taken. We propose a number of partitioning mechanisms for breaking down and learning appropriate structural decompositions of such data to learn features that are most predictive of a task.

Specifically, we apply this to the problem of analysing retail customer data and predicting propensity to purchase banking products. Retail customer data takes a very specific form, and current approaches, e.g. using association rules to capture between-field relationships, do little towards automated, targeted feature representation. One particular direction of substantial promise is pairwise expand-and-collapse neural network models. Such models learn automated combinations of basket items that are of particular importance in characterising the predictive task. A second challenge would be developing interpretative methods. We would develop methods to re-project representations into representative baskets on the one hand and representative customers on the other, as a means of providing transparency about the computations a neural network is performing.

Overall the focus of the project is on extending current feature detection methods to data types that are useful for customer data.


Project Background:

Sainsbury’s Bank builds and maintains a suite of machine learning models that predict key customer behaviours linked to business performance, for example a model that predicts the likelihood that a supermarket customer will apply for a credit card.  Machine learning approaches expect data to be pre-processed in numerous ways to ensure the efficacy of the resultant model. Of particular interest is the process of transforming atomic data (i.e. supermarket basket data) into a suite of potentially predictive customer-level variables.


About the EPSRC CDT in Data Science

The CDT focuses on the computational principles, methods and systems for extracting knowledge from data. Large data sets are now generated by almost every activity in science, society and commerce - ranging from molecular biology to social media, from sustainable energy to health care.  Data science asks: How can we efficiently find patterns in these vast streams of data?  Many research areas have tackled parts of this problem:

  • machine learning focuses on finding patterns and making predictions from data;
  • databases are needed for efficiently accessing data and ensuring its quality;
  • ideas from algorithms are required to build systems that scale to big data streams;
  • the mathematical fields of statistics and optimization provide foundational tools and theory;
  • natural language processing, computer vision and speech processing consider the analysis of different types of unstructured data.

Recently, these distinct disciplines have begun to converge into a single field called Data Science.

The CDT is a 4-year programme: the first year provides Masters level training in the core areas of Data Science, along with a significant project. In years 2-4 students will carry out PhD research in Data Science, guided by PhD supervisors from within the Centre.  The CDT is funded by EPSRC and the University of Edinburgh.

Edinburgh has a large, world-class research community in Data Science to support the work of the CDT student cohort.  The city of Edinburgh has often been voted the 'best place to live in Britain' and has many exciting cultural and student activities.

How to Apply:

Prospective applicants are encouraged to contact Amos Storkey (a.storkey@ed.ac.uk) to discuss the studentship before submitting an application: http://datascience.inf.ed.ac.uk/apply/