Launch of the CDT in Data Science

3 Nov 2014

Distinguished Lectures to mark the Launch of the EPSRC CDT in Data Science

Date: Monday 3rd November 2014

Time: 14:30-17:30, followed by a reception

Location: G.07 & G.07A, Informatics Forum, University of Edinburgh


Professor Kathleen R. McKeown
Director, Columbia University Institute for Data Sciences and Engineering

Abstract: “At the Intersection of Data Science and Language”
Data science holds the promise to solve many of society’s most pressing challenges. But much of the data necessary to solve problems is locked within volumes of text and speech on the web. Thus, in many cases, data science can only succeed if paired with natural language processing. In this talk, I will discuss the data science initiative at Columbia University and research within its New Media Center, where we investigate the analysis of news, Twitter and online discussion, as well as texts coming from digital humanities, scientific and other disciplines. I will describe research projects that draw from scientific journals, from historical sources, from online media, and from novels.

Professor Fernando C. N. Pereira
Research Director, Google

Abstract: “Bottom-Up Semantics”
Advances in statistical and machine learning approaches to natural-language analysis have yielded a wealth of methods and applications in information retrieval, speech recognition, machine translation, and information extraction. Yet, even as we enjoy these advances, we recognize that our successes are to a large extent the result of clever exploitation of redundancy in language structure and use, allowing our algorithms to eke out a few useful bits that we can put to work in applications. By focusing on applications that extract a limited amount of information from the text, finer structures such as word order or syntactic structure could be largely ignored in information retrieval or speech recognition. However, by leaving out those finer details, our language-processing systems have been stuck in an "idiot savant" stage where they can find everything but cannot understand anything. Our main language processing challenge is to create robust, accurate, efficient methods that learn to understand the main entities and concepts discussed in any text, and the main claims made. These will enable our systems to answer questions more precisely, to verify and update knowledge bases, and to trace arguments for and against claims throughout the written record. I will argue with examples from my team’s research that we need deeper levels of linguistic analysis to do this. But I will also argue that it is possible to do much that is useful even with our very partial understanding of linguistic and computational semantics, by taking (again) advantage of distributional regularities and redundancy in large text collections to learn effective analysis and understanding rules.

Professor Richard Kenway, Vice Principal for High Performance Computing, and Professor Chris Williams, Director of the CDT in Data Science, from the University of Edinburgh will also give short presentations.

Please see the agenda for a complete outline of the event.

For information about registering, please contact



Professor Kathleen R. McKeown
Kathleen R. McKeown is the Henry and Gertrude Rothschild Professor of Computer Science at Columbia University, where she also serves as the Director of the Institute for Data Sciences and Engineering. She served as Department Chair from 1998 to 2003 and as Vice Dean for Research for the School of Engineering and Applied Science for two years. McKeown received a Ph.D. in Computer Science from the University of Pennsylvania in 1982 and has been at Columbia since then. Her research interests include text summarization, natural language generation, multi-media explanation, question-answering and multi-lingual applications. In 1985 she received a National Science Foundation Presidential Young Investigator Award, in 1991 she received a National Science Foundation Faculty Award for Women, in 1994 she was selected as a AAAI Fellow, in 2003 she was elected as an ACM Fellow, and in 2012 she was selected as one of the founding Fellows of the Association for Computational Linguistics. In 2010, she received the Anita Borg Institute Women of Vision Award in Innovation for her work on text summarization. McKeown is also quite active nationally. She has served as President, Vice President and Secretary-Treasurer of the Association for Computational Linguistics. She also served as a board member of the Computing Research Association and as secretary of the board.


Professor Fernando C. N. Pereira
Fernando Pereira is a distinguished researcher at Google, where he leads work on language understanding. His previous positions include chair of the Computer and Information Science department of the University of Pennsylvania, head of the Machine Learning and Information Retrieval department at AT&T Labs, and research and management positions at SRI International. He received a Ph.D. in Artificial Intelligence from the University of Edinburgh in 1982, and he has over 120 research publications on computational linguistics, machine learning, bioinformatics, speech recognition, and logic programming, as well as several patents. He was elected AAAI Fellow in 1991 for contributions to computational linguistics and logic programming, and ACM Fellow in 2010 for contributions to machine-learning models of natural language and biological sequences. He was president of the Association for Computational Linguistics in 1993.