Natural Language Processing for Intelligent Access to Scientific Information

#NLP4scipub on Twitter
Tutorial @ COLING 2016
on December 11th 2016 from 14:00 to 17:00
Horacio Saggion and Francesco Ronzano
Natural Language Processing Group
Department of Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona



During the last decade, the amount of scientific information available online has increased at an unprecedented rate. Recent estimates report that a new paper is published every 20 seconds.
As a consequence, researchers today are overwhelmed by an enormous and continuously growing number of articles to consider whenever they perform an activity that requires a careful and comprehensive assessment of the scientific literature, such as exploring advances in a specific topic, peer reviewing, or writing and evaluating proposals.

Natural Language Processing technology is a key enabling factor in providing scientists with intelligent means of accessing scientific information. Extracting information from scientific papers, for example, can contribute to the development of rich scientific knowledge bases, which can in turn be leveraged to support intelligent knowledge access and question answering. Summarization techniques can reduce long papers to their essential content or automatically generate state-of-the-art reviews. Paraphrase and textual entailment techniques can contribute to the identification of relations across different scientific textual sources. This tutorial provides an overview of the most relevant tasks related to the processing of scientific documents, including but not limited to the in-depth analysis of the structure of scientific articles, their semantic interpretation, content extraction, and summarization.


The tutorial provides an overview of how Information Extraction and Natural Language Processing techniques can be used to get the most out of the scientific literature. In particular, we will review current approaches, tools, language resources, and datasets useful for structurally and semantically analyzing a wide range of facets of scientific publications.
The following is a list of core topics that will be covered during the tutorial:

  • Scientific information overload: challenges and opportunities
  • Mining the structure of scientific publications: PDF files and XML data formats
  • The semantics of papers: scientific discourse and citation analysis
  • Extracting information from scientific publications
  • Scientific document summarization
  • Language resources for scientific text mining
  • Social Media and Science: new opportunities
  • Scientific text mining projects, portals and challenges
  • Hands-on demo of the Dr. Inventor Text Mining Framework
Tutorial Slides Download


Horacio Saggion holds a PhD in Computer Science from the Université de Montréal, Canada. He obtained his BSc in Computer Science from the Universidad de Buenos Aires in Argentina, and his MSc in Computer Science from UNICAMP in Brazil. Horacio is an Associate Professor at the Department of Information and Communication Technologies, Universitat Pompeu Fabra (UPF), Barcelona. He is a member of the Natural Language Processing group, where he works on automatic text summarization, text simplification, information extraction, sentiment analysis, and related topics. His research is empirical, combining symbolic, pattern-based approaches with statistical and machine learning techniques. Before joining Universitat Pompeu Fabra, he worked at the University of Sheffield on a number of UK and European research projects (SOCIS, MUMIS, MUSING, GATE, CUBREPORTER), developing competitive human language technology. He was also an invited researcher at Johns Hopkins University for a project on multilingual text summarization. He is currently principal investigator for UPF in several EU and national projects. Horacio has published over 100 works in leading scientific journals, conferences, and books in the field of human language technology. He has organized four international workshops in the areas of text summarization and information extraction and was co-chair of STIL 2009. He is co-editor of a book on multilingual, multisource information extraction and summarization, published by Springer in 2013. Horacio is a member of the ACL, IEEE, ACM, and SADIO. He is a regular programme committee member for international conferences such as ACL, EACL, COLING, EMNLP, IJCNLP, and IJCAI, and is an active reviewer for international journals in computer science, information processing, and human language technology. Horacio has given courses, tutorials, and invited talks at a number of international events, including LREC, ESSLLI, IJCNLP, NLDB, and RuSSIR.
Francesco Ronzano holds a PhD in Information Engineering from the University of Pisa, Italy. Francesco is currently a researcher in the Natural Language Processing Group (TALN) at the Department of Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona, where he works on machine learning approaches to information extraction and text summarization, with a special focus on scientific publishing and social media analysis. Francesco has several years of research experience, mainly in the context of national (Italian and Spanish) and European research projects on the exploitation of machine learning approaches and Web technologies to advance language technologies. His research interests include online data semantics, machine learning, knowledge representation, and Semantic Web applications.
Francesco has contributed to more than 40 publications, including book chapters, journal articles, and conference papers. He has served as a reviewer for international conferences including AAAI, EMNLP, LREC, and RANLP.


The following is a list of core references related to this tutorial: