A Multi-level Annotated Corpus of Scientific Papers for Scientific Document Summarization


Our corpus provides: (i) related work sections, (ii) a manually annotated layer of cited papers and sentences, (iii) citing papers referring to the cited papers in the related work section, and (iv) a layer of rich linguistic, rhetorical, and semantic annotations computed automatically. While the manually identified cited sentences support the study of sequence-to-sequence models in scientific summarization, the new layer of citing papers facilitates testing citation-based summarization approaches that rely on citation networks to assess sentence relevance.

The corpus includes twenty articles from sources such as the Special Interest Group on Information Retrieval (SIGIR), the Association for Computational Linguistics (ACL), the North American Chapter of the Association for Computational Linguistics (NAACL), the Conference on Empirical Methods in Natural Language Processing (EMNLP), and the International Conference on Computational Linguistics (COLING).





Versions:




Basic Corpus (XML Format)

Contains basic information about each scientific paper's contents, including title, authors, affiliations, abstract, and paper sections. It also contains the manual annotations that identify the sentences of each scientific paper by assigning a sentence ID for the GATE documents; this ID was used to help map the sentences during the annotation process.



Rich Corpus (GATE Format)

Same as the Basic Corpus, with many automatic annotations specific to scientific papers added. Each GATE document was annotated using processing resources from the GATE system, the SUMMA library, and the freely available Dr Inventor library (DRI Framework). The tools semantically enrich the corpus by providing rhetorical annotation, causality identification, coreference, and BabelNet synsets. The SUMMA library was used to produce different normalized term vectors for each document. Vectors of terms and BabelNet synsets are created using tf*idf weighting computed from a corpus of 4K ACL scientific papers.
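
For readers unfamiliar with tf*idf weighting, the following is a minimal Python sketch of how a normalized term vector can be built from a background corpus. It is not the SUMMA implementation: the function names (build_idf, tfidf_vector), the smoothed idf formula, the L2 normalization, and the toy data are all assumptions made for illustration, and the corpus may use a different variant.

```python
import math
from collections import Counter


def build_idf(background_docs):
    """Return (idf table, number of documents) for a tokenized background corpus.

    Assumption: smoothed idf = log((1 + N) / (1 + df)) + 1, one common variant.
    """
    n_docs = len(background_docs)
    doc_freq = Counter()
    for tokens in background_docs:
        # Count each term once per document.
        doc_freq.update(set(tokens))
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }
    return idf, n_docs


def tfidf_vector(tokens, idf, n_docs):
    """Return an L2-normalized tf*idf vector (term -> weight) for one document."""
    term_freq = Counter(tokens)
    # Terms unseen in the background corpus are treated as df = 0.
    unseen_idf = math.log(1 + n_docs) + 1.0
    weights = {
        term: count * idf.get(term, unseen_idf)
        for term, count in term_freq.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return {}
    return {term: w / norm for term, w in weights.items()}


# Toy background corpus standing in for the 4K ACL papers (hypothetical data).
background = [
    ["summarization", "citation"],
    ["citation", "network"],
    ["citation", "corpus"],
]
idf, n_docs = build_idf(background)
vector = tfidf_vector(["citation", "summarization"], idf, n_docs)
```

In this toy example "citation" occurs in every background document, so it receives a lower weight than "summarization" even though both occur once in the target document. The same scheme could be applied to BabelNet synset identifiers instead of surface terms by passing synset IDs as the tokens.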