Surface Realization Shared Task 2018

The Generation Challenges

Task description

As in SR’11, the proposed shared task comprises two tracks with different levels of complexity: the Shallow Track (Track 1) and the Deep Track (Track 2). Either or both of the tracks can be addressed by the participating teams.

Important Dates


The datasets are derived from the Universal Dependency treebanks V2.0, that is, from the data used for the CoNLL'17 shared task on Multilingual Parsing from Raw Text to Universal Dependencies.

Licensing: The Dutch, English, Finnish, French, Portuguese and Spanish datasets are released under the CC BY-SA license. The Arabic, Czech, Italian and Russian datasets are released under the CC BY-NC-SA license. Please refer to the original page of the Universal Dependency treebanks V2.0 for more details on the original datasets and their licensing.

Format: Following the source datasets, the inputs to the Shallow and Deep Tracks are distributed in the 10-column CoNLL-U format.

Training and development data: Download
Evaluation data: Download
Documentation: Open

For the input to the Shallow Track, the UD structures are processed as follows:
  1. the information on word order is removed by randomized scrambling;
  2. the words are replaced by their lemmas.
Shallow datasets are available for the following languages: Arabic, Czech, Dutch, English, Finnish, French, Italian, Portuguese, Russian and Spanish.
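The two Shallow Track processing steps above can be sketched in a few lines of Python. This is only an illustration, not the organizers' conversion code: the function name to_shallow, the fixed random seed, and the assumption that each sentence is a list of 10-column CoNLL-U rows with plain integer IDs (no multiword-token ranges or empty nodes) are all hypothetical.

```python
import random

def to_shallow(rows, seed=0):
    """Scramble token order and replace word forms by lemmas.

    rows: one sentence as a list of 10-column CoNLL-U rows (lists of strings).
    Assumes plain integer IDs only (hypothetical simplification).
    """
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)  # step 1: remove word-order information

    # Map every original ID to its new position; the root head "0" is unchanged.
    new_id = {"0": "0"}
    for new_pos, old_idx in enumerate(order, start=1):
        new_id[rows[old_idx][0]] = str(new_pos)

    shallow = []
    for old_idx in order:
        row = list(rows[old_idx])
        row[0] = new_id[row[0]]   # renumbered ID
        row[1] = row[2]           # step 2: FORM replaced by LEMMA
        row[6] = new_id[row[6]]   # HEAD re-pointed to the renumbered ID
        shallow.append(row)
    return shallow

sample = [
    ["1", "The", "the", "DET", "DT", "_", "2", "det", "_", "_"],
    ["2", "dogs", "dog", "NOUN", "NNS", "_", "3", "nsubj", "_", "_"],
    ["3", "barked", "bark", "VERB", "VBD", "_", "0", "root", "_", "_"],
]
for row in to_shallow(sample):
    print("\t".join(row))
```

The dependency relations survive the scrambling because every HEAD value is rewritten with the same ID mapping as the tokens themselves.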

For the Deep Track, additionally:
  1. functional prepositions and conjunctions that can be inferred from other lexical units or from the syntactic structure are removed;
  2. determiners and auxiliaries are replaced (when needed) by attribute/value pairs, as, e.g., Definiteness, Aspect, and Mood;
  3. edge labels are generalized into predicate argument labels in the PropBank/NomBank fashion;
  4. morphological information coming from the syntactic structure or from agreements is removed;
  5. fine-grained PoS labels found in some treebanks are removed, and only coarse-grained ones are maintained.
Deep datasets are available for the following languages: English, French, Spanish.

See the Examples Section below for sample structures.


The test data can be downloaded here: Download. The compressed folder contains the templates to fill with the system outputs. If you submit outputs for both tracks, please submit two distinct folders, one for each track.

We perform both automatic and manual evaluations of the outputs of the systems. For the automatic evaluation, we compute scores with the following metrics:
  • BLEU: precision metric that computes the geometric mean of the n-gram precisions between the generated text and the reference texts and adds a brevity penalty for shorter sentences. We use the smoothed version and report results for n = 1, 2, 3, 4;
  • NIST: related n-gram similarity metric weighted in favour of less frequent n-grams, which are taken to be more informative;
  • Normalized edit distance (DIST): inverse, normalized, character-based string-edit distance that starts by computing the minimum number of character inserts, deletes and substitutions (all at cost 1) required to turn the system output into the (single) reference text.

For each metric, we calculate (i) system-level scores (if the metric permits it), and (ii) the mean of the sentence-level scores. Output texts are normalized prior to computing metrics by lower-casing all tokens, removing any extraneous whitespace characters and ensuring consistent treatment of ampersands.
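As a rough illustration of the DIST metric and the text normalization described above, the following Python sketch computes a character-level edit distance with unit costs and turns it into a similarity score. It is not the official evaluation script. Two points are assumptions made for this sketch: the distance is normalized by the length of the longer string, and ampersand handling is left out because the text does not say how it is done. The function names are hypothetical.

```python
import re

def normalize(text):
    """Lower-case the text and collapse extraneous whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def edit_distance(a, b):
    """Minimum number of character inserts, deletes and substitutions (cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def dist_score(system, reference):
    """Inverse normalized edit distance: 1.0 means identical after normalization."""
    s, r = normalize(system), normalize(reference)
    longest = max(len(s), len(r))  # assumed normalizer, see lead-in
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s, r) / float(longest)

print(dist_score("The  dog barked .", "the dog barked ."))
```

With this normalizer the score always falls between 0 and 1, so it can be read alongside the other metrics, where higher is better.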

    For the human-assessed evaluation, the evaluators are trained annotators in the case of the Google Data Compute evaluation, and crowdsourced annotators in the case of the Mechanical Turk evaluation. They assess sentences for Readability and Meaning Similarity with respect to a reference sentence. Quality assurance mechanisms ensure the quality of the collected annotations.
  • Readability instructions: "The quality criterion you need to assess is Readability. This is sometimes called ‘fluency’, and your task is to decide how well the given text reads; is it good fluent English, or does it have grammatical errors, awkward constructions, etc. Please rate the text by moving the slider to the position that corresponds to your rating, where 0 is the worst, and 100 is the best rating."
  • Meaning Similarity instructions: "The quality criterion you need to assess is Meaning Similarity. You need to read both texts, and then decide how close in meaning the second text (in black) is to the first (in grey). Please use the slider at the bottom of the page to express your rating. The closer in meaning the second text clipping is to the first, the further to the right (towards 100) you need to place the slider. In other words, a rating of 100% would mean that the meaning of the two text clippings is exactly identical."

  • Running the evaluation scripts

    The evaluation scripts can be downloaded here:
  • Python 2.7
  • Python 3.4

  • Requirements: Python 2.7 or 3.4, and NLTK. For a clean environment, virtualenv can be used:

    Mac OS / Ubuntu
  • Create virtual environment for python: 'virtualenv nx'
  • Activate the virtual environment: 'source nx/bin/activate'
  • Installing NLTK: 'pip install -U nltk'
  • Command line: python eval_Py2.py [system-dir] [reference-dir]
  • Use eval_Py3.py instead of eval_Py2.py if you use Python 3.4 or above.

  • Windows
  • Installing NLTK: 'pip install -U nltk'. In order to use pip, you may need to navigate to the directory that contains pip.exe through the command line before running the command.
  • Command line: eval_Py2.py [system-dir] [reference-dir]
  • Use eval_Py3.py instead of eval_Py2.py if you use Python 3.4 or above.

  • The [system-dir] folder contains the output of the system and the [reference-dir] folder contains the reference sentences. For each file found in the system directory [system-dir], the evaluation script looks up a file with the same name in the reference directory and applies BLEU, NIST and normalized edit distance to the pair.
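The file lookup described above can be pictured with the following Python sketch. It is not taken from eval_Py2.py or eval_Py3.py; the function name pair_files and the decision to collect unmatched file names instead of raising an error are assumptions made for illustration.

```python
import os

def pair_files(system_dir, reference_dir):
    """Pair each system output file with the same-named reference file.

    Returns (pairs, missing): pairs is a list of (system_path, reference_path),
    missing lists system file names that have no counterpart in reference_dir.
    """
    pairs, missing = [], []
    for name in sorted(os.listdir(system_dir)):
        sys_path = os.path.join(system_dir, name)
        if not os.path.isfile(sys_path):
            continue  # skip sub-directories
        ref_path = os.path.join(reference_dir, name)
        if os.path.isfile(ref_path):
            pairs.append((sys_path, ref_path))
        else:
            missing.append(name)
    return pairs, missing
```

Checking the missing list before scoring is a cheap way to catch output files whose names were changed from those of the provided templates.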


    21 teams registered for the task, and 8 submitted results: ADAPT, AX, IIT-BHU, OSU, NILC, Tilburg, DipInfo-UniTo and BinLin. The descriptions of the systems, the task overview and the results can be found in the MSR workshop proceedings.

    Automatic evaluation results:

    Human evaluation results - Google Data Compute:

    Human evaluation results - Mechanical Turk:


  • The outputs of the different systems and some of the human evaluations will be released soon.

  • 01/10
  • To cite the task, please use the following reference:

  • @inproceedings{mille-EtAl:2018:MSR-SR,
    title={The {F}irst {M}ultilingual {S}urface {R}ealisation {S}hared {T}ask ({SR}'18): {O}verview and {E}valuation {R}esults},
    author={Mille, Simon and Belz, Anja and Bohnet, Bernd and Graham, Yvette and Pitler, Emily and Wanner, Leo},
    booktitle={Proceedings of the 1st Workshop on Multilingual Surface Realisation (MSR), 56th Annual Meeting of the Association for Computational Linguistics ({ACL})},
    address={Melbourne, Australia},

  • The final dates for system outputs and descriptions submissions have been established.

  • 09/04
  • The test data and evaluation scripts are available! See Evaluation.

  • 19/03
  • There are no restrictions with respect to the use of external resources for the Shared Task (corpora, lexicons, etc.).
  • The test sets, evaluation scripts and templates for outputs will be available on April 9th.
  • Once the evaluation results are available, each team will be asked if the results of their system can be released publicly, and if not, will have the possibility to anonymize their team name, which implies not publishing a system description paper.
  • Confirm participation: if you already know that you will or won't submit at least one system, please notify us; knowing in advance roughly how many systems we will be evaluating will really help us.

  • Registration and Submission

    The task is now over!

    Output specification
    Teams should submit a single UTF-8 encoded text file in the format described here, aligned with the respective input files. Every output sentence has to start with the text marker '# text = ' and be preceded by its sentence ID ('# sent_id = '). A file that contains the sentence IDs aligned with the test data, together with empty '# text = ' fields, is provided to the participants (see Evaluation). For null outputs, the '# text = ' field should remain empty.
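To make the output specification concrete, here is a small Python sketch that writes and re-reads the two marker lines. The helper names, the sentence IDs, the sample sentence and the blank line between entries are assumptions for illustration. The provided template files remain the authoritative reference for the layout.

```python
def format_output(sentences):
    """Render (sent_id, text) pairs; text may be '' or None for a null output."""
    blocks = []
    for sent_id, text in sentences:
        blocks.append("# sent_id = {}\n# text = {}".format(sent_id, text or ""))
    # Blank line between entries is an assumption, not part of the specification.
    return "\n\n".join(blocks) + "\n"

def parse_output(content):
    """Read the markers back into a {sent_id: text} dictionary."""
    outputs = {}
    current = None
    for line in content.splitlines():
        if line.startswith("# sent_id ="):
            current = line.split("=", 1)[1].strip()
        elif line.startswith("# text =") and current is not None:
            outputs[current] = line.split("=", 1)[1].strip()
            current = None
    return outputs

# Hypothetical IDs and text; the second entry is a null output.
print(format_output([("1", "the dog barked ."), ("2", "")]))
```

When saving the result with Python 3, opening the file with encoding="utf-8" satisfies the encoding requirement stated above.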

    Number of outputs
    We allow one output per system; each team is allowed to submit several different systems, but please avoid submitting variations of what is essentially the same system. We may have to limit each team to a single nominated system for the human evaluations.


    The structures for Tracks 1 and 2 are connected trees. The data has the same columns as the original CoNLL-U format; however, for SR'18, the reference (original) sentences are stored in separate files, aligned with the respective structures.


    Sample original CoNLL-U file for English:


    Sample Track 1 Input for English (CoNLL-U):


    Sample Track 1 Input for English (graphic):


    Sample Track 2 Input for English (CoNLL-U):


    The nodes of the Track 2 structures are aligned with the nodes of the Track 1 structures through the attributes id1, id2, id3, etc. Each node can correspond to 0 to 6 superficial nodes. More examples of multiple correspondences are shown in the French and the Spanish examples, in particular with the auxiliary and case relations.
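The alignment attributes can be read with a few lines of Python. This sketch assumes that id1, id2, etc. are stored as key=value pairs in a '|'-separated feature string, as in the CoNLL-U FEATS column. That layout and the function name aligned_ids are assumptions, so check them against the released files.

```python
import re

def aligned_ids(feats):
    """Return the Track 1 node IDs referenced by id1, id2, ... in numeric key order."""
    found = []
    for item in feats.split("|"):
        key, sep, value = item.partition("=")
        if sep and re.fullmatch(r"id\d+", key):
            found.append((int(key[2:]), value))
    return [value for _, value in sorted(found)]

# Hypothetical feature string for a Track 2 node aligned with two Track 1 nodes.
print(aligned_ids("Tense=Past|id2=7|id1=4"))
```

A node with no idN attribute yields an empty list, which corresponds to the case of zero surface correspondences mentioned above.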

    Sample Track 2 Input for English (graphic):



    Sample original CoNLL-U file for French (English gloss: "The charges laid against Kadhafi will be officially abandoned when the Court receives the evidence that he was killed last October 2nd"):


    Sample Track 1 Input for French (CoNLL-U):


    Sample Track 1 Input for French (graphic):


    Sample Track 2 Input for French (CoNLL-U):


    Sample Track 2 Input for French (graphic):



    Sample original CoNLL-U file for Spanish (English gloss: "The criminal has been able to know that Don Antonio was going to marry shortly a beautiful and honest young woman, whom we have been unable to locate"):


    Sample Track 1 Input for Spanish (CoNLL-U):


    Sample Track 1 Input for Spanish (graphic):


    Sample Track 2 Input for Spanish (CoNLL-U):


    Sample Track 2 Input for Spanish (graphic):



    The results of the Surface Realization Shared Task 2018 were presented during the Multilingual Surface Realization Workshop at ACL 2018.

    A previous pilot surface realization task was run in 2011 (Pilot SR'11) as part of Generation Challenges 2011 (GenChal’11), which was the fifth round of shared-task evaluation competitions (STECs) involving the generation of natural language. The results session for all GenChal’11 tasks was held as an integral part of ENLG’11 in Nancy, which attracted around 50 delegates. GenChal’11 followed four previous events: the Pilot Attribute Selection for Generating Referring Expressions (ASGRE) Challenge in 2007, which held its results meeting at UCNLG+MT in Copenhagen, Denmark; the Referring Expression Generation (REG) Challenges in 2008, with a results meeting at INLG’08 in Ohio, US; Generation Challenges 2009, with a results meeting at ENLG’09 in Athens, Greece; and Generation Challenges 2010, with a results meeting at INLG’10 in Trim, Ireland.


    If you have any questions or comments, please write to us: srst18@upf.edu