SRST-2019

Surface Realization Shared Task 2019

April 5th - August 19th 2019

SR'19 - SIGGEN event

Multilingual Surface Realization Workshop

SIGGEN

Task

Shallow Track (Track 1): This track starts from genuine UD structures from which word order information has been removed and the tokens have been lemmatized, i.e., from unordered dependency trees with lemmatized nodes that hold PoS tags and morphological information as found in the original annotations. The task consists in determining the word order and inflecting the words.
Deep Track (Track 2): This track starts from UD structures from which functional words (in particular, auxiliaries, functional prepositions and conjunctions) and surface-oriented morphological information have been removed. In addition to what has to be done for the Shallow Track, the Deep Track thus consists in introducing the removed functional words and morphological features.

Either or both of the tracks can be addressed by the participating teams.

Closed task: to improve the comparability of the results, no annotated data other than those provided by the organizers may be used for training the systems. However, participants may use available parsers to create a silver standard version of the provided datasets and use it as additional/alternative training material. Furthermore, some publicly available off-the-shelf language models suggested by the participants may be used (e.g. Word2vec, ELMo, BERT, GPT-2). Please contact the organizers to suggest resources; a list will be maintained in the Data section.
Evaluation: the test data may include both in-domain and out-of-domain data. For some languages, automatically parsed texts will also need to be generated.
Data alignment: in the training sets, the Shallow data is fully aligned with the original UD structures, and the Deep data is fully aligned with both the original and Shallow structures.
Relative word order information: relative word order information in multiword expressions, punctuation and multiple coordinations is available in the input.
Multiple datasets: for some languages, there are two or more UD datasets; the teams are allowed to choose which dataset(s) they want to use for training their models.

Registration

Closed!

Important Dates

Note that due to a short time frame for the human evaluations, the date of the output collection will not be changed.

~~April 5, 2019~~

~~August 3, 2019~~

~~August 19, 2019~~

*No extension*

~~August 30, 2019~~

~~September 10, 2019~~

~~September 21, 2019~~

~~September 30, 2019~~

November 3, 2019

MSR

Information

08/10

The program of the day is now available on the MSR workshop page.

16/09

The dates of the workshop have been confirmed, and the results and participating systems of the shared task will be presented on November 3rd at EMNLP-IJCNLP'19.

30/08

The anonymized results of the automatic evaluations were released to the participants.

20/08

The task is now closed! 14 teams submitted outputs; the results of the automatic evaluations will be released in the coming days.

12/08

Additional information about the output specification has been added.

03/08

The evaluation datasets and instructions for submission have been sent to the participants. Additional information about fine-tuning the off-the-shelf models has been provided in the Data section.

24/07

The samples provided in the Data section have been clarified (training vs. test/input samples).

19/07

New pre-trained models suggested by participants have been added to the list.

25/06

New information about the evaluation procedure has been released.

15/04

Additional information on the training data: It is allowed to use available parsers to create a silver standard version of the provided datasets and use them as training material.

05/04

The shared task has been launched!

Data

Training and development data

Generation Challenges (GenChal) repository

UD page

Paper about SR'18 dataset (INLG 2018)

Download

More informal documentation

Open

CoNLL'18 shared task

10-column CoNLL-U format

UD page

Shallow Track

The information on word order was removed by randomized scrambling.
The words were replaced by their lemmas.
Two features were added to store information about (i) relative linear order with respect to the governor (lin), and (ii) alignments with the original structures (original_id, in the training data only).

Languages: Arabic, Chinese, English (4 datasets), French (3), Hindi, Indonesian, Japanese, Korean (2), Portuguese (2), Russian (2) and Spanish (2).
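The Shallow Track conversion steps listed above can be illustrated with a minimal Python sketch. It is not the organizers' conversion script: the function name `to_shallow`, the choice to write the lemma into the form column, and the placement of `original_id` at the end of the feature column are assumptions made for illustration, and the `lin` feature is left out entirely.

```python
import random

# Toy CoNLL-U sentence ("The dogs barked ."), 10 tab-separated columns per token.
SAMPLE = "\n".join([
    "# text = The dogs barked .",
    "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_",
    "2\tdogs\tdog\tNOUN\tNNS\tNumber=Plur\t3\tnsubj\t_\t_",
    "3\tbarked\tbark\tVERB\tVBD\tTense=Past\t0\troot\t_\t_",
    "4\t.\t.\tPUNCT\t.\t_\t3\tpunct\t_\t_",
])


def to_shallow(conllu_sentence, seed=0):
    """Scramble token order, replace forms by lemmas, keep an alignment feature."""
    rows = [line.split("\t") for line in conllu_sentence.strip().splitlines()
            if line and not line.startswith("#")]
    # Keep plain integer IDs only (skip multiword ranges such as 1-2).
    rows = [r for r in rows if r[0].isdigit()]
    by_id = {r[0]: r for r in rows}

    # Randomized scrambling of the linear order.
    order = [r[0] for r in rows]
    random.Random(seed).shuffle(order)
    new_id = {old: str(i + 1) for i, old in enumerate(order)}
    new_id["0"] = "0"  # the artificial root keeps ID 0

    shallow = []
    for old in order:
        r = by_id[old]
        align = "original_id=" + old  # alignment with the original structure
        feats = align if r[5] == "_" else r[5] + "|" + align
        # Assumption: the lemma is written into the FORM column as well.
        shallow.append([new_id[old], r[2], r[2], r[3], r[4], feats,
                        new_id[r[6]], r[7], "_", "_"])
    return shallow


if __name__ == "__main__":
    for token in to_shallow(SAMPLE, seed=1):
        print("\t".join(token))
```

Because heads are remapped through the same ID table as the tokens, the dependency tree is preserved while the surface order is lost.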

Deep Track

Functional prepositions and conjunctions that can be inferred from other lexical units or from the syntactic structure were removed.
Determiners and auxiliaries were replaced (when needed) by attribute/value pairs such as Definiteness, Aspect, and Mood.
Edge labels were generalized into predicate argument labels in the PropBank/NomBank fashion.
Morphological information coming from the syntactic structure or from agreements was removed.
Fine-grained PoS labels found in some treebanks were removed, and only coarse-grained ones were maintained.
Languages: English (4 datasets), French (3) and Spanish (2).
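As a rough illustration of the first two Deep Track steps above, the sketch below drops leaf function words and copies a determiner's definiteness onto its governor. It is a simplification under stated assumptions, not the organizers' procedure: the set `FUNCTIONAL_DEPRELS`, the function `drop_function_words`, and the dictionary-based token layout are hypothetical, and the relabelling of edges and the removal of agreement features are not covered.

```python
# Assumption: these UD relations mark the function words to remove.
FUNCTIONAL_DEPRELS = {"aux", "case", "det", "cop"}


def drop_function_words(tokens):
    """tokens: list of dicts with keys id, lemma, feats (dict), head, deprel."""
    heads = {t["head"] for t in tokens}
    kept = []
    inherited = {}  # governor id -> attribute/value pairs taken from removed words
    for t in tokens:
        is_leaf = t["id"] not in heads
        if is_leaf and t["deprel"] in FUNCTIONAL_DEPRELS:
            # Keep the determiner's definiteness as a feature of its governor.
            if "Definite" in t["feats"]:
                inherited.setdefault(t["head"], {})["Definite"] = t["feats"]["Definite"]
            continue
        kept.append(t)

    # Renumber the remaining nodes; only leaves were removed, so every
    # remaining head is either another kept node or the root (0).
    renumber = {t["id"]: i + 1 for i, t in enumerate(kept)}
    renumber[0] = 0
    return [
        {**t,
         "id": renumber[t["id"]],
         "head": renumber[t["head"]],
         "feats": {**t["feats"], **inherited.get(t["id"], {})}}
        for t in kept
    ]


# "The dogs have barked": the determiner and the auxiliary are removed.
SENTENCE = [
    {"id": 1, "lemma": "the", "feats": {"Definite": "Def"}, "head": 2, "deprel": "det"},
    {"id": 2, "lemma": "dog", "feats": {"Number": "Plur"}, "head": 4, "deprel": "nsubj"},
    {"id": 3, "lemma": "have", "feats": {}, "head": 4, "deprel": "aux"},
    {"id": 4, "lemma": "bark", "feats": {}, "head": 0, "deprel": "root"},
]

if __name__ == "__main__":
    for node in drop_function_words(SENTENCE):
        print(node)
```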

List of authorized additional resources (to be updated)

Word2vec, including branches such as Word2vecf
ELMo
BERT
GPT-2
polyglot
GloVe
UD parsers such as UUParser
The above models can be fine-tuned if needed using publicly available datasets such as WikiText and the DeepMind Q&A dataset

Example

Sample original CoNLL-U file for English: EN_original

Track 1 training sample for English (CoNLL-U): EN_Track1_conllu

Track 1 input (dev/test) sample for English (CoNLL-U): the alignments with the surface tokens are not provided. EN_Track1_conllu

Track 1 sample structure for English (graphic): EN_Track1

Track 2 training sample for English (CoNLL-U): EN_Track2_conllu

auxiliary

case

det

cop

Track 2 input (dev/test) sample for English (CoNLL-U): the alignments with the surface and Track 1 tokens are not provided. EN_Track2_conllu

Track 2 sample structure for English (graphic): EN_Track2

Evaluation

tokenized

detokenized

human evaluation

Graham et al., 2016

automatic evaluation

BLEU: precision metric that computes the geometric mean of the n-gram precisions between the generated text and the reference texts and adds a brevity penalty for shorter sentences. We use the smoothed version and report results for n = 1, 2, 3, 4;
NIST: related n-gram similarity metric weighted in favour of less frequent n-grams, which are taken to be more informative;
Normalized edit distance: inverse, normalized, character-based string-edit distance that starts by computing the minimum number of character inserts, deletes and substitutions (all at cost 1) required to turn the system output into the (single) reference text.
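The normalized edit distance described above can be sketched in a few lines of Python. This is an illustration, not the code of the official evaluation scripts: the function names are hypothetical, and normalizing by the length of the longer string is an assumption, since the description above does not state the normalizer.

```python
def edit_distance(system, reference):
    """Minimum number of character inserts, deletes and substitutions (cost 1 each)."""
    prev = list(range(len(reference) + 1))
    for i, s_char in enumerate(system, start=1):
        curr = [i]
        for j, r_char in enumerate(reference, start=1):
            curr.append(min(
                prev[j] + 1,                         # delete s_char
                curr[j - 1] + 1,                     # insert r_char
                prev[j - 1] + (s_char != r_char),    # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def dist_score(system, reference):
    """Inverse normalized distance: 1.0 for identical strings, lower is worse."""
    # Assumption: normalize by the longer of the two strings.
    longest = max(len(system), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(system, reference) / longest


if __name__ == "__main__":
    print(dist_score("the dog barked .", "the dogs barked ."))
```

Inverting the distance keeps the score aligned with BLEU and NIST, where higher values mean closer agreement with the reference.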

Running the evaluation scripts

SR19-eval (Compatible with Python 2 and Python 3)
~~Python 2.7~~
~~Python 3.4~~

Create a virtual environment for Python: 'virtualenv nx'.
Activate the virtual environment: 'source nx/bin/activate'.
Install NLTK: 'pip install -U nltk'.
Command line: python eval_Py2.py [system-dir] [reference-dir].
Use eval_Py3.py instead of eval_Py2.py if you use Python 3.4 or above.

Install NLTK: 'pip install -U nltk'. In order to use pip, you may need to navigate to the directory that contains pip.exe through the command line before running the command.
Command line: eval_Py2.py [system-dir] [reference-dir].
Use eval_Py3.py instead of eval_Py2.py if you use Python 3.4 or above.

SRST18 results

Submission

Submission of the results

msr.organizers@gmail.com

System description papers

Output specification

tokenized sample (Spanish dev example): #text = Elías Jaua , miembro del Congresillo , considera que los nuevos miembros del CNE deben tener experiencia para " dirigir procesos complejos " .
detokenized sample (Spanish dev example): #text = Elías Jaua, miembro del Congresillo, considera que los nuevos miembros del CNE deben tener experiencia para "dirigir procesos complejos".
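The difference between the two samples above can be made concrete with a small rule-based detokenizer. This is only a sketch for illustration: the `detokenize` function and its rules (attach closing punctuation to the left, pair up straight double quotes) are assumptions that cover this example, not a tool provided or required by the organizers.

```python
CLOSING_PUNCT = {",", ".", ";", ":", "!", "?"}

TOKENIZED = ('Elías Jaua , miembro del Congresillo , considera que los nuevos '
             'miembros del CNE deben tener experiencia para " dirigir procesos '
             'complejos " .')


def detokenize(tokenized):
    """Join whitespace-separated tokens into a detokenized sentence."""
    out = []
    inside_quote = False   # True between an opening and a closing quote
    attach_next = False    # True right after an opening quote
    for tok in tokenized.split():
        if tok == '"':
            if inside_quote:
                out[-1] += tok       # closing quote sticks to the previous token
                attach_next = False
            else:
                out.append(tok)      # opening quote starts a new chunk
                attach_next = True
            inside_quote = not inside_quote
        elif attach_next:
            out[-1] += tok           # first token after an opening quote
            attach_next = False
        elif tok in CLOSING_PUNCT and out:
            out[-1] += tok           # punctuation sticks to the previous token
        else:
            out.append(tok)
    return " ".join(out)


if __name__ == "__main__":
    print("#text = " + detokenize(TOKENIZED))
```

Language-specific conventions (for example opening marks such as ¿ and ¡, or apostrophes and brackets) would need additional rules.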

Number of outputs

History

first edition

MSR'18 workshop proceedings

Pilot SR'11

GenChal’11

Contact

msr.organizers@gmail.com

The shared task is organized by the Multilingual Surface Realization workshop committee: Anja Belz, Bernd Bohnet, Yvette Graham, Simon Mille and Leo Wanner.

References

Anja Belz, Michael White, Dominic Espinosa, Eric Kow, Deirdre Hogan, and Amanda Stent. 2011. The first surface realisation shared task: Overview and evaluation results. In Proceedings of the 13th European Workshop on Natural Language Generation, ENLG ’11, pages 217–226, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2016. Can Machine Translation Systems be Evaluated by the Crowd Alone? In Journal of Natural Language Engineering (JNLE), Firstview.

Simon Mille, Anja Belz, Bernd Bohnet, and Leo Wanner. 2018. Underspecified Universal Dependency Structures as Inputs for Multilingual Surface Realisation. In Proceedings of the 11th International Conference on Natural Language Generation, pages 199–209, Tilburg, The Netherlands.

Simon Mille, Anja Belz, Bernd Bohnet, Yvette Graham, Emily Pitler, and Leo Wanner. 2018. The First Multilingual Surface Realisation Shared Task (SR'18): Overview and Evaluation Results. In Proceedings of the 1st Workshop on Multilingual Surface Realisation (MSR), 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1–12, Melbourne, Australia.

Funding

Photo at the top of the page by Chris Lawton on Unsplash.