SRST-2020

Surface Realisation Shared Task 2020

July - October 2020

SR'20 Legacy task

Third Multilingual Surface Realisation Workshop

SIGGEN

Task

SR’19 Shared Task

Data

Shallow Track (Track 1): This track starts from vanilla UD structures in which word order information has been removed and tokens have been lemmatised, i.e. the inputs are unordered dependency trees with lemmatised nodes that contain PoS tags and morphological information as found in the original annotations. The task is equivalent to determining the word order and inflecting the words. As indicated above, there will be both a closed (T1a) and an open (T1b) subtrack.
Deep Track (Track 2): This track starts from UD structures from which functional words (in particular, auxiliaries, functional prepositions and conjunctions) and surface-oriented morphological information have been removed. In addition to what has to be done for the Shallow Track, the Deep Track thus involves introducing the removed functional words and morphological features. Again, there will be a closed (T2a) and an open (T2b) subtrack.

Participating teams can choose to address one or both of the two shared task tracks T1 and T2. However, in their chosen track(s), teams are required to submit outputs for the closed mode(s) (T1a and T2a). In addition, teams can choose to also submit to the open mode(s) (T1b and T2b). For instance, if you are mainly interested in the T1b task, you are also required to provide outputs for T1a, where you train your models with the provided data resources plus any of the specifically allowed additional resources.

Recommended papers and links

Dataset: Mille et al. (INLG'18)
Overview and results: SR'18, SR'19, SR'20
High-scoring systems: Elder et al. (ACL'20), Yu et al. (ACL'20)
NEW: SR20 system descriptions: MSR workshop proceedings (COLING'20)
NEW: SR20 submissions: Download

Registration


Generation Challenges (GenChal) repository

Important Dates

12 July 2020

~~10 September~~ 18 October 2020

25 September 2020

~~1 October~~ 18 October 2020

~~5 October~~ 21 October 2020

~~8 October~~ 23 October 2020

~~20 October~~ 28 October 2020

~~20 October~~ 31 October 2020

~~31 October~~ 2 November 2020

12 December 2020

MSR

Information

04/12

The link to access the presentations and to attend the workshop is now available: Underline. The programme is available on the MSR workshop page.

07/10

Due to exceptional circumstances, the deadlines for submissions and the human evaluation have been postponed.

26/09

The test sets have been released! Please register for the task to receive the data.

08/09

The deadline for registration for the task has been extended to October 1! Note that we might not be able to guarantee participation in the human evaluation in case of late registration.

24/08

The tool for producing T1 and T2 structures from UD structures is now available on GitLab.

12/07

The shared task has been publicly announced and is open!

04/07

The shared task will be launched soon.

Data


Training and development data

Generation Challenges repository

UD page

evaluation section

~~September 10th~~

Conversion tool

GitLab

CoNLL'18 shared task

10-column CoNLL-U format

UD page

Shallow Track

The information on word order was removed by randomised scrambling.
The words were replaced by their lemmas.
Two features were added to store information about (i) relative linear order with respect to the governor (lin), and (ii) alignments with the original structures (original_id, in the training data only).

Languages: Arabic, Chinese, English (4 datasets), French (3), Hindi, Indonesian, Japanese, Korean (2), Portuguese (2), Russian (2) and Spanish (2).
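The three Shallow Track steps above can be illustrated with a minimal Python sketch. It is not the official conversion tool: the sample sentence is made up, the function names are hypothetical, and the encoding of lin as a plain +1/-1 sign (after/before the governor) is a simplifying assumption, as is placing the lemma in the FORM column.

```python
import random

# A tiny, made-up CoNLL-U sentence (10 tab-separated columns per token).
SAMPLE = "\n".join([
    "# text = The dogs barked.",
    "\t".join(["1", "The", "the", "DET", "DT", "Definite=Def", "2", "det", "_", "_"]),
    "\t".join(["2", "dogs", "dog", "NOUN", "NNS", "Number=Plur", "3", "nsubj", "_", "_"]),
    "\t".join(["3", "barked", "bark", "VERB", "VBD", "Tense=Past", "0", "root", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", ".", "_", "3", "punct", "_", "_"]),
])


def parse_conllu(block):
    """Parse one CoNLL-U sentence into a list of 10-column token rows."""
    rows = []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Keep plain tokens only; skip multiword ranges ("1-2") and empty nodes ("1.1").
        if len(cols) == 10 and cols[0].isdigit():
            rows.append(cols)
    return rows


def to_shallow_input(rows, seed=0, keep_alignment=True):
    """Scramble token order, replace words by lemmas, add lin/original_id features."""
    rng = random.Random(seed)
    order = [r[0] for r in rows]
    rng.shuffle(order)  # randomised scrambling of the word order
    new_id = {old: str(pos) for pos, old in enumerate(order, start=1)}
    by_id = {r[0]: r for r in rows}
    out = []
    for old in order:
        idx, _form, lemma, upos, xpos, feats, head, deprel, _deps, _misc = by_id[old]
        items = [] if feats == "_" else feats.split("|")
        if head != "0":
            # Assumed encoding: +1 if the token followed its governor, -1 otherwise.
            items.append("lin=+1" if int(idx) > int(head) else "lin=-1")
        if keep_alignment:
            # Alignment with the original structure (training data only).
            items.append("original_id=" + idx)
        out.append([
            new_id[idx], lemma, "_", upos, xpos,
            "|".join(items) or "_",
            "0" if head == "0" else new_id[head],
            deprel, "_", "_",
        ])
    return out


if __name__ == "__main__":
    for row in to_shallow_input(parse_conllu(SAMPLE)):
        print("\t".join(row))
```

Calling to_shallow_input with keep_alignment=False mimics the dev/test inputs, for which the alignments are not provided.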

Deep Track

Functional prepositions and conjunctions that can be inferred from other lexical units or from the syntactic structure were removed.
Determiners and auxiliaries were replaced (when needed) by attribute/value pairs such as Definiteness, Aspect, and Mood.
Edge labels were generalised into predicate argument labels in the PropBank/NomBank fashion.
Morphological information coming from the syntactic structure or from agreements was removed.
Fine-grained PoS labels found in some treebanks were removed, and only coarse-grained ones were maintained.
Languages: English (4 datasets), French (3) and Spanish (2).
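As a rough illustration of the first two Deep Track steps, the toy Python sketch below drops leaf tokens attached with a functional relation and records a determiner as an attribute/value pair on its governor. Everything in it is an assumption made for illustration: the token dictionaries, the relation set, the lemma-to-value table and the values "Def"/"Indef" are hypothetical. Unlike the real conversion, it removes every leaf functional token rather than only the inferable ones, and it does not generalise edge labels or remove morphological information.

```python
import copy

# Hypothetical set of relations treated as functional in this sketch.
FUNCTION_RELS = {"aux", "case", "det", "cop"}
# Hypothetical mapping from determiner lemma to a Definiteness value.
DEFINITENESS = {"the": "Def", "a": "Indef", "an": "Indef"}


def to_deep_sketch(tokens):
    """Return a copy of the tokens without leaf function words.

    Each token is a dict with the keys id, lemma, head, deprel and feats (a dict).
    """
    tokens = copy.deepcopy(tokens)  # leave the caller's data untouched
    by_id = {t["id"]: t for t in tokens}
    governors = {t["head"] for t in tokens}
    kept = []
    for tok in tokens:
        is_leaf = tok["id"] not in governors
        if tok["deprel"] in FUNCTION_RELS and is_leaf:
            value = DEFINITENESS.get(tok["lemma"])
            if tok["deprel"] == "det" and value:
                # Keep the determiner's contribution as an attribute/value pair.
                by_id[tok["head"]]["feats"]["Definiteness"] = value
            continue  # the function word itself is dropped
        kept.append(tok)
    return kept


# Made-up analysis of "The dog is in the garden".
EXAMPLE = [
    {"id": 1, "lemma": "the", "head": 2, "deprel": "det", "feats": {}},
    {"id": 2, "lemma": "dog", "head": 6, "deprel": "nsubj", "feats": {}},
    {"id": 3, "lemma": "be", "head": 6, "deprel": "cop", "feats": {}},
    {"id": 4, "lemma": "in", "head": 6, "deprel": "case", "feats": {}},
    {"id": 5, "lemma": "the", "head": 6, "deprel": "det", "feats": {}},
    {"id": 6, "lemma": "garden", "head": 0, "deprel": "root", "feats": {}},
]

if __name__ == "__main__":
    for tok in to_deep_sketch(EXAMPLE):
        print(tok["lemma"], tok["deprel"], tok["feats"])
```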

Informal documentation

Open

List of explicitly allowed additional resources for the closed subtracks

Word2vec, including branches such as word2vecf
ELMo
BERT
GPT-2
polyglot
GloVe
UD parsers such as UUParser
The above models can be fine-tuned if needed using publicly available datasets such as WikiText and the DeepMind Q&A dataset

Example


Sample original CoNLL-U file for English: EN_original

Track 1 training sample for English (CoNLL-U): EN_Track1_conllu

Track 1 input (dev/test) sample for English (CoNLL-U): the alignments with the surface tokens are not provided. EN_Track1_conllu

Track 1 sample structure for English (graphic): EN_Track1

Track 2 training sample for English (CoNLL-U): EN_Track2_conllu

auxiliary

case

det

cop

Track 2 input (dev/test) sample for English (CoNLL-U): the alignments with the surface and Track 1 tokens are not provided. EN_Track2_conllu

Track 2 sample structure for English (graphic): EN_Track2

Evaluation


tokenised

detokenised

human evaluation

Graham et al., 2016

automatic evaluation

BLEU: precision metric that computes the geometric mean of the n-gram precisions between generated text and reference texts and adds a brevity penalty for shorter sentences. We use the smoothed version and report results for n = 1, 2, 3, 4;
NIST: related n-gram similarity metric weighted in favour of less frequent n-grams which are taken to be more informative;
BERTScore: token similarity metric using contextual embeddings;
Normalised edit distance: inverse, normalised, character-based string-edit distance that starts by computing the minimum number of character inserts, deletes and substitutions (all at cost 1) required to turn the system output into the (single) reference text.
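The last metric can be sketched in a few lines of Python. This is not the official evaluation script: the function names are hypothetical, and dividing by the length of the longer string before taking the inverse is an assumed normalisation, since the exact one is not spelled out here.

```python
def levenshtein(source, target):
    """Minimum number of character inserts, deletes and substitutions (cost 1 each)."""
    previous = list(range(len(target) + 1))
    for i, src_char in enumerate(source, start=1):
        current = [i]
        for j, tgt_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                            # delete src_char
                current[j - 1] + 1,                         # insert tgt_char
                previous[j - 1] + (src_char != tgt_char),   # substitute or match
            ))
        previous = current
    return previous[-1]


def dist_score(system_output, reference):
    """Inverse normalised edit distance: 1.0 for identical strings, lower otherwise."""
    longest = max(len(system_output), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are identical
    # Assumed normalisation: divide by the length of the longer string.
    return 1.0 - levenshtein(system_output, reference) / longest


if __name__ == "__main__":
    print(dist_score("the dog barked .", "the dogs barked ."))
```

With this normalisation the score always lies between 0 and 1, because the edit distance can never exceed the length of the longer string.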

Running the evaluation scripts

BLEU, NIST, DIST (Compatible with Python 2 and Python 3)
BERTScore (V.0.3.5, Python 3)

Create a virtual environment for Python: 'virtualenv nx'.
Activate the virtual environment: 'source nx/bin/activate'.
Install NLTK: 'pip install -U nltk'.
Command line: python eval_Py2.py [system-dir] [reference-dir].
Use eval_Py3.py instead of eval_Py2.py if you use Python 3.4 or above.

Install NLTK: 'pip install -U nltk'. In order to use pip, you may need to navigate to the directory that contains pip.exe through the command line before running the command.
Command line: eval_Py2.py [system-dir] [reference-dir].
Use eval_Py3.py instead of eval_Py2.py if you use Python 3.4 or above.

Submission


Submission of the results

msr.organizers@gmail.com

System description papers

Softconf START conference management system

Output specification

tokenised sample (Spanish dev example): #text = Elías Jaua , miembro del Congresillo , considera que los nuevos miembros del CNE deben tener experiencia para " dirigir procesos complejos " .
detokenised sample (Spanish dev example): #text = Elías Jaua, miembro del Congresillo, considera que los nuevos miembros del CNE deben tener experiencia para "dirigir procesos complejos".
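The difference between the two samples can be reproduced with a small rule-based detokeniser. The Python sketch below is only an illustration under simple assumptions: the function name and the punctuation sets are hypothetical, straight double quotes are assumed to alternate between opening and closing, and real systems may need language-specific rules.

```python
CLOSING = {",", ".", ";", ":", "!", "?", ")", "]"}  # attach to the previous token
OPENING = {"(", "[", "¿", "¡"}                      # attach to the next token


def detokenise(text):
    """Join whitespace-separated tokens, removing spaces around punctuation."""
    out = []
    quote_open = False
    attach_next = False  # True right after an opening quote or bracket
    for tok in text.split():
        if tok == '"':
            if quote_open:
                out[-1] += tok      # closing quote sticks to the previous token
                attach_next = False
            else:
                out.append(tok)     # opening quote starts a new chunk
                attach_next = True
            quote_open = not quote_open
            continue
        if out and (tok in CLOSING or attach_next):
            out[-1] += tok
        else:
            out.append(tok)
        attach_next = tok in OPENING
    return " ".join(out)


TOKENISED = ('Elías Jaua , miembro del Congresillo , considera que los nuevos '
             'miembros del CNE deben tener experiencia para " dirigir procesos '
             'complejos " .')

if __name__ == "__main__":
    print(detokenise(TOKENISED))
```

Applied to the tokenised Spanish sample, the function yields the detokenised sample given above.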

Number of outputs

History


first edition

MSR'18 workshop proceedings

second edition

MSR'19 workshop proceedings

Pilot SR'11

GenChal’11

Contact

msr.organizers@gmail.com

The shared task is organised by the Multilingual Surface Realisation workshop committee: Anya Belz, Bernd Bohnet, Thiago Castro Ferreira, Yvette Graham, Simon Mille and Leo Wanner.

References

Anja Belz, Michael White, Dominic Espinosa, Eric Kow, Deirdre Hogan, and Amanda Stent. 2011. The first surface realisation shared task: Overview and evaluation results. In Proceedings of the 13th European Workshop on Natural Language Generation, ENLG ’11, pages 217–226, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat and Justin Zobel. 2016. Can Machine Translation Systems be Evaluated by the Crowd Alone? In Journal of Natural Language Engineering (JNLE), Firstview.

Simon Mille, Anja Belz, Bernd Bohnet, Leo Wanner. 2018. Underspecified Universal Dependency Structures as Inputs for Multilingual Surface Realisation. In Proceedings of the 11th International Conference on Natural Language Generation, 199-209, Tilburg, The Netherlands.

Simon Mille, Anja Belz, Bernd Bohnet, Yvette Graham, Emily Pitler, Leo Wanner. 2018. The First Multilingual Surface Realisation Shared Task (SR'18): Overview and Evaluation Results. In Proceedings of the 1st Workshop on Multilingual Surface Realisation (MSR), 56th Annual Meeting of the Association for Computational Linguistics (ACL), 1-12, Melbourne, Australia.

Simon Mille, Anja Belz, Bernd Bohnet, Yvette Graham, Leo Wanner. 2019. The Second Multilingual Surface Realisation Shared Task (SR'19): Overview and Evaluation Results. In Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR), 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1-17, Hong Kong, China.

Funding
