================================
DOCUMENTATION DATASET SRST 2018
================================

Author: Simon Mille

Foreword: This is a living document and may be updated from time to time. Thank you for your understanding :)

For the SRST'18, the Track 1 structures have simply been scrambled and the lemmas removed. The Track 2 (deep) dataset consists of predicate-argument structures obtained through the application of graph-transduction grammars to the original UD syntactic structures. The deep and surface structures are aligned node to node.

The first thing to know is that the deep input is a compromise between (i) correctness and (ii) adequacy in a generation setup. Indeed, the conversion of the UD structures into predicate-argument structures depends not only on the mapping process, but also on the availability of the information in the original annotation. UD structures were not designed for NLG applications, which is why it is a real challenge to use them in an NLG setup.

The main issue with the Track 2 structures is that they are underspecified, that is, some information is missing. The main reason is that one of our main objectives has been to remove as much as possible of the information that cannot be inferred from a deeper level of abstraction, such as an ontological representation. For instance, if it is not possible (or too risky) to predict an argument slot, we leave it undefined; if the annotation does not allow us to tell syntactic elements from meaningful words, so that we must choose between leaving too many syntactic elements and removing meaningful words, we choose to remove. This way, our deep representation is much closer to one actually used in a generation pipeline that starts from abstract data, and the tools trained on the present data could potentially be used in an NLG pipeline.

================
Track 2 features
================

All word features used in Track 2 are documented in http://universaldependencies.org/u/feat/index.html.
The relations are described below.

=================
Track 2 relations (and the possible corresponding universal dependencies)
=================

*A1: first argument
-------
csubj, nsubj, obl_agent

*A1INV: inverted first argument (the governor is the first argument of the dependent)
-------
acl, acl_relcl, advmod, det, det_predet, nummod

*A2: second argument
-------
csubj_pass, ccomp, nsubj_pass, obj, xcomp

*A2INV: inverted second argument (the governor is the second argument of the dependent)
-------
acl, acl_relcl, cc, cc_preconj, mark
And in case they are not removed: case, cop

*A3: third argument
-------
iobj

*A3INV: inverted third argument (the governor is the third argument of the dependent)
-------
acl, acl_relcl

*A4: fourth argument
-------
iobj

*ADDRESSEE: vocative
-------
vocative

*AM (underspecified): circumstantial or undefined argument
-------
advcl, amod, appos, discourse, dislocated, nmod, nmod_npmod, nmod_poss, nmod_tmod, obl, obl_npmod, obl_tmod
And in case they are not removed: clf, expl, expl_pass

*AUX: auxiliary (in case an aux could not be removed)
-------
aux, aux_pass

*DEP: undefined dependent
-------
dep
And in case they are not removed: punct

*LIST: list of elements (coordinations)
-------
conj, list

*NAME: part of a named entity or fixed construction
-------
compound, compound_prt, compound_svc, fixed, flat, flat_foreign, flat_name

*PARATAXIS
-------
parataxis

====================
Particular phenomena (to be completed)
====================

Alignment between T1 and T2 nodes
---------------------------------
On the T2 nodes, we use one or more features id1, id2, etc., whose values are the line numbers of the corresponding T1 nodes: on a T2 node, id1=4|id2=15 means that this T2 node is aligned with the T1 nodes on lines 4 and 15 of the corresponding T1 structure. Only elements triggered by other elements (as opposed to being triggered by the structure of the sentence) are aligned with deep nodes.
That is, a subcategorized preposition is aligned with a deep node, while a void copula or an expletive subject is not.

Argumental relations
--------------------
Each defined argumental relation is unique for each predicate: there cannot be two arguments with the same slot for one predicate. If a predicate has an A2 dependent, it cannot have another A2 dependent, and it cannot be A2INV of another predicate.

Auxiliaries
-----------
Auxiliaries are mapped to the universal feature "Aspect".

Conjunctions/prepositions
-------------------------
The prepositions and conjunctions maintained in the T2 representation can be found under an A2INV dependency. A dependency path Gov-AM-> Dep-A2INV-> Prep is equivalent to a predicate (the conjunction/preposition) with 2 arguments: Gov <-A1-Prep-A2-> Dep.

Modals
------
They are mapped to the universal feature "Mood".

Pronouns
--------
- Relative: only subject and object relative pronouns directly linked to the main relative verb are removed from the T2 structure.
- Subject: a dummy subject pronoun node is added if an originally finite verb has no first argument and no available argument to build a passive; for a pro-drop language such as Spanish, a dummy pronoun is added if the first argument is missing.

Punctuation
------------
Only the final punctuation marks are encoded in the T2 representations: the main node of a sentence indicates, with the feature "clause_type", whether the sentence is declarative, interrogative, exclamative, suspensive, or whether it is involved in a parataxis.

MAPPING DETAILS
===============

*acl
-------
- def: clausal modifier of noun (mostly finite or non-finite verb).
- variants: *acl, *acl_relcl.
- PROBLEM: we don't know when the governor is an argument of the dependent or not.
- PROBLEM: we don't know if the VFin below is a relative clause or a simple complement of the node above.
- in case of gerund/adjective: LABEL A1INV.
- in case of past participle: LABEL A2INV.
- in case of relative clauses without pronoun, GUESS the argument number.
- LABEL AM in other cases.

*advcl
----------
- def: adverbial clause modifier.
- variants: *advcl.
- Ideally, attach to the conjunction, but since we remove them, LABEL AM.
- The sequence "X-AM-> Y-A2INV-> Conj" can be interpreted as "X <-A1-Conj-A2-> Y".

*advmod
---------
- def: an adverbial modifier of a word is a (non-clausal) adverb or adverbial phrase that serves to modify a predicate or a modifier word.
- variants: *advmod.
- LABEL A1INV.

*amod
---------
- def: adjectival modifier.
- variants: *amod.
- PROBLEM: we don't know when the governor is an argument of the dependent or not.
- LABEL A1INV if dependent is an adjective, LABEL AM otherwise.

*appos
----------
- def: appositional modifier (reversible dependency between two nominal groups).
- variants: *appos.
- PROBLEM: we don't know when the dependent is an argument or not (as in most annotation schemes for appositions).
- LABEL AM by default.

*aux
--------
- def: auxiliary.
- variants: *aux.
- REPLACE by the appropriate attribute/value pair.

*case
---------
- def: case marking.
- variants: *case.
- PROBLEM: the relation above the noun indicates only in some cases whether the case-marking preposition is functional or not.
- REMOVE ALL to see if generators are able to produce them.
- This could be improved using a subcategorization lexicon.

*cc
-------
- def: coordinating conjunction.
- variants: *cc, *cc_preconj.
- between the last conjunct and the conjunction.
- LABEL A2INV.

*ccomp
----------
- def: clausal complement (a dependent clause which is a core argument).
- variants: *ccomp.
- LABEL A2.

*clf
--------
- def: classifier.
- variants: *clf.
- seems to be some kind of function word that indicates the class of a noun (human, tree, etc.), which is needed in some contexts, such as quantification (three [human-classifier] student).
- REMOVE.

*compound
-------------
- def: compound (MWE).
- variants: *compound, *compound_prt, *compound_svc.
- LABEL NAME.

*conj
---------
- def: conjunct (between the first conjunct of a coordination and each of the following conjuncts).
- variants: *conj.
- LABEL LIST.

*cop
--------
- def: copula (linking a nonverbal predicate to the copula).
- variants: *cop.
- REMOVE copulas.

*csubj
----------
- def: clausal subject.
- variants: *csubj, *csubj_pass.
- LABEL A1, A2 if passive.

*dep
--------
- def: unspecified dependency.
- variants: *dep.
- LABEL DEP.

*det
--------
- def: determiner.
- variants: *det, *det_predet.
- LABEL A1INV; REMOVE if article.

*discourse
--------------
- def: discourse element (e.g. interjections).
- variants: *discourse.
- LABEL AM.

*dislocated
---------------
- def: dislocated elements.
- variants: *dislocated.
- LABEL AM.

*expl
---------
- def: expletive.
- variants: *expl, *expl_pass.
- REMOVE.

*fixed
----------
- def: fixed multiword expression (fixed grammaticized expressions that behave like function words or short adverbials).
- variants: *fixed.
- there can be more than one fixed dependency below one head.
- REMOVE dep when it is a preposition or a conjunction.
- LABEL NAME in other cases.

*flat
---------
- def: flat multiword expression (exocentric/headless constructions).
- variants: *flat, *flat_name, *flat_foreign.
- LABEL NAME.

*goeswith
-------------
- def: links parts of words that have been erroneously split.
- variants: *goeswith.
- REMOVE SENTENCES THAT CONTAIN THIS DEPREL.

*iobj
---------
- def: indirect object (nominal).
- variants: *iobj.
- PROBLEM: We cannot be sure of the argument slot; cf. the UD documentation about obj: "In general, if there is just one object, it should be labeled obj, regardless of the morphological case or semantic role. For example, in English, teach can take either the subject matter or the recipient as the only object, and in both cases it would be analyzed as the obj."
- PROBLEM: only applies to nominal dependents.
- PROBLEM: iobj can also stand for benefactives, which can be third arguments or above.
- LABEL A3 by default; RELABEL A4 or AM in case of multiple iobj.

*list
---------
- def: used for chains of comparable items.
- variants: *list.
- LABEL LIST.

*mark
---------
- def: marker (word introducing a finite clause subordinate to another clause).
- variants: *mark.
- the dependent seems to be removable, depending on whether the dependency above is argumental.
- REMOVE if group is in an argumental position (ccomp, csubj, xcomp).
- LABEL A2INV otherwise.

*nmod
---------
- def: nominal modifier.
- variants: *nmod, *nmod_poss, *nmod_npmod, *nmod_tmod.
- PROBLEM: we don't know if the dependent is an argument or not; a lexicon would be needed in order to distinguish.
- LABEL AM by default.

*nsubj
----------
- def: nominal subject.
- variants: *nsubj, *nsubj_pass.
- CAREFUL: also used in cases like "We are in the barn" (barn nsubj-> we).
- LABEL A1, A2 if passive.

*nummod
-----------
- def: numeric modifier.
- variants: *nummod.
- LABEL A1INV.

*obj
--------
- def: object.
- variants: *obj.
- PROBLEM: We cannot be sure of the argument slot; cf. the UD documentation about obj: "In general, if there is just one object, it should be labeled obj, regardless of the morphological case or semantic role. For example, in English, teach can take either the subject matter or the recipient as the only object, and in both cases it would be analyzed as the obj."
- LABEL A2.

*obl
--------
- def: oblique nominal (nominal element functioning as non-core/adjunct).
- variants: *obl, *obl_agent, *obl_npmod, *obl_tmod.
- PROBLEM: some annotation examples show that it sometimes applies to arguments (which seems to contradict the definition), i.e., for an iobj introduced by a preposition; a lexicon would be needed in order to distinguish.
- LABEL AM by default; obl_agent: LABEL A1.

*orphan
-----------
- def: used in cases of head ellipsis where simple promotion would result in unnatural and misleading dependency relations.
- variants: *orphan.
- REMOVE SENTENCES THAT CONTAIN THIS DEPREL.

*parataxis
--------------
- def: sentential parenthetical or a clause after a “:” or a “;”.
- variants: *parataxis.
- LABEL PARATAXIS.

*punct
----------
- def: punctuation.
- variants: *punct.
- REMOVE.

*reparandum
---------------
- def: disfluencies overridden in a speech repair.
- variants: *reparandum.
- REMOVE SENTENCES THAT CONTAIN THIS DEPREL.

*root
---------
- def: root.
- variants: *root.
- REMOVE.

*vocative
-------------
- def: used to mark a dialogue participant addressed in a text.
- variants: *vocative.
- LABEL ADDRESSEE.

*xcomp
----------
- def: open clausal complement (a dependent clause which is a core argument).
- variants: *xcomp.
- applies if the subject of the clausal complement is controlled (i.e., is the same as the higher subject or object).
- LABEL A2.
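
ILLUSTRATIVE SKETCH
===================
Three of the conventions described under "Particular phenomena" (the id1/id2 alignment features, the uniqueness of argument slots, and the equivalence Gov-AM-> Dep-A2INV-> Prep = Gov <-A1-Prep-A2-> Dep) can be sketched in Python as follows. This is only a minimal illustration and is not part of the grammars used to build the dataset. The function names, the representation of a structure as (governor, label, dependent) triples, and the assumption that features form a "|"-separated list of name=value pairs are all hypothetical choices made for the example.

```python
import re
from collections import defaultdict


def aligned_t1_lines(feats):
    """Return the T1 line numbers given by id1, id2, ... in a feature string.

    Assumes (hypothetically) a '|'-separated list of name=value pairs,
    as in the example id1=4|id2=15.
    """
    lines = []
    for item in feats.split("|"):
        name, _, value = item.partition("=")
        if re.fullmatch(r"id\d+", name):
            lines.append(int(value))
    return lines


def duplicate_argument_slots(edges):
    """Return the (predicate, slot) pairs that are filled more than once.

    edges is an iterable of (governor, label, dependent) triples.
    For an inverted relation such as A2INV, the dependent is the
    predicate and the governor fills the slot. AM is underspecified
    and therefore not checked.
    """
    counts = defaultdict(int)
    for gov, label, dep in edges:
        if re.fullmatch(r"A\d+", label):
            counts[(gov, label)] += 1
        elif re.fullmatch(r"A\d+INV", label):
            counts[(dep, label[:-3])] += 1
    return {key for key, n in counts.items() if n > 1}


def unfold_functional_words(edges):
    """Rewrite Gov -AM-> Dep -A2INV-> Prep as Prep -A1-> Gov, Prep -A2-> Dep.

    Sketch only: assumes each node has at most one A2INV dependent.
    Edges that are not part of such a path are kept unchanged.
    """
    edges = list(edges)
    a2inv = {gov: dep for gov, label, dep in edges if label == "A2INV"}
    rewritten, consumed = [], set()
    for gov, label, dep in edges:
        if label == "AM" and dep in a2inv:
            func = a2inv[dep]
            rewritten.append((func, "A1", gov))
            rewritten.append((func, "A2", dep))
            consumed.add((gov, "AM", dep))
            consumed.add((dep, "A2INV", func))
    return rewritten + [e for e in edges if e not in consumed]
```

For instance, aligned_t1_lines("id1=4|id2=15") yields the line numbers 4 and 15, and duplicate_argument_slots reports a predicate that has an A2 dependent while also being the A2INV dependent of another node.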