Package edu.upf.taln.dri.lib.model.util
Class DocParse
java.lang.Object
    edu.upf.taln.dri.lib.model.util.DocParse
public class DocParse extends Object
Collection of utility methods to extract specific data from document annotations.
Constructor Summary
Constructors
Constructor    Description
DocParse()
Method Summary
Modifier and Type    Method    Description
static String
getDocumentROSasCSVstring(Document doc, SentenceSelectorENUM sentenceSelector)
This method generates "NODE -> SUBJECT -> VERB_NODE", "NODE -> OBJECT -> VERB_NODE" and "CAUSE_NODE -> CAUSE -> EFFECT_NODE" triples.
Because of coreference resolution, the non-verb NODES (subjects and objects of SUBJECT / OBJECT triples) can represent the aggregation, performed by means of a coreference resolver, of several nodes coming from several sentences.
When a non-verb NODE is the result of the aggregation of nodes by coreference resolution, the text / name of the NODE is the sequence of the texts / names of the aggregated nodes, separated by three underscores (name1___name2___name3).
static String
getSentencesCSVstring(Document doc, SentenceSelectorENUM sentenceSelector)
This method will return a CSV with a number of rows equal to the number of sentences in the document.
The first CSV column is the document-unambiguous ID of the sentence as referred to by the field NODE_SENT_ID of the ROS CSV generated by the method getDocumentROSasCSVstring(edu.upf.taln.dri.lib.model.Document doc, SentenceSelectorENUM sentenceSelector) of this class.
The second column is the space-separated list of tokens of the sentence.
The third column is the Rhetorical Class of the sentence.
The fourth column is the name of the section that contains the sentence (if any).
The fifth column is the nesting level of the section that contains the sentence (if any).
static String
getTokenExpressionCSVstring(Document doc, SentenceSelectorENUM sentenceSelector)
This method will return a CSV with
static String
getTokenROSasCSVstring(Document doc, SentenceSelectorENUM sentenceSelector, DocGraphTypeENUM graphType)
Deprecated. IMPORTANT: this method is deprecated in favor of the method getDocumentROSasCSVstring, which generates a more compact and useful ROS.
static gate.Document
sanitize(gate.Document doc)
Method Detail
getTokenROSasCSVstring
@Deprecated public static String getTokenROSasCSVstring(Document doc, SentenceSelectorENUM sentenceSelector, DocGraphTypeENUM graphType) throws InternalProcessingException
Deprecated. IMPORTANT: this method is deprecated in favor of the method getDocumentROSasCSVstring, which generates a more compact and useful ROS.
Gets the CSV representation of the sentence graphs of a DRI Document object by extracting from the document a CSV table with the following column-tab-separated row format to support ROS population:
SENTENCE_ID
START_NODE_ID START_NODE_WORD START_NODE_LEMMA START_NODE_POS START_NODE_CHAIN_ID START_NODE_COREF_WORD
END_NODE_ID END_NODE_WORD END_NODE_LEMMA END_NODE_POS END_NODE_CHAIN_ID END_NODE_COREF_WORD
EDGE_TYPE
SENT_RHETL_CLASS SENT_SECTION_NAME SENT_ROOT_SECTION_NAME SENT_DOC_POSITION SENT_NUM_CITS
The following features are equal for all the triples of a sentence:
- SENT_RHETL_CLASS: the rhetorical class assigned to the sentence
- SENT_SECTION_NAME: the name of the section the sentence belongs to
- SENT_ROOT_SECTION_NAME: the name of the root section the sentence belongs to (if any, in case SENT_SECTION_NAME is a sub-section name)
- SENT_DOC_POSITION: the position of the sentence inside the document - number in the interval [0,1]
- SENT_NUM_CITS: the number of citations that the sentence includes
SENTENCE_ID, START_NODE_ID, END_NODE_ID and CHAIN_ID are identifiers that are unique within the CSV.
Important: you need to enable the following modules to correctly extract ROS graphs: GraphParsing, CoreferenceResolution, CausalityParsing
--- Coreferences ---
Four columns identify the coreference chain a node belongs to and the group of nodes that together form an element of the coreference chain:
START_NODE_CHAIN_ID, START_NODE_COREF_WORD, END_NODE_CHAIN_ID, END_NODE_COREF_WORD
Each triple / row of the CSV identifies the relation between two nodes by means of a specific property (SBJ, OBJ, CAUSE, etc.).
Each element of a coreference chain can span one or more nodes. As a consequence, an element of a coreference chain such as "the sunny sky" is made of three nodes: the, sunny, sky.
Each of these three nodes (the, sunny, sky) has its own ID (NODE_ID column), independently of its role in a triple.
All the nodes belonging to a coreference chain will have both the NODE_CHAIN_ID and the NODE_COREF_WORD columns with values different from 'NONE'.
In particular, for each node belonging to a coreference chain:
- the column NODE_CHAIN_ID identifies, with an integer ID, the coreference chain the node belongs to. As a consequence, all the nodes belonging to the same coreference chain will share the same integer ID in their NODE_CHAIN_ID column;
- the column NODE_COREF_WORD has a value different from "NONE". This column holds the real name of the node in the coreference chain, which usually includes surrounding words.
For instance, if we have the following element of a coreference chain: "the sunny sky", such an element will span three nodes in the CSV: 'the', 'sunny', 'sky'. In this case, one of these three nodes ('sky', the head node of the coreference chain element) will have both its NODE_CHAIN_ID and its NODE_COREF_WORD values different from "NONE". In particular, the NODE_CHAIN_ID value of the 'sky' node will be equal to the integer that identifies all the nodes of that coreference chain, while the NODE_COREF_WORD of the 'sky' node will be equal to "the sunny sky", that is, the complete name of that element of the coreference chain.
The other two nodes, 'the' and 'sunny' (non-head nodes of the coreference chain element), will both have their NODE_CHAIN_ID and NODE_COREF_WORD values equal to "NONE".
Example 1:
"1","18","Karla","karla","NNP","542","Karla","28","lived","live","VBD","NONE","NONE","SBJ","STILL_NOT_EXECUTED_RHETORICAL_CLASSIFICATION","","","0.125","0"
"3","80","that","that","WDT","NONE","NONE","82","had","have","VBD","NONE","NONE","SBJ","STILL_NOT_EXECUTED_RHETORICAL_CLASSIFICATION","","","0.25","0"
"3","52","she","she","PRP","542","she","54","saw","saw","VBD","NONE","NONE","SBJ","STILL_NOT_EXECUTED_RHETORICAL_CLASSIFICATION","","","0.25","0"
The first row represents the triple "Karla" --> SBJ --> "lived". The node "Karla" (with id 18) belongs to the coreference chain with id "542".
The third row represents the triple "she" --> SBJ --> "saw". The node "she" (with id 52) belongs to the coreference chain with id "542", which is the same coreference chain as that of the node "Karla".
The nodes "Karla" (with id 18) and "she" (with id 52) are coreferent: they refer to the same entity and can be merged into a single node.
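The merge described in Example 1 can be sketched as follows. This is only an illustration: the class CorefChainGrouping and its methods are hypothetical and not part of DocParse, the column positions are taken from the row format listed above (the thirteenth column is read as the coreference word of the end node), and the row splitting assumes that every field is double-quoted and that no field contains the character sequence `","`.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class CorefChainGrouping {
    // Zero-based column indexes, following the row format documented above.
    static final int START_NODE_CHAIN_ID = 5;
    static final int START_NODE_COREF_WORD = 6;
    static final int END_NODE_CHAIN_ID = 11;
    static final int END_NODE_COREF_WORD = 12;

    // The three rows of Example 1, copied from the documentation.
    static final List<String> EXAMPLE_ROWS = Arrays.asList(
        "\"1\",\"18\",\"Karla\",\"karla\",\"NNP\",\"542\",\"Karla\","
            + "\"28\",\"lived\",\"live\",\"VBD\",\"NONE\",\"NONE\",\"SBJ\","
            + "\"STILL_NOT_EXECUTED_RHETORICAL_CLASSIFICATION\",\"\",\"\",\"0.125\",\"0\"",
        "\"3\",\"80\",\"that\",\"that\",\"WDT\",\"NONE\",\"NONE\","
            + "\"82\",\"had\",\"have\",\"VBD\",\"NONE\",\"NONE\",\"SBJ\","
            + "\"STILL_NOT_EXECUTED_RHETORICAL_CLASSIFICATION\",\"\",\"\",\"0.25\",\"0\"",
        "\"3\",\"52\",\"she\",\"she\",\"PRP\",\"542\",\"she\","
            + "\"54\",\"saw\",\"saw\",\"VBD\",\"NONE\",\"NONE\",\"SBJ\","
            + "\"STILL_NOT_EXECUTED_RHETORICAL_CLASSIFICATION\",\"\",\"\",\"0.25\",\"0\"");

    // Assumption: all fields are double-quoted and none contains the sequence ",".
    static String[] splitRow(String row) {
        return row.substring(1, row.length() - 1).split("\",\"", -1);
    }

    // Maps each chain ID to the coreference names of the nodes that carry it.
    static Map<String, Set<String>> groupByChain(List<String> rows) {
        Map<String, Set<String>> chains = new TreeMap<>();
        for (String row : rows) {
            String[] f = splitRow(row);
            collect(chains, f[START_NODE_CHAIN_ID], f[START_NODE_COREF_WORD]);
            collect(chains, f[END_NODE_CHAIN_ID], f[END_NODE_COREF_WORD]);
        }
        return chains;
    }

    // Nodes outside any coreference chain carry the value NONE and are skipped.
    private static void collect(Map<String, Set<String>> chains, String chainId, String name) {
        if (!"NONE".equals(chainId)) {
            chains.computeIfAbsent(chainId, k -> new LinkedHashSet<>()).add(name);
        }
    }

    public static void main(String[] args) {
        System.out.println(groupByChain(EXAMPLE_ROWS)); // prints {542=[Karla, she]}
    }
}
```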
Example 2:
"11","173","hunter","hunter","NN","533","the hunter","175","wanted","want","VBD","NONE","NONE","SBJ","STILL_NOT_EXECUTED_RHETORICAL_CLASSIFICATION","","","0.75","0"
This row represents the triple "hunter" --> SBJ --> "wanted". The node "hunter" (with id 173) belongs to the coreference chain with id "533".
The actual name of this node in the coreference chain is "the hunter" (as we can see from the seventh column, START_NODE_COREF_WORD), since the coreference chain element headed by the node "hunter" spans both this node and the node "the".
Parameters:
doc -
sentenceSelector -
graphType -
Returns:
Throws:
InternalProcessingException
getDocumentROSasCSVstring
public static String getDocumentROSasCSVstring(Document doc, SentenceSelectorENUM sentenceSelector)
This method generates "NODE -> SUBJECT -> VERB_NODE", "NODE -> OBJECT -> VERB_NODE" and "CAUSE_NODE -> CAUSE -> EFFECT_NODE" triples.
Because of coreference resolution, the non-verb NODES (subjects and objects of SUBJECT / OBJECT triples) can represent the aggregation, performed by means of a coreference resolver, of several nodes coming from several sentences.
When a non-verb NODE is the result of the aggregation of nodes by coreference resolution, the text / name of the NODE is the sequence of the texts / names of the aggregated nodes, separated by three underscores (name1___name2___name3). For instance, a node with text / name 'she___Karla' aggregates all nodes with text 'she' or 'Karla' by means of the coreference resolver (both nodes 'she' and 'Karla' belong to the same coreference chain, thus they are aggregated into a single node).
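As a minimal sketch, the names merged into an aggregated node can be recovered by splitting the node name on the three-underscore separator. The class and method names below are hypothetical and not part of DocParse, and the sketch assumes that no individual node name itself contains three consecutive underscores.

```java
import java.util.Arrays;
import java.util.List;

class AggregatedNodeName {
    // Separator placed between the names of nodes merged by coreference resolution.
    static final String SEPARATOR = "___";

    // Returns the individual names packed into a (possibly aggregated) node name.
    static List<String> mergedNames(String nodeName) {
        return Arrays.asList(nodeName.split(SEPARATOR));
    }

    public static void main(String[] args) {
        System.out.println(mergedNames("she___Karla"));   // prints [she, Karla]
        System.out.println(mergedNames("relationships")); // prints [relationships]
    }
}
```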
Important: you need to enable the following modules to correctly extract ROS graphs: GraphParsing, CoreferenceResolution, CausalityParsing.
IMPORTANT: this method tries to extend the verb node and the subject or object nodes that do not belong to any coreference chain by incorporating useful information.
For instance, the sentence: "Karla tried to identify her friend." will generate the triples:
- "Karla" --> "tried to identify" --> "her friend"
where the three verbal tokens 'tried', 'to', 'identify' are merged as well as the two nominal tokens 'her' and 'friend'.
Each line of the document ROS CSV generated by this method has the following columns:
- EDGE_ID: integer, to identify the edge - unambiguous among the edge IDs of the document
- FROM_NODE_ID: unambiguous from node ID
- FROM_NODE_NAME: from node name
- FROM_NODE_HEAD_WORD: the head word of the node. When the NODE_NAME is a multiword expression, this field contains the head word of the multiword. When the NODE_NAME is a single word, the head word is equal to NODE_NAME. In general, for nodes that are not coreference / merged nodes, the head word consists of a single token / word. For coreference nodes, this field contains a comma-separated list of all the head words of the merged nodes.
- FROM_NODE_RHET_CLASSES: the rhetorical class of the sentence that includes the node. If a node is a coreference node, thus resulting from the aggregation of various co-referring nodes, this field is the comma-separated list of rhetorical classes, each one belonging to at least one of the co-referring nodes (e.g. "DRI_Approach, DRI_FutureWork")
- FROM_NODE_SENT_ID: the ID of the sentence this node belongs to (empty for coreference nodes, since they result from the merging of nodes belonging to one or more sentences)
- FROM_NODE_SENT_TOKENS: the position(s) of the tokens of this node, as a list of integers in which 0 denotes the first token, relative to the sentence in which the node occurs (empty for coreference nodes, which result from the merging of nodes belonging to different sentences)
- TO_NODE_ID: unambiguous to node ID
- TO_NODE_NAME: to node name
- TO_NODE_HEAD_WORD: the head word of the node. When the NODE_NAME is a multiword expression, this field contains the head word of the multiword. When the NODE_NAME is a single word, the head word is equal to NODE_NAME. In general, for nodes that are not coreference / merged nodes, the head word consists of a single token / word. For coreference nodes, this field contains a comma-separated list of all the head words of the merged nodes.
- TO_NODE_RHET_CLASSES: the rhetorical class of the sentence that includes the node. If a node is a coreference node, thus resulting from the aggregation of various co-referring nodes, this field is the comma-separated list of rhetorical classes, each one belonging to at least one of the co-referring nodes (e.g. "DRI_Approach, DRI_FutureWork")
- TO_NODE_SENT_ID: the ID of the sentence this node belongs to (empty for coreference nodes, since they result from the merging of nodes belonging to one or more sentences)
- TO_NODE_SENT_TOKENS: the position(s) of the tokens of this node, as a list of integers in which 0 denotes the first token, relative to the sentence in which the node occurs (empty for coreference nodes, which result from the merging of nodes belonging to different sentences)
- EDGE_TYPE: could be SBJ, OBJ or CAUSE
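The fourteen columns listed above can be read into a small value object, as in the following sketch. The class RosCsvRow, its nested Node class and the splitQuoted helper are hypothetical names introduced for illustration and are not part of DocParse. The parser assumes every field is double-quoted and that a literal quote inside a field, if one ever occurs, is written as two quotes.

```java
import java.util.ArrayList;
import java.util.List;

class RosCsvRow {
    // One endpoint of an edge: six consecutive columns of the row.
    static final class Node {
        final String id, name, headWords, rhetClasses, sentId, sentTokens;

        Node(List<String> f, int offset) {
            id = f.get(offset);
            name = f.get(offset + 1);
            headWords = f.get(offset + 2);
            rhetClasses = f.get(offset + 3);
            sentId = f.get(offset + 4);
            sentTokens = f.get(offset + 5);
        }

        // Per the column descriptions, NODE_SENT_ID is empty for coreference nodes.
        boolean isCoreferenceNode() {
            return sentId.isEmpty();
        }
    }

    final String edgeId;
    final Node from;
    final Node to;
    final String edgeType;

    private RosCsvRow(List<String> f) {
        edgeId = f.get(0);
        from = new Node(f, 1);
        to = new Node(f, 7);
        edgeType = f.get(13);
    }

    static RosCsvRow parse(String line) {
        List<String> fields = splitQuoted(line);
        if (fields.size() != 14) {
            throw new IllegalArgumentException("Expected 14 columns, found " + fields.size());
        }
        return new RosCsvRow(fields);
    }

    // Splits one CSV line, keeping commas that appear inside double-quoted fields.
    static List<String> splitQuoted(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    current.append(c);
                } else if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote stands for a literal quote
                    i++;
                } else {
                    inQuotes = false;
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        RosCsvRow row = parse("\"0\",\"117\",\"it___The proposed method\",\"method, it\","
            + "\"DRI_Approach\",\"\",\"\",\"25\",\"is general\",\"is\",\"DRI_Approach\","
            + "\"1\",\"11, 13\",\"SBJ\"");
        // prints it___The proposed method --SBJ--> is general
        System.out.println(row.from.name + " --" + row.edgeType + "--> " + row.to.name);
    }
}
```

A character-level splitter is used here instead of a plain split on commas because fields such as the head-word list ("method, it") and the token positions ("11, 13") contain commas themselves.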
EXAMPLE DOCUMENT TEXT:
The proposed method considers the relationships among rigid body parts and is more general since it can handle motions of close interactions with / without tangles .
GENERATED DOCUMENT ROS CSV:
"0","117","it___The proposed method","method, it","DRI_Approach","","","25","is general","is","DRI_Approach","1","11, 13","SBJ"
"1","117","it___The proposed method","method, it","DRI_Approach","","","9","considers","considers","DRI_Approach","1","3","SBJ"
"2","13","relationships","relationships","DRI_Approach","1","5","9","considers","considers","DRI_Approach","1","3","OBJ"
"3","39","motions","motions","DRI_Approach","1","18","35","can handle","can","DRI_Approach","1","16, 17","OBJ"
"4","117","it___The proposed method","method, it","DRI_Approach","","","35","can handle","can","DRI_Approach","1","16, 17","SBJ"
"5","35","can handle","can","DRI_Approach","1","16, 17","9","considers","considers","DRI_Approach","1","3","CAUSE"
GENERATED SENTENCE CSV:
"1","The proposed method considers the relationships among rigid body parts and is more general since it can handle motions of close interactions with / without tangles ."
COMMENTS ON THE GENERATED DOCUMENT ROS CSV:
The node with ID 117 is the result of the coreference of the nodes 'it' and 'The proposed method' - the name of this node separates the names of the merged nodes by three underscores (___).
Since the node with ID 117 is the result of a coreference / merging of two nodes, it has NODE_SENT_ID and NODE_SENT_TOKENS fields empty.
All the nodes belong to sentences with rhetorical category equal to "DRI_Approach" (indeed, all the nodes belong to the only sentence of the document, which is classified as DRI_Approach).
The node with name "can handle" (ID equal to "35") is made of two tokens, at positions 16 and 17 (NODE_SENT_TOKENS) of the sentence with id equal to 1 (NODE_SENT_ID).
If we consider the tokens of the EXAMPLE DOCUMENT TEXT, we can find at position 16 and 17 respectively the tokens 'can' and 'handle' that are the two tokens of the node with name "can handle" (ID equal to "35"). Remember that the first token of the sentence has position equal to 0.
The head words of the node with ID 117, resulting from the merging of the coreferent nodes 'it' and 'The proposed method', are 'method' and 'it'.
The head word of the node with ID 25 ("is general") is the word 'is'.
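The lookup of NODE_SENT_TOKENS positions described in these comments can be sketched as follows. The class NodeTokenLookup and its method are hypothetical and not part of DocParse; the sketch assumes the sentence is given as the space-separated token list and the positions as a comma-separated list of zero-based indexes, as in the example rows.

```java
class NodeTokenLookup {
    // Space-separated tokens of the example sentence (sentence with id 1).
    static final String EXAMPLE_SENTENCE =
        "The proposed method considers the relationships among rigid body parts "
            + "and is more general since it can handle motions of close interactions "
            + "with / without tangles .";

    // Returns the tokens of a sentence found at the given zero-based positions.
    // positions is a NODE_SENT_TOKENS value such as "16, 17"; it is empty for
    // coreference nodes, in which case an empty string is returned.
    static String tokensAt(String sentenceTokens, String positions) {
        if (positions.trim().isEmpty()) {
            return "";
        }
        String[] tokens = sentenceTokens.split(" ");
        StringBuilder text = new StringBuilder();
        for (String position : positions.split(",")) {
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(tokens[Integer.parseInt(position.trim())]);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(tokensAt(EXAMPLE_SENTENCE, "16, 17")); // prints can handle
        System.out.println(tokensAt(EXAMPLE_SENTENCE, "11, 13")); // prints is general
    }
}
```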
To associate a SENT_ID (ID of the sentence) with the actual contents of the sentence (list of tokens separated by spaces), we can use the method getSentencesCSVstring of this class.
This method returns a CSV whose first column is the ID of the sentence as referred to by the field NODE_SENT_ID of the ROS CSV and whose second column is the space-separated list of tokens of the sentence.
In the example above, the CSV returned by the method getSentencesCSVstring is (only one sentence with id equal to 1):
"1","The proposed method considers the relationships among rigid body parts and is more general since it can handle motions of close interactions with / without tangles .","DRI_Approach","","","false"
**********************
A more complex example:
EXAMPLE DOCUMENT TEXT:
Since kinematic constraints can usually be represented by single equations, they can be easily embedded into optimization problems for motion synthesis.
However, extension to motions involving character shapes seems difficult since relationships between rigid bodies or surfaces need to be encoded.
Further, these methods cannot handle close interactions without any tangles.
The proposed method considers the relationships among rigid body parts and is more general since it can handle motions of close interactions with/without tangles.
GENERATED DOCUMENT ROS CSV:
"0","349","they___kinematic constraints","constraints, they","DRI_Approach","","","15","can be represented","can","DRI_Approach","1","3, 5, 6","SBJ"
"1","15","can be represented","can","DRI_Approach","1","3, 5, 6","32","can be embedded","can","DRI_Approach","1","12, 13, 15","CAUSE"
"2","349","they___kinematic constraints","constraints, they","DRI_Approach","","","32","can be embedded","can","DRI_Approach","1","12, 13, 15","SBJ"
"3","57","extension","extension","DRI_Challenge","3","2","69","However seems difficult","seems","DRI_Challenge","3","0, 8, 9","SBJ"
"4","67","shapes","shapes","DRI_Challenge","3","7","63","involving","involving","DRI_Challenge","3","5","OBJ"
"5","87","need to be encoded","need","DRI_Challenge","3","17, 18, 19, 20","69","However seems difficult","seems","DRI_Challenge","3","0, 8, 9","CAUSE"
"6","75","relationships","relationships","DRI_Challenge","3","11","87","need to be encoded","need","DRI_Challenge","3","17, 18, 19, 20","SBJ"
"7","101","methods","methods","DRI_Approach","5","3","168","Further can not handle","can","DRI_Approach","5","0, 4, 5, 6","SBJ"
"8","351","close interactions","interactions","DRI_Approach","","","168","Further can not handle","can","DRI_Approach","5","0, 4, 5, 6","OBJ"
"9","353","it___The proposed method","method, it","DRI_Approach","","","140","is general","is","DRI_Approach","7","11, 13","SBJ"
"10","353","it___The proposed method","method, it","DRI_Approach","","","124","considers","considers","DRI_Approach","7","3","SBJ"
"11","128","relationships","relationships","DRI_Approach","7","5","124","considers","considers","DRI_Approach","7","3","OBJ"
"12","154","motions","motions","DRI_Approach","7","18","150","can handle","can","DRI_Approach","7","16, 17","OBJ"
"13","353","it___The proposed method","method, it","DRI_Approach","","","150","can handle","can","DRI_Approach","7","16, 17","SBJ"
"14","150","can handle","can","DRI_Approach","7","16, 17","124","considers","considers","DRI_Approach","7","3","CAUSE"
GENERATED SENTENCE CSV:
"1","Since kinematic constraints can usually be represented by single equations , they can be easily embedded into optimization problems for motion synthesis .","DRI_Approach","","","false"
"3","However , extension to motions involving character shapes seems difficult since relationships between rigid bodies or surfaces need to be encoded .","DRI_Challenge","","","false"
"5","Further , these methods can not handle close interactions without any tangles .","DRI_Approach","","","false"
"7","The proposed method considers the relationships among rigid body parts and is more general since it can handle motions of close interactions with / without tangles .","DRI_Approach","","","false"
Parameters:
doc -
sentenceSelector -
Returns:
getSentencesCSVstring
public static String getSentencesCSVstring(Document doc, SentenceSelectorENUM sentenceSelector)
This method will return a CSV with a number of rows equal to the number of sentences in the document.
The first CSV column is the document-unambiguous ID of the sentence as referred to by the field NODE_SENT_ID of the ROS CSV generated by the method getDocumentROSasCSVstring(edu.upf.taln.dri.lib.model.Document doc, SentenceSelectorENUM sentenceSelector) of this class.
The second column is the space separated list of tokens of the sentence.
The third column is the Rhetorical Class of the sentence.
The fourth column is the name of the section that contains the sentence (if any).
The fifth column is the nesting level of the section that contains the sentence (if any). 1 is a first-level section, 2 is a section nested in a first level section and so on.
The sixth column is "true" if the sentence contains one or more citations, otherwise "false".
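The six columns described above could be read as in the following sketch. The class SentenceCsvRow is a hypothetical helper, not part of DocParse, and the splitting assumes that every field is double-quoted and that no field contains the character sequence `","`.

```java
class SentenceCsvRow {
    final String sentenceId;
    final String[] tokens;
    final String rhetoricalClass;
    final String sectionName;   // empty when the sentence is not inside a section
    final String sectionLevel;  // empty when the sentence is not inside a section
    final boolean hasCitations;

    private SentenceCsvRow(String[] f) {
        sentenceId = f[0];
        tokens = f[1].split(" ");
        rhetoricalClass = f[2];
        sectionName = f[3];
        sectionLevel = f[4];
        hasCitations = Boolean.parseBoolean(f[5]);
    }

    static SentenceCsvRow parse(String line) {
        // Drop the outer quotes, then split on the quote-comma-quote delimiter.
        // The limit -1 keeps empty fields such as a missing section name.
        String[] f = line.substring(1, line.length() - 1).split("\",\"", -1);
        if (f.length != 6) {
            throw new IllegalArgumentException("Expected 6 columns, found " + f.length);
        }
        return new SentenceCsvRow(f);
    }

    public static void main(String[] args) {
        SentenceCsvRow row = parse("\"1\",\"The proposed method considers the relationships "
            + "among rigid body parts and is more general since it can handle motions of "
            + "close interactions with / without tangles .\",\"DRI_Approach\",\"\",\"\",\"false\"");
        // prints 1 27 false
        System.out.println(row.sentenceId + " " + row.tokens.length + " " + row.hasCitations);
    }
}
```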
If the processed document consisted of a single sentence, the CSV returned would be (only one sentence with id equal to 1):
"1","The proposed method considers the relationships among rigid body parts and is more general since it can handle motions of close interactions with / without tangles .","DRI_Approach","","","false"
Parameters:
doc -
sentenceSelector -
Returns:
getTokenExpressionCSVstring
public static String getTokenExpressionCSVstring(Document doc, SentenceSelectorENUM sentenceSelector)
This method will return a CSV with
Parameters:
doc -
sentenceSelector -
Returns:
sanitize
public static gate.Document sanitize(gate.Document doc)