Package edu.upf.taln.dri.lib.model
Class DocumentImpl
java.lang.Object
  edu.upf.taln.dri.lib.model.DocumentImpl

All Implemented Interfaces:
Document
public class DocumentImpl extends Object implements Document
IMPORTANT: Never instantiate this class directly!
To obtain a Document instance through the Document interface, always use one of the Factory methods:
- Factory.createNewDocument()
- Factory.createNewDocument(String absoluteFilePath)
- Factory.createNewDocument(File file)
- Factory.getEmptyDocument()

Field Summary
Modifier and Type, Field:
- DocCacheManager cacheManager
- protected static gate.CorpusController corpusController_preprocess_XGAPPpreprocStep1
- protected static gate.CorpusController corpusController_preprocess_XGAPPpreprocStep2
- protected static gate.CorpusController corpusController_XGAPPcausality
- protected static gate.CorpusController corpusController_XGAPPcitMarker
- protected static gate.CorpusController corpusController_XGAPPcorefMentionSpot
- protected static gate.CorpusController corpusController_XGAPPheader
- protected static gate.CorpusController corpusController_XGAPPmetaAnnotator
- protected static LexRankSummarizer LexRankSummarizer_Resource
- protected static TitleSimSummarizer TitleSimSummarizer_Resource

Constructor Summary
Constructors:
- DocumentImpl()
- DocumentImpl(gate.Document gateDoc)

Method Summary
Modifier and Type, Method, Description:
- void cleanUp()
  Call this method only WHEN YOU ARE SURE YOU WILL NO LONGER USE THE DOCUMENT.
- List<Citation> extractCitations()
  Get the list of citations extracted from the document.
- DependencyGraph extractDocumentGraph(SentenceSelectorENUM sentenceSel)
  Get the graph representing a portion of a document.
- Header extractHeader()
  Extract the information retrieved by parsing the header of the paper.
- List<Section> extractSections(Boolean onlyRoot)
  Get the sections (or a subset of the sections) of the document.
- Sentence extractSentenceById(int sentenceId)
  Get one of the sentences of the document by id.
- DependencyGraph extractSentenceGraph(int sentenceId, SentGraphTypeENUM graphType)
  Get the graph representing a sentence.
- List<Sentence> extractSentences(SentenceSelectorENUM sentenceSel)
  Load the list of sentences of the document, ordered by their occurrence in the document.
- List<Sentence> extractSummary(int sentNumber, SummaryTypeENUM summaryType)
  Generate a summary of the paper by selecting a relevant set of sentences.
- List<CandidateTermOcc> extractTerminology()
  Load the list of terms extracted from the document.
- String getName()
  Get the name of the document.
- String getRawText()
  Get the raw text of the document (UTF-8 encoded).
- SourceENUM getSourceDocumentType()
  Get the original document type from which the Document instance has been created.
- Document getXMLDocument()
  Get the contents of the document as an instance of org.w3c.dom.Document.
- String getXMLString()
  Get the XML string-serialized contents of the document, as a string (UTF-8 char encoding).
- static void initDocPointers(ImporterPDFX ref1, ImporterPDFEXT ref2, ImporterGROBID ref3, ImporterJATS ref4, Map<LangENUM,MateParser> ref5, RhetoricalClassifier ref6, TermAnnotator ref7, MetaAnnotator ref8, LanguageDetector ref9, InlineCitationSpotter ref10, CitationLinker ref11, BiblioEntryParser ref12, HeaderAnalyzer ref13, CorefChainBuilder ref14, BabelnetAnnotator ref15, LexRankSummarizer ref16, TitleSimSummarizer ref17, gate.CorpusController ref18, gate.CorpusController ref19, gate.CorpusController ref20, gate.CorpusController ref21, gate.CorpusController ref22, gate.CorpusController ref23, gate.CorpusController ref24)
- boolean isCleanUp()
  Check if the document data structures have been cleaned by calling the cleanUp() method.
- void loadXML(File file)
  Load the XML string-serialized contents of the document (UTF-8) from a file.
- void loadXML(String absoluteFilePath)
  Load the XML string-serialized contents of the document (UTF-8) from a file, by specifying the file's absolute path.
- void loadXMLString(String XMLStringContents)
  Load the XML string-serialized contents of the document from a string (UTF-8 char encoding).
- void parsingBabelNet(boolean force)
- void parsingCausality(boolean force)
- void parsingCitations_Enrich(boolean force)
- void parsingCitations_Link(boolean force)
- void parsingCitations_Spot(boolean force)
- void parsingCoref(boolean force)
- void parsingDep(boolean force)
- void parsingHeader(boolean force)
- void parsingMetaAnnotations(boolean force)
- void parsingRhetoricalClass(boolean force)
- void parsingSentences(boolean force)
- void parsingSummary(boolean force)
- void parsingTerminology(boolean force)
- void preprocess()
  Pre-compute the text analysis of the document in order to speed up the execution of the extract methods.
- void resetBabelNet()
- void resetCausality()
- void resetCitations_Enrich()
- void resetCitations_Link()
- void resetCitations_Spot()
- void resetCoref()
- void resetDep()
- void resetDocumentExtractionData()
  This method deletes all the data extracted from the original document, including sentences, terminology, citations, etc.
- void resetHeader()
- void resetMetaAnnotations()
- void resetRhetoricalClass()
- void resetSummary()
- void resetTerminology()

Field Detail
-
cacheManager
public DocCacheManager cacheManager
-
LexRankSummarizer_Resource
protected static LexRankSummarizer LexRankSummarizer_Resource
-
TitleSimSummarizer_Resource
protected static TitleSimSummarizer TitleSimSummarizer_Resource
-
corpusController_preprocess_XGAPPpreprocStep1
protected static gate.CorpusController corpusController_preprocess_XGAPPpreprocStep1
-
corpusController_preprocess_XGAPPpreprocStep2
protected static gate.CorpusController corpusController_preprocess_XGAPPpreprocStep2
-
corpusController_XGAPPheader
protected static gate.CorpusController corpusController_XGAPPheader
-
corpusController_XGAPPcitMarker
protected static gate.CorpusController corpusController_XGAPPcitMarker
-
corpusController_XGAPPcorefMentionSpot
protected static gate.CorpusController corpusController_XGAPPcorefMentionSpot
-
corpusController_XGAPPcausality
protected static gate.CorpusController corpusController_XGAPPcausality
-
corpusController_XGAPPmetaAnnotator
protected static gate.CorpusController corpusController_XGAPPmetaAnnotator
-
-
Constructor Detail
-
DocumentImpl
public DocumentImpl()
-
DocumentImpl
public DocumentImpl(gate.Document gateDoc) throws InternalProcessingException
- Throws:
InternalProcessingException
-
-
Method Detail
-
initDocPointers
public static void initDocPointers(ImporterPDFX ref1, ImporterPDFEXT ref2, ImporterGROBID ref3, ImporterJATS ref4, Map<LangENUM,MateParser> ref5, RhetoricalClassifier ref6, TermAnnotator ref7, MetaAnnotator ref8, LanguageDetector ref9, InlineCitationSpotter ref10, CitationLinker ref11, BiblioEntryParser ref12, HeaderAnalyzer ref13, CorefChainBuilder ref14, BabelnetAnnotator ref15, LexRankSummarizer ref16, TitleSimSummarizer ref17, gate.CorpusController ref18, gate.CorpusController ref19, gate.CorpusController ref20, gate.CorpusController ref21, gate.CorpusController ref22, gate.CorpusController ref23, gate.CorpusController ref24)
-
getName
public String getName() throws InternalProcessingException
Description copied from interface: Document
Get the name of the document.
- Specified by:
getName in interface Document
- Returns:
- Throws:
InternalProcessingException
-
loadXML
public void loadXML(String absoluteFilePath) throws DRIexception
Description copied from interface: Document
Load the XML string-serialized contents of the document (UTF-8) from a file, by specifying the file's absolute path.
- Specified by:
loadXML in interface Document
- Parameters:
absoluteFilePath - the absolute path of the file with the XML string-serialized contents of the document to load
- Throws:
DRIexception
-
loadXMLString
public void loadXMLString(String XMLStringContents) throws DRIexception
Description copied from interface: Document
Load the XML string-serialized contents of the document from a string (UTF-8 char encoding).
- Specified by:
loadXMLString in interface Document
- Parameters:
XMLStringContents - the String with the XML serialized contents to load
- Throws:
InternalProcessingException
DRIexception
-
loadXML
public void loadXML(File file) throws DRIexception
Description copied from interface: Document
Load the XML string-serialized contents of the document (UTF-8) from a file.
- Specified by:
loadXML in interface Document
- Parameters:
file - the file with the XML string-serialized contents of the document to load
- Throws:
DRIexception
-
getXMLString
public String getXMLString() throws InternalProcessingException
Description copied from interface: Document
Get the XML string-serialized contents of the document, as a string (UTF-8 char encoding).
- Specified by:
getXMLString in interface Document
- Returns:
- the String representing the contents of the Document
- Throws:
InternalProcessingException
-
getXMLDocument
public Document getXMLDocument() throws InternalProcessingException
Description copied from interface: Document
Get the contents of the document as an instance of org.w3c.dom.Document.
- Specified by:
getXMLDocument in interface Document
- Returns:
- the document as an instance of the class org.w3c.dom.Document
- Throws:
InternalProcessingException
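
The relationship between getXMLString() and getXMLDocument() can be illustrated with the JDK alone. The following is a minimal sketch, not the library's implementation: it assumes the serialized contents are well-formed XML, and the class XmlStringSketch and its parse method are hypothetical names introduced only for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical helper class; it is not part of edu.upf.taln.dri.lib.
final class XmlStringSketch {

    // Parse an XML string into a DOM tree using the JDK's built-in parser.
    static org.w3c.dom.Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Encode explicitly as UTF-8, matching the encoding named in the documentation.
            byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
            return builder.parse(new ByteArrayInputStream(utf8));
        } catch (Exception e) {
            // Wrap checked parser and I/O exceptions so callers need no throws clause.
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The sample string stands in for the value returned by getXMLString().
        org.w3c.dom.Document dom = parse("<doc><title>Sample</title></doc>");
        System.out.println(dom.getDocumentElement().getNodeName());
    }
}
```

The fully qualified name org.w3c.dom.Document is used on purpose, since the simple name Document also refers to the library's own interface.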
-
getRawText
public String getRawText() throws InternalProcessingException
Description copied from interface: Document
Get the raw text of the document (UTF-8 encoded).
- Specified by:
getRawText in interface Document
- Returns:
- the UTF-8 encoded text of the document
- Throws:
InternalProcessingException
-
preprocess
public void preprocess() throws InternalProcessingException
Description copied from interface: Document
Pre-compute the text analysis of the document in order to speed up the execution of the extract methods.
- Specified by:
preprocess in interface Document
- Throws:
InternalProcessingException
-
extractSections
public List<Section> extractSections(Boolean onlyRoot) throws InternalProcessingException
Description copied from interface: Document
Get the sections (or a subset of the sections) of the document.
- Specified by:
extractSections in interface Document
- Parameters:
onlyRoot - if true, extract only the top-level sections (h1)
- Returns:
- Throws:
InternalProcessingException
-
extractSentences
public List<Sentence> extractSentences(SentenceSelectorENUM sentenceSel) throws InternalProcessingException
Description copied from interface: Document
Load the list of sentences of the document, ordered by their occurrence in the document. If sentences have not been extracted yet, the document text is split into sentences the first time this method is executed.
- Specified by:
extractSentences in interface Document
- Parameters:
sentenceSel - the type of sentence to select
- Returns:
- the set of sentences in document order
- Throws:
InternalProcessingException
-
extractSentenceById
public Sentence extractSentenceById(int sentenceId) throws InternalProcessingException
Description copied from interface: Document
Get one of the sentences of the document by id.
- Specified by:
extractSentenceById in interface Document
- Returns:
- null if the sentence id is not a valid id
- Throws:
InternalProcessingException
-
extractTerminology
public List<CandidateTermOcc> extractTerminology() throws DRIexception
Description copied from interface: Document
Load the list of terms extracted from the document. If the terminology has not been extracted yet, relevant terms are extracted from the document the first time this method is executed.
- Specified by:
extractTerminology in interface Document
- Returns:
- the list of terms extracted from the document
- Throws:
DRIexception
-
extractSummary
public List<Sentence> extractSummary(int sentNumber, SummaryTypeENUM summaryType) throws InternalProcessingException
Description copied from interface: Document
Generate a summary of the paper by selecting a relevant set of sentences. Sentences are ordered by their relevance in descending order.
- Specified by:
extractSummary in interface Document
- Parameters:
sentNumber - number of sentences to select, from 1 to 30
- Returns:
- Throws:
InternalProcessingException
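
The selection described for extractSummary (pick a number of sentences and return them by descending relevance) can be sketched as a plain top-N ranking. This is only an illustration under stated assumptions: the relevance scores are supplied by the caller rather than computed by a summarizer, rejecting a sentNumber outside 1 to 30 is an assumed behavior (the documentation states the range but not what happens outside it), and SummarySketch and topSentences are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; it does not reproduce the library's summarizers.
final class SummarySketch {

    // Return up to sentNumber sentences, highest relevance score first.
    static List<String> topSentences(Map<String, Double> scores, int sentNumber) {
        // Assumption: out-of-range values are rejected.
        if (sentNumber < 1 || sentNumber > 30) {
            throw new IllegalArgumentException("sentNumber must be between 1 and 30");
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        // Sort by score in descending order.
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> summary = new ArrayList<>();
        for (Map.Entry<String, Double> entry : ranked) {
            if (summary.size() == sentNumber) {
                break;
            }
            summary.add(entry.getKey());
        }
        return summary;
    }

    public static void main(String[] args) {
        // Made-up sentences and scores, for illustration only.
        Map<String, Double> scores = new HashMap<>();
        scores.put("We propose a new parser.", 0.9);
        scores.put("Related work is discussed.", 0.2);
        scores.put("Results improve on the baseline.", 0.5);
        System.out.println(topSentences(scores, 2));
    }
}
```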
-
extractSentenceGraph
public DependencyGraph extractSentenceGraph(int sentenceId, SentGraphTypeENUM graphType) throws DRIexception
Description copied from interface: Document
Get the graph representing a sentence. The id of the sentence can be retrieved by the method extractSentences().
NB: an experimental approach to merging sentence graphs is implemented.
- Specified by:
extractSentenceGraph in interface Document
- Returns:
- Throws:
DRIexception
-
extractDocumentGraph
public DependencyGraph extractDocumentGraph(SentenceSelectorENUM sentenceSel) throws DRIexception
Description copied from interface: Document
Get the graph representing a portion of a document. The nodes of the graph are merged by relying on co-reference chains.
- Specified by:
extractDocumentGraph in interface Document
- Returns:
- Throws:
DRIexception
-
extractHeader
public Header extractHeader() throws InternalProcessingException
Description copied from interface: Document
Extract the information retrieved by parsing the header of the paper.
- Specified by:
extractHeader in interface Document
- Returns:
- Throws:
InternalProcessingException
-
extractCitations
public List<Citation> extractCitations() throws InternalProcessingException
Description copied from interface: Document
Get the list of citations extracted from the document.
- Specified by:
extractCitations in interface Document
- Returns:
- Throws:
InternalProcessingException
-
parsingHeader
public void parsingHeader(boolean force) throws InternalProcessingException
- Throws:
InternalProcessingException
-
resetHeader
public void resetHeader() throws InternalProcessingException
- Throws:
InternalProcessingException
-
parsingSentences
public void parsingSentences(boolean force) throws InternalProcessingException
- Throws:
InternalProcessingException
-
parsingCitations_Spot
public void parsingCitations_Spot(boolean force) throws InternalProcessingException
- Throws:
InternalProcessingException
-
resetCitations_Spot
public void resetCitations_Spot() throws InternalProcessingException
- Throws:
InternalProcessingException
-
parsingCitations_Link
public void parsingCitations_Link(boolean force) throws InternalProcessingException
- Throws:
InternalProcessingException
-
resetCitations_Link
public void resetCitations_Link() throws InternalProcessingException
- Throws:
InternalProcessingException
-
parsingCitations_Enrich
public void parsingCitations_Enrich(boolean force) throws InternalProcessingException
- Throws:
InternalProcessingException
-
resetCitations_Enrich
public void resetCitations_Enrich() throws InternalProcessingException
- Throws:
InternalProcessingException
-
parsingDep
public void parsingDep(boolean force) throws InternalProcessingException
- Throws:
InternalProcessingException
-
resetDep
public void resetDep() throws InternalProcessingException
- Throws:
InternalProcessingException
-
parsingCoref
public void parsingCoref(boolean force) throws InternalProcessingException
- Throws:
InternalProcessingException
-
resetCoref
public void resetCoref() throws InternalProcessingException
- Throws:
InternalProcessingException
-
parsingCausality
public void parsingCausality(boolean force) throws InternalProcessingException
- Throws:
InternalProcessingException
-
resetCausality
public void resetCausality() throws InternalProcessingException
- Throws:
InternalProcessingException
-
parsingBabelNet
public void parsingBabelNet(boolean force) throws InternalProcessingException
- Throws:
InternalProcessingException
-
resetBabelNet
public void resetBabelNet() throws InternalProcessingException
- Throws:
InternalProcessingException
-
parsingRhetoricalClass
public void parsingRhetoricalClass(boolean force) throws InternalProcessingException
- Throws:
InternalProcessingException
-
resetRhetoricalClass
public void resetRhetoricalClass() throws InternalProcessingException
- Throws:
InternalProcessingException
-
parsingTerminology
public void parsingTerminology(boolean force) throws InternalProcessingException
- Throws:
InternalProcessingException
-
resetTerminology
public void resetTerminology() throws InternalProcessingException
- Throws:
InternalProcessingException
-
parsingMetaAnnotations
public void parsingMetaAnnotations(boolean force) throws InternalProcessingException
- Throws:
InternalProcessingException
-
resetMetaAnnotations
public void resetMetaAnnotations() throws InternalProcessingException
- Throws:
InternalProcessingException
-
parsingSummary
public void parsingSummary(boolean force) throws InternalProcessingException
- Throws:
InternalProcessingException
-
resetSummary
public void resetSummary() throws InternalProcessingException
- Throws:
InternalProcessingException
-
resetDocumentExtractionData
public void resetDocumentExtractionData() throws InternalProcessingException
Description copied from interface: Document
This method deletes all the data extracted from the original document, including sentences, terminology, citations, etc. After calling this method on a Document object, the next time sentences, terminology, citations, etc. are accessed, they are extracted again rather than read from the output of a previous extraction run.
- Specified by:
resetDocumentExtractionData in interface Document
- Throws:
InternalProcessingException
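
The extract-once, re-extract-after-reset behavior described above can be pictured as a lazily filled cache. The sketch below is a stand-in written for this page, not the library's code: LazyExtractionSketch is a hypothetical class, the sentence splitter is a naive regular expression, and the run counter exists only to make the caching visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in that mimics only the documented caching behavior.
final class LazyExtractionSketch {
    private final String text;
    private List<String> sentences; // null until the first extraction
    private int extractionRuns = 0;

    LazyExtractionSketch(String text) {
        this.text = text;
    }

    // Extract on first access, then serve the stored result.
    List<String> extractSentences() {
        if (sentences == null) {
            // Naive split after each full stop; a real splitter is far more involved.
            sentences = new ArrayList<>(Arrays.asList(text.split("(?<=\\.)\\s+")));
            extractionRuns++;
        }
        return sentences;
    }

    // Drop the stored result so the next access extracts again.
    void resetDocumentExtractionData() {
        sentences = null;
    }

    int extractionRuns() {
        return extractionRuns;
    }

    public static void main(String[] args) {
        LazyExtractionSketch doc = new LazyExtractionSketch("First one. Second one.");
        doc.extractSentences();
        doc.extractSentences(); // served from the stored result
        doc.resetDocumentExtractionData();
        doc.extractSentences(); // extracted again
        System.out.println(doc.extractionRuns());
    }
}
```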
-
getSourceDocumentType
public SourceENUM getSourceDocumentType() throws InternalProcessingException
Description copied from interface: Document
Get the original document type from which the Document instance has been created. The possible document types are the values of SourceENUM.
- Specified by:
getSourceDocumentType in interface Document
- Returns:
- Throws:
InternalProcessingException
-
cleanUp
public void cleanUp() throws InternalProcessingException
Description copied from interface: Document
Call this method only WHEN YOU ARE SURE YOU WILL NO LONGER USE THE DOCUMENT. This method cleans all the document data structures, making the memory they occupy available for garbage collection. Note that if you try to access or call methods of the document after calling this method, an exception will be raised to state that the resource has already been closed and its data cleaned.
- Specified by:
cleanUp in interface Document
- Throws:
InternalProcessingException
-
isCleanUp
public boolean isCleanUp() throws InternalProcessingException
Description copied from interface: Document
Check if the document data structures have been cleaned by calling the cleanUp() method. A cleaned-up document can no longer be used; if you try to access or call methods of the document after cleanUp() has been called, an exception will be raised to state that the resource has already been closed and its data cleaned.
- Specified by:
isCleanUp in interface Document
- Returns:
- true if the document data structures have been cleaned.
- Throws:
InternalProcessingException
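
The cleanUp() and isCleanUp() contract amounts to a guard checked before every access. The following sketch shows that pattern on a hypothetical stand-in class, CleanUpSketch; it is not DocumentImpl, and the use of IllegalStateException is an assumption, since the documentation only says that an exception is raised.

```java
// Hypothetical stand-in that mimics only the documented cleanUp()/isCleanUp() contract.
final class CleanUpSketch {
    private String rawText;
    private boolean cleaned = false;

    CleanUpSketch(String rawText) {
        this.rawText = rawText;
    }

    String getRawText() {
        ensureUsable();
        return rawText;
    }

    // Release the held data so it becomes eligible for garbage collection.
    void cleanUp() {
        rawText = null;
        cleaned = true;
    }

    boolean isCleanUp() {
        return cleaned;
    }

    // Assumption: an unchecked exception signals use after clean-up.
    private void ensureUsable() {
        if (cleaned) {
            throw new IllegalStateException("document already cleaned up");
        }
    }

    public static void main(String[] args) {
        CleanUpSketch doc = new CleanUpSketch("Some raw text");
        System.out.println(doc.getRawText());
        doc.cleanUp();
        // Check the flag before touching a document that may have been released.
        if (!doc.isCleanUp()) {
            System.out.println(doc.getRawText());
        }
    }
}
```

Checking isCleanUp() before each access, as in main, avoids relying on the exception for control flow.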
-
-