java.lang.Object
- edu.upf.taln.dri.lib.model.DocumentImpl

All Implemented Interfaces:

Document
```
public class DocumentImpl
extends Object
implements Document
```
IMPORTANT: Never instantiate this class directly!

To obtain an instance through the Document interface, always use one of the Factory methods:
- Factory.createNewDocument()
- Factory.createNewDocument(String absoluteFilePath)
- Factory.createNewDocument(File file)
- Factory.getEmptyDocument()
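
The rule above follows the usual interface-plus-factory arrangement: client code names only the interface, and the factory is the single place that calls the constructor. The block below is a minimal, self-contained sketch of that arrangement, not the library's code. `PaperDoc`, `PaperDocImpl` and `PaperDocFactory` are hypothetical stand-ins for `Document`, `DocumentImpl` and `Factory`, introduced so that the example compiles without the library on the classpath.

```java
final class FactorySketch {
    public static void main(String[] args) {
        // Client code depends on the interface type only.
        PaperDoc doc = PaperDocFactory.createNewDocument("/tmp/paper.pdf");
        System.out.println(doc.getName());
    }
}

// Hypothetical stand-in for the Document interface (not the library type).
interface PaperDoc {
    String getName();
}

// Hypothetical stand-in for DocumentImpl: callers never name this class.
final class PaperDocImpl implements PaperDoc {
    private final String name;

    PaperDocImpl(String name) {
        this.name = name;
    }

    @Override
    public String getName() {
        return name;
    }
}

// Hypothetical stand-in for Factory: the only place that calls the constructor.
final class PaperDocFactory {
    private PaperDocFactory() {
    }

    static PaperDoc createNewDocument(String absoluteFilePath) {
        // A real factory would import and parse the file; this sketch only keeps the path as the name.
        return new PaperDocImpl(absoluteFilePath);
    }
}
```

With the real library the call site would presumably look the same, with `Factory.createNewDocument(...)` returning a `Document`.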

Field Summary

Fields
Modifier and Type	Field	Description
`DocCacheManager`	`cacheManager`
`protected static gate.CorpusController`	`corpusController_preprocess_XGAPPpreprocStep1`
`protected static gate.CorpusController`	`corpusController_preprocess_XGAPPpreprocStep2`
`protected static gate.CorpusController`	`corpusController_XGAPPcausality`
`protected static gate.CorpusController`	`corpusController_XGAPPcitMarker`
`protected static gate.CorpusController`	`corpusController_XGAPPcorefMentionSpot`
`protected static gate.CorpusController`	`corpusController_XGAPPheader`
`protected static gate.CorpusController`	`corpusController_XGAPPmetaAnnotator`
`protected static LexRankSummarizer`	`LexRankSummarizer_Resource`
`protected static TitleSimSummarizer`	`TitleSimSummarizer_Resource`

Constructor Summary

Constructors
Constructor Description

DocumentImpl()

DocumentImpl(gate.Document gateDoc)

Method Summary

Modifier and Type	Method	Description
`void`	`cleanUp()`	Call this method only WHEN YOU ARE SURE YOU WILL NO LONGER USE THE DOCUMENT.
`List<Citation>`	`extractCitations()`	Get the list of citations extracted from the document.
`DependencyGraph`	`extractDocumentGraph(SentenceSelectorENUM sentenceSel)`	Get the graph representing a portion of a document.
`Header`	`extractHeader()`	Extract the information retrieved by parsing the header of the paper
`List<Section>`	`extractSections(Boolean onlyRoot)`	Get the sections (or a subset of the sections) of the document
`Sentence`	`extractSentenceById(int sentenceId)`	Get one of the sentences of the document by id
`DependencyGraph`	`extractSentenceGraph(int sentenceId, SentGraphTypeENUM graphType)`	Get the graph representing a sentence.
`List<Sentence>`	`extractSentences(SentenceSelectorENUM sentenceSel)`	Load the list of sentences of the document, ordered by their occurrence in the document.
`List<Sentence>`	`extractSummary(int sentNumber, SummaryTypeENUM summaryType)`	Generate a summary of the paper by selecting a relevant set of sentences.
`List<CandidateTermOcc>`	`extractTerminology()`	Load the list of terms extracted from the document.
`String`	`getName()`	Get the name of the document
`String`	`getRawText()`	Get the raw text of the document (UTF-8 encoded)
`SourceENUM`	`getSourceDocumentType()`	Get the original document type from which the `Document` instance has been created.
`Document`	`getXMLDocument()`	Get the contents of the document as an instance of org.w3c.dom.Document
`String`	`getXMLString()`	Get the XML string-serialized contents of the document, as a string (UTF-8 char encoding)
`static void`	`initDocPointers(ImporterPDFX ref1, ImporterPDFEXT ref2, ImporterGROBID ref3, ImporterJATS ref4, Map<LangENUM,MateParser> ref5, RhetoricalClassifier ref6, TermAnnotator ref7, MetaAnnotator ref8, LanguageDetector ref9, InlineCitationSpotter ref10, CitationLinker ref11, BiblioEntryParser ref12, HeaderAnalyzer ref13, CorefChainBuilder ref14, BabelnetAnnotator ref15, LexRankSummarizer ref16, TitleSimSummarizer ref17, gate.CorpusController ref18, gate.CorpusController ref19, gate.CorpusController ref20, gate.CorpusController ref21, gate.CorpusController ref22, gate.CorpusController ref23, gate.CorpusController ref24)`
`boolean`	`isCleanUp()`	Check whether the document data structures have been cleaned by calling the cleanUp() method.
`void`	`loadXML(File file)`	Load the XML string-serialized contents of the document (UTF-8) from a file
`void`	`loadXML(String absoluteFilePath)`	Load the XML string-serialized contents of the document (UTF-8) from a file, by specifying the file's absolute path
`void`	`loadXMLString(String XMLStringContents)`	Load the XML string-serialized contents of the document from a string (UTF-8 char encoding)
`void`	`parsingBabelNet(boolean force)`
`void`	`parsingCausality(boolean force)`
`void`	`parsingCitations_Enrich(boolean force)`
`void`	`parsingCitations_Link(boolean force)`
`void`	`parsingCitations_Spot(boolean force)`
`void`	`parsingCoref(boolean force)`
`void`	`parsingDep(boolean force)`
`void`	`parsingHeader(boolean force)`
`void`	`parsingMetaAnnotations(boolean force)`
`void`	`parsingRhetoricalClass(boolean force)`
`void`	`parsingSentences(boolean force)`
`void`	`parsingSummary(boolean force)`
`void`	`parsingTerminology(boolean force)`
`void`	`preprocess()`	Pre-compute the text analysis of the document in order to speed up the execution of the extract-methods.
`void`	`resetBabelNet()`
`void`	`resetCausality()`
`void`	`resetCitations_Enrich()`
`void`	`resetCitations_Link()`
`void`	`resetCitations_Spot()`
`void`	`resetCoref()`
`void`	`resetDep()`
`void`	`resetDocumentExtractionData()`	This method deletes all the data extracted from the original document including sentences, terminology, citations, etc.
`void`	`resetHeader()`
`void`	`resetMetaAnnotations()`
`void`	`resetRhetoricalClass()`
`void`	`resetSummary()`
`void`	`resetTerminology()`

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - cacheManager
```
public DocCacheManager cacheManager
```
  - LexRankSummarizer_Resource
```
protected static LexRankSummarizer LexRankSummarizer_Resource
```
  - TitleSimSummarizer_Resource
```
protected static TitleSimSummarizer TitleSimSummarizer_Resource
```
  - corpusController_preprocess_XGAPPpreprocStep1
```
protected static gate.CorpusController corpusController_preprocess_XGAPPpreprocStep1
```
  - corpusController_preprocess_XGAPPpreprocStep2
```
protected static gate.CorpusController corpusController_preprocess_XGAPPpreprocStep2
```
  - corpusController_XGAPPheader
```
protected static gate.CorpusController corpusController_XGAPPheader
```
  - corpusController_XGAPPcitMarker
```
protected static gate.CorpusController corpusController_XGAPPcitMarker
```
  - corpusController_XGAPPcorefMentionSpot
```
protected static gate.CorpusController corpusController_XGAPPcorefMentionSpot
```
  - corpusController_XGAPPcausality
```
protected static gate.CorpusController corpusController_XGAPPcausality
```
  - corpusController_XGAPPmetaAnnotator
```
protected static gate.CorpusController corpusController_XGAPPmetaAnnotator
```
- Constructor Detail
  - DocumentImpl
```
public DocumentImpl()
```
  - DocumentImpl
```
public DocumentImpl(gate.Document gateDoc)
             throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
- Method Detail
  - initDocPointers
```
public static void initDocPointers(ImporterPDFX ref1,
                                   ImporterPDFEXT ref2,
                                   ImporterGROBID ref3,
                                   ImporterJATS ref4,
                                   Map<LangENUM,MateParser> ref5,
                                   RhetoricalClassifier ref6,
                                   TermAnnotator ref7,
                                   MetaAnnotator ref8,
                                   LanguageDetector ref9,
                                   InlineCitationSpotter ref10,
                                   CitationLinker ref11,
                                   BiblioEntryParser ref12,
                                   HeaderAnalyzer ref13,
                                   CorefChainBuilder ref14,
                                   BabelnetAnnotator ref15,
                                   LexRankSummarizer ref16,
                                   TitleSimSummarizer ref17,
                                   gate.CorpusController ref18,
                                   gate.CorpusController ref19,
                                   gate.CorpusController ref20,
                                   gate.CorpusController ref21,
                                   gate.CorpusController ref22,
                                   gate.CorpusController ref23,
                                   gate.CorpusController ref24)
```
  - getName
```
public String getName()
               throws InternalProcessingException
```
    Description copied from interface: Document
    
    Get the name of the document
    
    Specified by:
    
    getName in interface Document
    
    Returns:
    
    Throws:
    
    InternalProcessingException
  - loadXML
```
public void loadXML(String absoluteFilePath)
             throws DRIexception
```
    Description copied from interface: Document
    
    Load the XML string-serialized contents of the document (UTF-8) from a file, by specifying the file's absolute path
    
    Specified by:
    
    loadXML in interface Document
    
    Parameters:
    
    absoluteFilePath - the absolute path of the file with the XML string-serialized contents of the document to load
    
    Throws:
    
    DRIexception
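
    Since this overload expects an absolute path, a relative path can be resolved first with the standard `java.nio.file` API. The following is a small sketch under that assumption; `AbsolutePathSketch` and `toAbsolutePath` are names invented for the example, and the relative path is a placeholder.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class AbsolutePathSketch {
    // Resolve a possibly relative path against the current working directory.
    static String toAbsolutePath(String path) {
        Path p = Paths.get(path);
        return p.toAbsolutePath().normalize().toString();
    }

    public static void main(String[] args) {
        // Placeholder relative path; the printed string is the kind of value loadXML(String) expects.
        String absolute = toAbsolutePath("papers/example.xml");
        System.out.println(absolute);
    }
}
```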
  - loadXMLString
```
public void loadXMLString(String XMLStringContents)
                   throws DRIexception
```
    Description copied from interface: Document
    
    Load the XML string-serialized contents of the document from a string (UTF-8 char encoding)
    
    Specified by:
    
    loadXMLString in interface Document
    
    Parameters:
    
    XMLStringContents - the String with the XML serialized contents to load
    
    Throws:
    
    InternalProcessingException
    
    DRIexception
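
    When the serialized contents live in a file, decoding the bytes explicitly as UTF-8 avoids depending on the platform default charset. The following is a sketch using only JDK classes; `Utf8ReadSketch`, `readUtf8` and `roundTrip` are hypothetical helper names, and the XML snippet is a placeholder.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8ReadSketch {
    // Decode the file bytes explicitly as UTF-8.
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // Write the text to a temporary file as UTF-8 and read it back.
    static String roundTrip(String xml) {
        try {
            Path tmp = Files.createTempFile("doc", ".xml");
            try {
                Files.write(tmp, xml.getBytes(StandardCharsets.UTF_8));
                return readUtf8(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The returned string is the kind of value loadXMLString(String) expects.
        System.out.println(roundTrip("<doc>caf\u00e9</doc>"));
    }
}
```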
  - loadXML
```
public void loadXML(File file)
             throws DRIexception
```
    Description copied from interface: Document
    
    Load the XML string-serialized contents of the document (UTF-8) from a file
    
    Specified by:
    
    loadXML in interface Document
    
    Parameters:
    
    file - the file with the XML string-serialized contents of the document to load
    
    Throws:
    
    DRIexception
  - getXMLString
```
public String getXMLString()
                    throws InternalProcessingException
```
    Description copied from interface: Document
    
    Get the XML string-serialized contents of the document, as a string (UTF-8 char encoding)
    
    Specified by:
    
    getXMLString in interface Document
    
    Returns:
    
    the String representing the contents of the Document
    
    Throws:
    
    InternalProcessingException
  - getXMLDocument
```
public Document getXMLDocument()
                        throws InternalProcessingException
```
    Description copied from interface: Document
    
    Get the contents of the document as an instance of org.w3c.dom.Document
    
    Specified by:
    
    getXMLDocument in interface Document
    
    Returns:
    
    the document as an instance of the class org.w3c.dom.Document
    
    Throws:
    
    InternalProcessingException
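
    An `org.w3c.dom.Document` can be inspected with the standard DOM API. The following sketch parses a placeholder XML string with the JDK parser in place of a value returned by `getXMLDocument()`; the element names are invented for the example and say nothing about the library's actual XML schema. Note that `org.w3c.dom.Document` and the library's `Document` interface share a simple name, so only one of them can be imported in a given source file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomTraversalSketch {
    // Count the elements with the given tag name in an XML string.
    static int countElements(String xml, String tagName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = dom.getElementsByTagName(tagName);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the XML string", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder XML; the element names are hypothetical.
        String xml = "<paper><sentence>First.</sentence><sentence>Second.</sentence></paper>";
        System.out.println(countElements(xml, "sentence"));
    }
}
```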
  - getRawText
```
public String getRawText()
                  throws InternalProcessingException
```
    Description copied from interface: Document
    
    Get the raw text of the document (UTF-8 encoded)
    
    Specified by:
    
    getRawText in interface Document
    
    Returns:
    
    the UTF-8 encoded text of the document
    
    Throws:
    
    InternalProcessingException
  - preprocess
```
public void preprocess()
                throws InternalProcessingException
```
    Description copied from interface: Document
    
    Pre-compute the text analysis of the document in order to speed up the execution of the extract-methods.
    
    Specified by:
    
    preprocess in interface Document
    
    Throws:
    
    InternalProcessingException
  - extractSections
```
public List<Section> extractSections(Boolean onlyRoot)
                              throws InternalProcessingException
```
    Description copied from interface: Document
    
    Get the sections (or a subset of the sections) of the document
    
    Specified by:
    
    extractSections in interface Document
    
    Parameters:
    
    onlyRoot - if equal to true, extract only the top level sections (h1)
    
    Returns:
    
    Throws:
    
    InternalProcessingException
  - extractSentences
```
public List<Sentence> extractSentences(SentenceSelectorENUM sentenceSel)
                                throws InternalProcessingException
```
    Description copied from interface: Document
    
    Load the list of sentences of the document, ordered by their occurrence in the document. If sentences have not been extracted, the first time this method is executed the document text is split into sentences.
    
    Specified by:
    
    extractSentences in interface Document
    
    Parameters:
    
    sentenceSel - the type of sentence to select
    
    Returns:
    
    the set of sentences in document order
    
    Throws:
    
    InternalProcessingException
  - extractSentenceById
```
public Sentence extractSentenceById(int sentenceId)
                             throws InternalProcessingException
```
    Description copied from interface: Document
    
    Get one of the sentences of the document by id
    
    Specified by:
    
    extractSentenceById in interface Document
    
    Returns:
    
    null if the sentence id is null or not a valid id
    
    Throws:
    
    InternalProcessingException
  - extractTerminology
```
public List<CandidateTermOcc> extractTerminology()
                                          throws DRIexception
```
    Description copied from interface: Document
    
    Load the list of terms extracted from the document. If the terminology has not been extracted from the document, the first time this method is executed relevant terms are extracted from the document.
    
    Specified by:
    
    extractTerminology in interface Document
    
    Returns:
    
    the list of terms extracted from the document
    
    Throws:
    
    DRIexception
  - extractSummary
```
public List<Sentence> extractSummary(int sentNumber,
                                     SummaryTypeENUM summaryType)
                              throws InternalProcessingException
```
    Description copied from interface: Document
    
    Generate a summary of the paper by selecting a relevant set of sentences. Sentences are ordered by their relevance in descending order.
    
    Specified by:
    
    extractSummary in interface Document
    
    Parameters:
    
    sentNumber - from 1 to 30
    
    Returns:
    
    Throws:
    
    InternalProcessingException
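
    The selection described above (a bounded number of sentences, most relevant first) can be illustrated with a generic top-N selection. This is a sketch only, not the LexRank or title-similarity summarizer: `Scored` is a hypothetical stand-in for a sentence with a relevance score, and rejecting values outside the documented 1 to 30 range is an assumption of the sketch, since the text does not say how the library handles them.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SummarySelectionSketch {
    // Hypothetical scored sentence; the library's Sentence type is not used here.
    static final class Scored {
        final String text;
        final double relevance;

        Scored(String text, double relevance) {
            this.text = text;
            this.relevance = relevance;
        }
    }

    // Return up to sentNumber sentences, ordered by descending relevance.
    static List<String> topSentences(List<Scored> sentences, int sentNumber) {
        if (sentNumber < 1 || sentNumber > 30) {
            // Assumption: the sketch rejects values outside the documented range.
            throw new IllegalArgumentException("sentNumber must be between 1 and 30");
        }
        List<Scored> sorted = new ArrayList<>(sentences);
        sorted.sort(Comparator.comparingDouble((Scored s) -> s.relevance).reversed());
        List<String> result = new ArrayList<>();
        for (Scored s : sorted.subList(0, Math.min(sentNumber, sorted.size()))) {
            result.add(s.text);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Scored> sentences = new ArrayList<>();
        sentences.add(new Scored("Background.", 0.2));
        sentences.add(new Scored("Main result.", 0.9));
        sentences.add(new Scored("Method.", 0.5));
        System.out.println(topSentences(sentences, 2));
    }
}
```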
  - extractSentenceGraph
```
public DependencyGraph extractSentenceGraph(int sentenceId,
                                            SentGraphTypeENUM graphType)
                                     throws DRIexception
```
    Description copied from interface: Document
    
    Get the graph representing a sentence. The id of the sentence can be retrieved with the method extractSentences(). NB: an experimental approach to merging sentence graphs is implemented.
    
    Specified by:
    
    extractSentenceGraph in interface Document
    
    Returns:
    
    Throws:
    
    DRIexception
  - extractDocumentGraph
```
public DependencyGraph extractDocumentGraph(SentenceSelectorENUM sentenceSel)
                                     throws DRIexception
```
    Description copied from interface: Document
    
    Get the graph representing a portion of a document. The nodes of the graph are merged by relying on co-reference chains.
    
    Specified by:
    
    extractDocumentGraph in interface Document
    
    Returns:
    
    Throws:
    
    DRIexception
  - extractHeader
```
public Header extractHeader()
                     throws InternalProcessingException
```
    Description copied from interface: Document
    
    Extract the information retrieved by parsing the header of the paper
    
    Specified by:
    
    extractHeader in interface Document
    
    Returns:
    
    Throws:
    
    InternalProcessingException
  - extractCitations
```
public List<Citation> extractCitations()
                                throws InternalProcessingException
```
    Description copied from interface: Document
    
    Get the list of citations extracted from the document.
    
    Specified by:
    
    extractCitations in interface Document
    
    Returns:
    
    Throws:
    
    InternalProcessingException
  - parsingHeader
```
public void parsingHeader(boolean force)
                   throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - resetHeader
```
public void resetHeader()
                 throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - parsingSentences
```
public void parsingSentences(boolean force)
                      throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - parsingCitations_Spot
```
public void parsingCitations_Spot(boolean force)
                           throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - resetCitations_Spot
```
public void resetCitations_Spot()
                         throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - parsingCitations_Link
```
public void parsingCitations_Link(boolean force)
                           throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - resetCitations_Link
```
public void resetCitations_Link()
                         throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - parsingCitations_Enrich
```
public void parsingCitations_Enrich(boolean force)
                             throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - resetCitations_Enrich
```
public void resetCitations_Enrich()
                           throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - parsingDep
```
public void parsingDep(boolean force)
                throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - resetDep
```
public void resetDep()
              throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - parsingCoref
```
public void parsingCoref(boolean force)
                  throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - resetCoref
```
public void resetCoref()
                throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - parsingCausality
```
public void parsingCausality(boolean force)
                      throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - resetCausality
```
public void resetCausality()
                    throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - parsingBabelNet
```
public void parsingBabelNet(boolean force)
                     throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - resetBabelNet
```
public void resetBabelNet()
                   throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - parsingRhetoricalClass
```
public void parsingRhetoricalClass(boolean force)
                            throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - resetRhetoricalClass
```
public void resetRhetoricalClass()
                          throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - parsingTerminology
```
public void parsingTerminology(boolean force)
                        throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - resetTerminology
```
public void resetTerminology()
                      throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - parsingMetaAnnotations
```
public void parsingMetaAnnotations(boolean force)
                            throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - resetMetaAnnotations
```
public void resetMetaAnnotations()
                          throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - parsingSummary
```
public void parsingSummary(boolean force)
                    throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - resetSummary
```
public void resetSummary()
                  throws InternalProcessingException
```
    Throws:
    
    InternalProcessingException
  - resetDocumentExtractionData
```
public void resetDocumentExtractionData()
                                 throws InternalProcessingException
```
    Description copied from interface: Document
    
    This method deletes all the data extracted from the original document including sentences, terminology, citations, etc. After calling this method on a Document object, the next time sentences, terminology, citations, etc. from the document are accessed, they are extracted again and not read from the output of a previous extraction process execution.
    
    Specified by:
    
    resetDocumentExtractionData in interface Document
    
    Throws:
    
    InternalProcessingException
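
    The behaviour described above is that of a lazily filled cache that can be invalidated: results are computed on first access, served from memory afterwards, and recomputed after a reset. The following is a minimal sketch of that idea; `LazyExtractionSketch` is a hypothetical stand-in class, the sentence splitter is deliberately naive, and the run counter exists only to make the effect observable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LazyExtractionSketch {
    private final String text;
    private List<String> cachedSentences; // null until the first extraction
    private int extractionRuns = 0;

    LazyExtractionSketch(String text) {
        this.text = text;
    }

    // Extract on first access, then serve the cached result.
    List<String> extractSentences() {
        if (cachedSentences == null) {
            // Naive split after each full stop, used only for illustration.
            cachedSentences = new ArrayList<>(Arrays.asList(text.split("(?<=\\.)\\s+")));
            extractionRuns++;
        }
        return cachedSentences;
    }

    // Drop the cached result so that the next access extracts again.
    void resetDocumentExtractionData() {
        cachedSentences = null;
    }

    int getExtractionRuns() {
        return extractionRuns;
    }

    public static void main(String[] args) {
        LazyExtractionSketch doc = new LazyExtractionSketch("One. Two. Three.");
        doc.extractSentences();
        doc.extractSentences();            // served from the cache
        doc.resetDocumentExtractionData();
        doc.extractSentences();            // extracted again
        System.out.println(doc.getExtractionRuns());
    }
}
```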
  - getSourceDocumentType
```
public SourceENUM getSourceDocumentType()
                                 throws InternalProcessingException
```
    Description copied from interface: Document
    
    Get the original document type from which the Document instance has been created. The possible document types are the values of SourceENUM.
    
    Specified by:
    
    getSourceDocumentType in interface Document
    
    Returns:
    
    Throws:
    
    InternalProcessingException
  - cleanUp
```
public void cleanUp()
             throws InternalProcessingException
```
    Description copied from interface: Document
    
    Call this method only WHEN YOU ARE SURE YOU WILL NO LONGER USE THE DOCUMENT. This method cleans all the document data structures, making the memory occupied by these data available for garbage collection. Note that if you access or call methods of the document after calling this method, an exception is raised stating that the resource has already been closed and its data cleaned.
    
    Specified by:
    
    cleanUp in interface Document
    
    Throws:
    
    InternalProcessingException
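
    The contract described above resembles the common guard used for closeable resources: release the data, record that the object is closed, and reject later calls. The following is a sketch of that guard on a hypothetical stand-in class, `CleanUpSketch`. It throws `IllegalStateException` as an assumption of the sketch; the text above does not name the exception type the library raises in this situation.

```java
final class CleanUpSketch {
    private String rawText;
    private boolean cleanedUp = false;

    CleanUpSketch(String rawText) {
        this.rawText = rawText;
    }

    String getRawText() {
        ensureOpen();
        return rawText;
    }

    // Release the data and mark the object as unusable.
    void cleanUp() {
        rawText = null;
        cleanedUp = true;
    }

    boolean isCleanUp() {
        return cleanedUp;
    }

    private void ensureOpen() {
        if (cleanedUp) {
            // Assumption: the sketch signals misuse with IllegalStateException.
            throw new IllegalStateException("Document already cleaned up");
        }
    }

    public static void main(String[] args) {
        CleanUpSketch doc = new CleanUpSketch("Some text.");
        System.out.println(doc.getRawText());
        doc.cleanUp();
        try {
            doc.getRawText();
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```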
  - isCleanUp
```
public boolean isCleanUp()
                  throws InternalProcessingException
```
    Description copied from interface: Document
    
    Check whether the document data structures have been cleaned by calling the cleanUp() method. A cleaned-up document can no longer be used: if you access or call methods of the document after cleanUp() has been called, an exception is raised stating that the resource has already been closed and its data cleaned.
    
    Specified by:
    
    isCleanUp in interface Document
    
    Returns:
    
    true if the document data structures have been cleaned.
    
    Throws:
    
    InternalProcessingException

