Package edu.upf.taln.dri.lib.model
Interface Document
-
- All Known Implementing Classes:
DocumentImpl
public interface Document
Interface to access a Document processed by Dr Inventor.
To get an instance of a Document through the Document interface, you must always use one of the Factory methods:
-Factory.createNewDocument()
-Factory.createNewDocument(String absoluteFilePath)
-Factory.createNewDocument(File file)
-
-
Method Summary
Modifier and Type, Method, Description
void cleanUp()
Call this method only WHEN YOU ARE SURE YOU WILL NO LONGER USE THE DOCUMENT.
List<Citation> extractCitations()
Get the list of citations extracted from the document.
DependencyGraph extractDocumentGraph(SentenceSelectorENUM sentenceSel)
Get the graph representing a portion of a document.
Header extractHeader()
Extract the information retrieved by parsing the header of the paper.
List<Section> extractSections(Boolean onlyRoot)
Get the sections (or a subset of the sections) of the document.
Sentence extractSentenceById(int sentenceId)
Get one of the sentences of the document by id.
DependencyGraph extractSentenceGraph(int sentenceId, SentGraphTypeENUM graphType)
Get the graph representing a sentence.
List<Sentence> extractSentences(SentenceSelectorENUM sentenceSel)
Load the list of sentences of the document, ordered by their occurrence in the document.
List<Sentence> extractSummary(int sentNumber, SummaryTypeENUM summaryType)
Generate a summary of the paper by selecting a relevant set of sentences.
List<CandidateTermOcc> extractTerminology()
Load the list of terms extracted from the document.
String getName()
Get the name of the document.
String getRawText()
Get the raw text of the document (UTF-8 encoded).
SourceENUM getSourceDocumentType()
Get the original document type from which the Document instance has been created.
Document getXMLDocument()
Get the contents of the document as an instance of org.w3c.dom.Document.
String getXMLString()
Get the XML string-serialized contents of the document, as a string (UTF-8 char encoding).
boolean isCleanUp()
Check whether the document data structures have been cleaned by calling the cleanUp() method.
void loadXML(File file)
Load the XML string-serialized contents of the document (UTF-8) from a file.
void loadXML(String absoluteFilePath)
Load the XML string-serialized contents of the document (UTF-8) from a file, by specifying the file's absolute path.
void loadXMLString(String XMLStringContents)
Load the XML string-serialized contents of the document from a string (UTF-8 char encoding).
void preprocess()
Pre-compute the text analysis of the document in order to speed up the execution of the extract methods.
void resetDocumentExtractionData()
This method deletes all the data extracted from the original document, including sentences, terminology, citations, etc.
-
-
-
Method Detail
-
getName
String getName() throws InternalProcessingException
Get the name of the document.
- Returns:
- Throws:
InternalProcessingException
-
getXMLString
String getXMLString() throws InternalProcessingException
Get the XML string-serialized contents of the document, as a string (UTF-8 char encoding).
- Returns:
- the String representing the contents of the Document
- Throws:
InternalProcessingException
-
getXMLDocument
Document getXMLDocument() throws InternalProcessingException
Get the contents of the document as an instance of org.w3c.dom.Document.
- Returns:
- the document as an instance of the class org.w3c.dom.Document
- Throws:
InternalProcessingException
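Since the returned object is a standard org.w3c.dom.Document, it can be inspected with the DOM API that ships with the JDK. The following is a minimal sketch, assuming a small hand-written XML string in place of real Dr Inventor output; the element names `doc` and `sentence` and the class name DomInspectionSketch are hypothetical and do not reflect the library's actual XML schema.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document; // the W3C DOM type, not the Dr Inventor interface
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomInspectionSketch {

    // Parse an XML string into an org.w3c.dom.Document using the JDK parser.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Count the elements with the given tag name anywhere in the document.
    static int countElements(Document dom, String tagName) {
        NodeList nodes = dom.getElementsByTagName(tagName);
        return nodes.getLength();
    }

    public static void main(String[] args) {
        // Hypothetical sample content, standing in for getXMLDocument() output.
        String xml = "<doc><sentence>First.</sentence><sentence>Second.</sentence></doc>";
        Document dom = parse(xml);
        System.out.println(dom.getDocumentElement().getTagName());
        System.out.println(countElements(dom, "sentence"));
    }
}
```

Note that code importing both this interface and org.w3c.dom.Document has to refer to one of the two by its fully qualified name, because both types are called Document.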
-
loadXML
void loadXML(String absoluteFilePath) throws DRIexception
Load the XML string-serialized contents of the document (UTF-8) from a file, by specifying the file's absolute path.
- Parameters:
absoluteFilePath
- the absolute path of the file with the XML string-serialized contents of the document to load
- Throws:
DRIexception
-
loadXML
void loadXML(File file) throws DRIexception
Load the XML string-serialized contents of the document (UTF-8) from a file.
- Parameters:
file
- the file with the XML string-serialized contents of the document to load
- Throws:
DRIexception
-
loadXMLString
void loadXMLString(String XMLStringContents) throws DRIexception
Load the XML string-serialized contents of the document from a string (UTF-8 char encoding).
- Parameters:
XMLStringContents
- the String with the XML serialized contents to load
- Throws:
InternalProcessingException
DRIexception
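The string handed to loadXMLString is described as UTF-8 XML. One way to obtain such a string from a file using only the JDK is sketched below. The class and helper names (Utf8XmlReaderSketch, readUtf8, writeTempUtf8) are hypothetical, and the call to loadXMLString appears only as a comment because the library is not assumed to be on the classpath.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8XmlReaderSketch {

    // Read the whole file and decode it explicitly as UTF-8,
    // rather than relying on the platform default charset.
    static String readUtf8(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for the demo: write content to a temporary file as UTF-8.
    static Path writeTempUtf8(String content) {
        try {
            Path tmp = Files.createTempFile("dri-sketch", ".xml");
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = writeTempUtf8("<doc>caf\u00e9</doc>");
        String xml = readUtf8(tmp);
        System.out.println(xml.length());
        // With the library available, the string could then be passed on:
        // document.loadXMLString(xml);
        Files.delete(tmp);
    }
}
```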
-
getRawText
String getRawText() throws InternalProcessingException
Get the raw text of the document (UTF-8 encoded).
- Returns:
- the UTF-8 encoded text of the document
- Throws:
InternalProcessingException
-
preprocess
void preprocess() throws InternalProcessingException
Pre-compute the text analysis of the document in order to speed up the execution of the extract methods.
- Throws:
InternalProcessingException
-
extractHeader
Header extractHeader() throws InternalProcessingException
Extract the information retrieved by parsing the header of the paper.
- Returns:
- Throws:
InternalProcessingException
-
extractSections
List<Section> extractSections(Boolean onlyRoot) throws InternalProcessingException
Get the sections (or a subset of the sections) of the document.
- Parameters:
onlyRoot
- if true, extract only the top-level sections (h1)
- Returns:
- Throws:
InternalProcessingException
-
extractSentences
List<Sentence> extractSentences(SentenceSelectorENUM sentenceSel) throws InternalProcessingException
Load the list of sentences of the document, ordered by their occurrence in the document. If the sentences have not yet been extracted, the document text is split into sentences the first time this method is executed.
- Parameters:
sentenceSel
- the type of sentence to select
- Returns:
- the set of sentences in document order
- Throws:
InternalProcessingException
-
extractSentenceById
Sentence extractSentenceById(int sentenceId) throws InternalProcessingException
Get one of the sentences of the document by id.
- Parameters:
sentenceId
- the id of the sentence to retrieve
- Returns:
- the sentence with the given id, or null if the id is not valid
- Throws:
InternalProcessingException
-
extractTerminology
List<CandidateTermOcc> extractTerminology() throws DRIexception
Load the list of terms extracted from the document. If the terminology has not yet been extracted, relevant terms are extracted from the document the first time this method is executed.
- Returns:
- the list of terms extracted from the document
- Throws:
DRIexception
-
extractSummary
List<Sentence> extractSummary(int sentNumber, SummaryTypeENUM summaryType) throws InternalProcessingException
Generate a summary of the paper by selecting a relevant set of sentences. Sentences are ordered by relevance, in descending order.
- Parameters:
sentNumber
- the number of sentences to select, from 1 to 30
summaryType
- the type of summary to generate
- Returns:
- Throws:
InternalProcessingException
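The parameter description gives 1 to 30 as the range for sentNumber, but this page does not say what happens outside that range. A caller may therefore want to bring the value into range before calling extractSummary. A minimal sketch, assuming clamping is acceptable to the caller; the class and method names are hypothetical, and the bounds are taken from the parameter description above.

```java
final class SummarySizeSketch {
    // Bounds taken from the documented range of sentNumber.
    static final int MIN_SENTENCES = 1;
    static final int MAX_SENTENCES = 30;

    // Clamp a requested summary size into the documented 1..30 range.
    static int clampSentNumber(int requested) {
        return Math.max(MIN_SENTENCES, Math.min(MAX_SENTENCES, requested));
    }

    public static void main(String[] args) {
        System.out.println(clampSentNumber(0));   // prints 1
        System.out.println(clampSentNumber(12));  // prints 12
        System.out.println(clampSentNumber(45));  // prints 30
    }
}
```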
-
extractSentenceGraph
DependencyGraph extractSentenceGraph(int sentenceId, SentGraphTypeENUM graphType) throws DRIexception
Get the graph representing a sentence. The id of the sentence can be retrieved through the method extractSentences().
NB: an experimental approach to merging sentence graphs is implemented.
- Parameters:
sentenceId
- the id of the sentence
graphType
- the type of sentence graph to build
- Returns:
- Throws:
DRIexception
-
extractDocumentGraph
DependencyGraph extractDocumentGraph(SentenceSelectorENUM sentenceSel) throws DRIexception
Get the graph representing a portion of a document. The nodes of the graph are merged by relying on co-reference chains.
- Parameters:
sentenceSel
- the type of sentence to select
- Returns:
- Throws:
DRIexception
-
extractCitations
List<Citation> extractCitations() throws InternalProcessingException
Get the list of citations extracted from the document.
- Returns:
- Throws:
InternalProcessingException
-
resetDocumentExtractionData
void resetDocumentExtractionData() throws InternalProcessingException
This method deletes all the data extracted from the original document, including sentences, terminology, citations, etc. After calling this method on a Document object, the next time sentences, terminology, citations, etc. are accessed, they are extracted again rather than read from the output of a previous extraction run.
- Throws:
InternalProcessingException
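The behaviour described for the extract methods and for resetDocumentExtractionData (compute on first access, reuse afterwards, recompute after a reset) amounts to a lazy-caching contract. The sketch below illustrates that contract only and is not the library's implementation. The class LazySentences, its counter, and the use of java.text.BreakIterator for sentence splitting are all assumptions made for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class LazySentences {
    private final String text;
    private List<String> cached;      // null until the first extraction
    private int extractionRuns = 0;   // counts how often splitting actually ran

    LazySentences(String text) {
        this.text = text;
    }

    // Split on first access, then serve the cached result.
    List<String> extractSentences() {
        if (cached == null) {
            cached = split(text);
            extractionRuns++;
        }
        return cached;
    }

    // Drop the cached result so the next access extracts again.
    void reset() {
        cached = null;
    }

    int extractionRuns() {
        return extractionRuns;
    }

    // Stand-in sentence splitter based on the JDK's BreakIterator.
    private static List<String> split(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        LazySentences doc = new LazySentences("One sentence. Another sentence.");
        doc.extractSentences();
        doc.extractSentences();                   // served from the cache
        System.out.println(doc.extractionRuns());
        doc.reset();
        doc.extractSentences();                   // extracted again after reset
        System.out.println(doc.extractionRuns());
    }
}
```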
-
getSourceDocumentType
SourceENUM getSourceDocumentType() throws InternalProcessingException
Get the original document type from which the Document instance has been created. The set of document types is given by the values of SourceENUM.
- Returns:
- Throws:
InternalProcessingException
-
cleanUp
void cleanUp() throws InternalProcessingException
Call this method only WHEN YOU ARE SURE YOU WILL NO LONGER USE THE DOCUMENT. This method cleans all the document data structures, making the memory they occupy available for garbage collection. Note that if you try to access or call methods of the document after calling this method, an exception will be raised stating that the resource has already been closed and its data cleaned.
- Throws:
InternalProcessingException
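Because a cleaned-up document raises an exception on further use, one way to structure calling code is to do all the work inside a try block and call cleanUp() in finally. The sketch below uses a hypothetical stand-in interface, MiniDocument, that mirrors only getName(), cleanUp() and isCleanUp() from this page, together with a toy in-memory implementation. None of it is the library's code, and IllegalStateException merely stands in for the library's own exception type.

```java
final class CleanUpSketch {

    // Hypothetical stand-in mirroring three methods of the Document interface.
    interface MiniDocument {
        String getName();
        void cleanUp();
        boolean isCleanUp();
    }

    // Toy implementation: refuses further access once cleaned up.
    static final class InMemoryDocument implements MiniDocument {
        private String name;
        private boolean cleaned = false;

        InMemoryDocument(String name) {
            this.name = name;
        }

        @Override
        public String getName() {
            if (cleaned) {
                throw new IllegalStateException("document already cleaned up");
            }
            return name;
        }

        @Override
        public void cleanUp() {
            name = null;      // drop the reference so it can be collected
            cleaned = true;
        }

        @Override
        public boolean isCleanUp() {
            return cleaned;
        }
    }

    // Use the document, then always release it, even if processing fails.
    static String nameThenCleanUp(MiniDocument doc) {
        try {
            return doc.getName();
        } finally {
            doc.cleanUp();
        }
    }

    public static void main(String[] args) {
        MiniDocument doc = new InMemoryDocument("paper.pdf");
        System.out.println(nameThenCleanUp(doc));
        System.out.println(doc.isCleanUp());
    }
}
```

The point of the finally block is that cleanUp() runs on both the normal and the exceptional path, so the document's memory is released exactly once and no later code holds a usable reference to it.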
-
isCleanUp
boolean isCleanUp() throws InternalProcessingException
Check whether the document data structures have been cleaned by calling the cleanUp() method. A cleaned-up document can no longer be used; if you try to access or call methods of the document after cleanUp() has been called, an exception will be raised stating that the resource has already been closed and its data cleaned.
- Returns:
- true if the document data structures have been cleaned.
- Throws:
InternalProcessingException
-
-