Class TFIDFVectorWiki
java.lang.Object
    edu.upf.taln.dri.module.summary.util.similarity.TFIDFVectorWiki
public class TFIDFVectorWiki extends Object
Utility class to compute TF-IDF vectors from textual excerpts.
Field Summary
Modifier and Type    Field
boolean              appendPOS
boolean              getLemma
boolean              onlyWordKind
boolean              removeStopWords
Set<String>          stopWordsList
boolean              toLowerCase
Constructor Summary
Constructor
TFIDFVectorWiki(SimLangENUM langIN)
Method Summary
Modifier and Type, Method, Description
Map<String,Double> computeTFIDFvect(gate.Annotation ann, gate.Document doc)
double cosSimTFIDF(gate.Annotation ann1, gate.Document doc1, gate.Annotation ann2, gate.Document doc2)
    Computes the TF-IDF similarity between the two document annotations.
    Term frequency of a sentence: the number of times the token appears in the sentence.
    Inverse document frequency: the logarithm of the total number of documents divided by the number of documents in which the token appears.
double cosSimTFIDF(Map<String,Double> tokenDoc1, Map<String,Double> tokenDoc2)
    Computes the TF-IDF similarity between the two token lists.
    Term frequency of a sentence: the number of times the token appears in the sentence.
    Inverse document frequency: the logarithm of the total number of documents divided by the number of documents in which the token appears.
static List<String> extractTokenList(gate.Annotation ann, gate.Document doc, TokenFilterInterface tokenFilter, boolean onlyWordKind, boolean getLemma, boolean toLowerCase, boolean appendPOS, boolean removeStopWords, Set<String> stopWordsList)
    Given an annotation of a TDDocument, extracts the list of tokens (a token that occurs several times is repeated in the list).
Constructor Detail
TFIDFVectorWiki
public TFIDFVectorWiki(SimLangENUM langIN) throws InvalidParameterException
Throws:
    InvalidParameterException
Method Detail
computeTFIDFvect
public Map<String,Double> computeTFIDFvect(gate.Annotation ann, gate.Document doc)
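No description is given for computeTFIDFvect. Going only by the term-frequency and inverse-document-frequency definitions stated for cosSimTFIDF, a TF-IDF vector could be built roughly as in the sketch below. This is an illustration under assumptions, not the class's implementation: TfIdfSketch, tfidf, docFreq and totalDocs are hypothetical names, the tokens are plain strings rather than GATE annotations, and skipping tokens with no corpus count is a choice made for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the TFIDFVectorWiki implementation.
class TfIdfSketch {

    /**
     * Builds a token -> TF-IDF weight map for one sentence.
     * docFreq (documents containing each token) and totalDocs are
     * hypothetical inputs standing in for the real corpus statistics.
     */
    static Map<String, Double> tfidf(List<String> tokens,
                                     Map<String, Integer> docFreq,
                                     int totalDocs) {
        // Term frequency: number of times each token appears in the sentence.
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }

        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Integer df = docFreq.get(e.getKey());
            if (df == null || df == 0) {
                continue; // assumption: tokens without a corpus count are skipped
            }
            // Inverse document frequency: log(total docs / docs containing the token).
            double idf = Math.log((double) totalDocs / df);
            vector.put(e.getKey(), e.getValue() * idf);
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = Map.of("graph", 10, "the", 100);
        System.out.println(tfidf(List.of("the", "graph", "graph"), docFreq, 100));
    }
}
```

In this sketch a token that appears in every document gets weight 0, because log(1) = 0.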
cosSimTFIDF
public double cosSimTFIDF(Map<String,Double> tokenDoc1, Map<String,Double> tokenDoc2)
Computes the TF-IDF similarity between the two token lists.
Term frequency of a sentence: the number of times the token appears in the sentence.
Inverse document frequency: the logarithm of the total number of documents divided by the number of documents in which the token appears.
Parameters:
    tokenDoc1
    tokenDoc2
Returns:
    the TF-IDF similarity of the two token lists
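The method name suggests a cosine similarity over the two maps. A minimal sketch of that computation, assuming each map goes from token to TF-IDF weight, follows. CosineSketch, cosine and norm are hypothetical names, and returning 0 for an all-zero or empty vector is an assumption rather than documented behavior.

```java
import java.util.Map;

// Illustrative sketch only; not the TFIDFVectorWiki implementation.
class CosineSketch {

    /** Cosine similarity of two sparse vectors stored as token -> weight maps. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        // Dot product over the tokens the two vectors share.
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double weight = b.get(e.getKey());
            if (weight != null) {
                dot += e.getValue() * weight;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumption: similarity with a zero vector is 0
        }
        return dot / (normA * normB);
    }

    /** Euclidean length of a sparse vector. */
    static double norm(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double x : v.values()) {
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        Map<String, Double> s1 = Map.of("graph", 2.0, "node", 1.0);
        Map<String, Double> s2 = Map.of("graph", 1.0, "edge", 3.0);
        System.out.println(cosine(s1, s2));
    }
}
```

Iterating over one map and looking tokens up in the other keeps the cost proportional to the size of the first map, which suits sparse sentence vectors.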
cosSimTFIDF
public double cosSimTFIDF(gate.Annotation ann1, gate.Document doc1, gate.Annotation ann2, gate.Document doc2)
Computes the TF-IDF similarity between the two document annotations.
Term frequency of a sentence: the number of times the token appears in the sentence.
Inverse document frequency: the logarithm of the total number of documents divided by the number of documents in which the token appears.
Parameters:
    ann1
    doc1
    ann2
    doc2
Returns:
    the TF-IDF similarity of the two annotations
extractTokenList
public static List<String> extractTokenList(gate.Annotation ann, gate.Document doc, TokenFilterInterface tokenFilter, boolean onlyWordKind, boolean getLemma, boolean toLowerCase, boolean appendPOS, boolean removeStopWords, Set<String> stopWordsList)
Given an annotation of a TDDocument, extracts the list of tokens (a token that occurs several times is repeated in the list).
Parameters:
    ann
    doc
    onlyWordKind
    getLemma
    toLowerCase
    removeStopWords
Returns:
    the list of extracted tokens
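The boolean flags are not described, so the following sketch shows only one plausible reading of them, on plain Java objects instead of GATE annotations. Everything here is an assumption made for illustration: the TokenListSketch and Token names, the "word" kind value, the "_" separator used when appending the part of speech, and the order in which the flags are applied. The tokenFilter parameter is left out because its interface is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; not the TFIDFVectorWiki implementation.
class TokenListSketch {

    // Hypothetical stand-in for a token annotation and its features.
    static final class Token {
        final String text, lemma, pos, kind;

        Token(String text, String lemma, String pos, String kind) {
            this.text = text;
            this.lemma = lemma;
            this.pos = pos;
            this.kind = kind;
        }
    }

    static List<String> extract(List<Token> tokens, boolean onlyWordKind,
                                boolean getLemma, boolean toLowerCase,
                                boolean appendPOS, boolean removeStopWords,
                                Set<String> stopWordsList) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            if (onlyWordKind && !"word".equals(t.kind)) {
                continue; // assumed meaning: drop punctuation, numbers, symbols
            }
            String s = getLemma ? t.lemma : t.text;
            if (toLowerCase) {
                s = s.toLowerCase(Locale.ROOT);
            }
            if (removeStopWords && stopWordsList.contains(s)) {
                continue;
            }
            if (appendPOS) {
                s = s + "_" + t.pos; // assumed separator
            }
            out.add(s); // repeated tokens stay in the list
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
                new Token("The", "the", "DT", "word"),
                new Token("Graphs", "graph", "NNS", "word"),
                new Token(",", ",", ",", "punctuation"),
                new Token("graphs", "graph", "NNS", "word"));
        System.out.println(extract(tokens, true, true, true, false, true, Set.of("the")));
    }
}
```

Keeping repeated tokens in the returned list matters because the term frequency is counted from those repetitions.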