Class TFIDFVectorWiki


  • public class TFIDFVectorWiki
    extends Object
    Utility class to compute TF-IDF vectors from textual excerpts.
    • Field Detail

      • onlyWordKind

        public boolean onlyWordKind
      • getLemma

        public boolean getLemma
      • toLowerCase

        public boolean toLowerCase
      • removeStopWords

        public boolean removeStopWords
      • appendPOS

        public boolean appendPOS
      • stopWordsList

        public Set<String> stopWordsList
    • Method Detail

      • computeTFIDFvect

        public Map<String,Double> computeTFIDFvect(gate.Annotation ann,
                                                   gate.Document doc)
      • cosSimTFIDF

        public double cosSimTFIDF(Map<String,Double> tokenDoc1,
                                  Map<String,Double> tokenDoc2)
        Computes the TF-IDF similarity between the two token vectors.
        Term frequency (TF) of a sentence: the number of times the token appears in the sentence.
        Inverse document frequency (IDF): the logarithm of the total number of documents divided by the number of documents in which the token appears.
        Parameters:
        tokenDoc1 -
        tokenDoc2 -
        Returns:
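
The method name suggests a cosine similarity over two token-to-weight maps. The following is a minimal sketch of that computation, not the library's implementation: the class `CosineSketch` and its method `cosine` are hypothetical names, and returning 0.0 for a zero-norm vector is an assumption.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of TFIDFVectorWiki. */
final class CosineSketch {
    private CosineSketch() {}

    /**
     * Cosine similarity of two sparse vectors keyed by token.
     * Assumption: returns 0.0 when either vector has zero norm.
     */
    static double cosine(Map<String, Double> v1, Map<String, Double> v2) {
        double dot = 0.0;
        // Only tokens present in both maps contribute to the dot product.
        for (Map.Entry<String, Double> e : v1.entrySet()) {
            Double w2 = v2.get(e.getKey());
            if (w2 != null) {
                dot += e.getValue() * w2;
            }
        }
        double n1 = norm(v1);
        double n2 = norm(v2);
        if (n1 == 0.0 || n2 == 0.0) {
            return 0.0;
        }
        return dot / (n1 * n2);
    }

    /** Euclidean norm of a sparse vector. */
    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }
}
```

Under these assumptions, two identical non-empty vectors score 1.0 and two vectors with no token in common score 0.0.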
      • cosSimTFIDF

        public double cosSimTFIDF(gate.Annotation ann1,
                                  gate.Document doc1,
                                  gate.Annotation ann2,
                                  gate.Document doc2)
        Computes the TF-IDF similarity between the two document annotations.
        Term frequency (TF) of a sentence: the number of times the token appears in the sentence.
        Inverse document frequency (IDF): the logarithm of the total number of documents divided by the number of documents in which the token appears.
        Parameters:
        ann1 -
        doc1 -
        ann2 -
        doc2 -
        Returns:
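
The TF and IDF definitions above can be sketched on plain token lists. This is an illustration under stated assumptions, not the library's code: `TfIdfSketch` and `tfIdf` are hypothetical names, the corpus is passed in explicitly as a list of token lists, the natural logarithm is assumed (the description only says "logarithm"), and tokens that appear in no corpus document are skipped to avoid dividing by zero.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of TFIDFVectorWiki. */
final class TfIdfSketch {
    private TfIdfSketch() {}

    /**
     * Weights each token of the sentence by tf * log(N / df), where tf is the
     * token count in the sentence, N the number of corpus documents, and df
     * the number of corpus documents containing the token.
     */
    static Map<String, Double> tfIdf(List<String> sentence, List<List<String>> corpus) {
        // Term frequency: raw count of each token in the sentence.
        Map<String, Integer> tf = new HashMap<>();
        for (String token : sentence) {
            tf.merge(token, 1, Integer::sum);
        }

        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            // Document frequency: number of documents containing the token.
            int df = 0;
            for (List<String> doc : corpus) {
                if (doc.contains(e.getKey())) {
                    df++;
                }
            }
            if (df == 0) {
                continue; // assumption: ignore tokens unseen in the corpus
            }
            double idf = Math.log((double) corpus.size() / df);
            vector.put(e.getKey(), e.getValue() * idf);
        }
        return vector;
    }
}
```

A token that occurs in every document gets weight 0 under this definition, because log(N / N) = 0.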
      • extractTokenList

        public static List<String> extractTokenList(gate.Annotation ann,
                                                    gate.Document doc,
                                                    TokenFilterInterface tokenFilter,
                                                    boolean onlyWordKind,
                                                    boolean getLemma,
                                                    boolean toLowerCase,
                                                    boolean appendPOS,
                                                    boolean removeStopWords,
                                                    Set<String> stopWordsList)
        Given an annotation of a TDDocument, extracts the list of tokens (a token is repeated when it occurs more than once).
        Parameters:
        ann -
        doc -
        tokenFilter -
        onlyWordKind -
        getLemma -
        toLowerCase -
        appendPOS -
        removeStopWords -
        stopWordsList -
        Returns:
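
To illustrate how the `toLowerCase`, `removeStopWords`, and `stopWordsList` flags might combine, here is a rough sketch on plain strings. It does not use the GATE API and is not the library's implementation: `TokenListSketch` and `filterTokens` are hypothetical names, the input is assumed to be already tokenized, and the `onlyWordKind`, `getLemma`, `appendPOS`, and `tokenFilter` options are left out because they depend on annotation features not described here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not part of TFIDFVectorWiki. */
final class TokenListSketch {
    private TokenListSketch() {}

    /**
     * Returns the tokens in order, keeping repeated occurrences.
     * Assumption: lowercasing is applied before the stop-word check.
     */
    static List<String> filterTokens(List<String> rawTokens,
                                     boolean toLowerCase,
                                     boolean removeStopWords,
                                     Set<String> stopWordsList) {
        List<String> out = new ArrayList<>();
        for (String raw : rawTokens) {
            String token = toLowerCase ? raw.toLowerCase(Locale.ROOT) : raw;
            if (removeStopWords && stopWordsList != null && stopWordsList.contains(token)) {
                continue; // drop stop words only when requested
            }
            out.add(token);
        }
        return out;
    }
}
```

Because duplicates are kept, the resulting list can be fed directly into a term-frequency count.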