SUMMA NEs Statistics
Functionality
Adds to each token (or specified annotation) in the document the following statistics:
- Number of times the token appears in the document, sentence, and paragraph (if they are present in the document)
- The token's inverse document frequency (IDF)
- The token's term frequency multiplied by its inverse document frequency (TF*IDF)
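
As a rough illustration of the statistics listed above, the following Python sketch computes a count, an IDF, and a TF*IDF value per token. It is not the SUMMA implementation: the function name, the doc_freq/num_docs layout standing in for the IDF table, the logarithmic IDF formula, and the handling of unseen tokens are all assumptions made for the example.

```python
import math
from collections import Counter

def token_statistics(tokens, doc_freq, num_docs):
    """Return {token: (tf, idf, tf_idf)} for the tokens of one document.

    doc_freq and num_docs are a hypothetical stand-in for a preloaded
    IDF table: doc_freq maps a token to the number of corpus documents
    containing it, num_docs is the corpus size.
    """
    counts = Counter(tokens)  # term frequency within this document
    stats = {}
    for token, tf in counts.items():
        # Assumption: a token missing from the table is treated as if
        # it occurred in exactly one corpus document.
        df = doc_freq.get(token, 1)
        idf = math.log(num_docs / df)  # assumed IDF formula
        stats[token] = (tf, idf, tf * idf)
    return stats

tokens = ["the", "cat", "sat", "on", "the", "mat"]
doc_freq = {"the": 100, "cat": 10, "sat": 5, "on": 100, "mat": 1}
stats = token_statistics(tokens, doc_freq, num_docs=100)
```

In this example "the" appears twice but occurs in every corpus document, so its IDF and TF*IDF are both zero, while rarer tokens such as "mat" receive higher weights.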
Parameters of the Resource
- annSet: the annotation set where the annotations live
- annType: the name of the annotations you want the statistics for (e.g. Token)
- featureName: the feature you want to use for computing the statistics (e.g. string of the Token)
- kindF: a feature of the annotation used to restrict which annotations are considered (e.g. the kind feature of the Token, if you want to compute statistics only for words and not for numbers)
- kindV: the value that kindF must have for an annotation to be included in the statistics (e.g. word for the kind of the Token)
- parAnn: the annotation representing the paragraph
- sentAnn: the annotation representing the sentence
- sentStat: the name (prefix) of the feature for the sentence statistics
- paraStat: the name (prefix) of the feature for the paragraph statistics
- tokenStat: the name (prefix) of the feature for the document statistics (if this is set to 'token', the features 'token', 'token_idf', and 'token_tf_idf' will be created with the appropriate values)
- table: a SUMMA IDF table that must be loaded before running this component
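
To make the interplay of these parameters more concrete, here is a small Python sketch that mimics the behaviour on plain dictionaries rather than on real GATE annotations. It is only an approximation under stated assumptions: the annotate function, the dictionary representation of annotations, and the idf_table mapping are hypothetical, and the sentence and paragraph statistics (sentStat, paraStat) are left out.

```python
from collections import Counter

def annotate(annotations, feature_name="string", kind_f="kind",
             kind_v="word", token_stat="token", idf_table=None):
    """Add document-level statistics to the annotations that pass the filter.

    annotations is a list of dicts standing in for annotation feature maps
    (hypothetical representation); idf_table maps a feature value to an IDF.
    """
    idf_table = idf_table or {}
    # kindF/kindV: keep only annotations whose kind_f feature equals kind_v.
    selected = [a for a in annotations if a.get(kind_f) == kind_v]
    # featureName: the feature whose values are counted.
    counts = Counter(a[feature_name] for a in selected)
    for a in selected:
        value = a[feature_name]
        tf = counts[value]
        # Assumption: values missing from the table get an IDF of 0.0.
        idf = idf_table.get(value, 0.0)
        # tokenStat: prefix for the three document-level features.
        a[token_stat] = tf
        a[token_stat + "_idf"] = idf
        a[token_stat + "_tf_idf"] = tf * idf
    return annotations

annotations = [
    {"string": "cat", "kind": "word"},
    {"string": "42", "kind": "number"},
    {"string": "cat", "kind": "word"},
]
annotate(annotations, idf_table={"cat": 2.0})
```

After the call, both "cat" annotations carry token, token_idf, and token_tf_idf, while the number annotation is left untouched because its kind does not match kindV.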
Restriction
The document must already contain the annotations and features that the component needs in order to work correctly. The IDF table you use must have been computed from the same kind of annotations as those you want statistics for, i.e. if you want to compute "Token" statistics, then your IDF table should contain Token statistics.
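
One way to catch a mismatch between the document and the IDF table before running the component is to check how many of the values in the document actually appear in the table. The Python sketch below illustrates the idea; the table_coverage helper and the dictionary used as a table are hypothetical and not part of SUMMA.

```python
def table_coverage(values, idf_table):
    """Return the fraction of distinct values that have an IDF table entry."""
    distinct = set(values)
    if not distinct:
        return 1.0  # nothing to look up, so nothing is missing
    found = sum(1 for v in distinct if v in idf_table)
    return found / len(distinct)

coverage = table_coverage(["cat", "dog", "cat"], {"cat": 1.2})
```

A low coverage value suggests that the table was built from a different annotation type or feature than the one being processed.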