gateplugin-LearningFramework
Implement additional feature functions, like wordshape and character n-grams
Implement some standard feature functions: wordshape, character n-grams with a maximum n or a range of n, and prefixes or suffixes of length <= n. These features should also be usable in windows (ATTRLIST). In theory we could create them beforehand as separate per-instance features, but having the feature generation code do this is more convenient. A more complex piece of functionality would be generating certain features only for rare instances, making use of a pre-computed frequency table (see Curran and Clark 2003, Language Independent NER Using a Maximum Entropy Tagger). Try to be compatible with, or create features similar to, what the Stanford feature factory does: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ie/NERFeatureFactory.html
If the feature function is a pure function of a value that is easily available from the original annotation, it should be implemented as a static method that can be used from any client code. That way the feature function can be (pre)calculated in a separate step. However, some of the values from which to calculate the function may only be available as a result of feature extraction, so we also need a way to specify all these functions in the attribute definition.
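A rough sketch of what such static methods could look like, for wordshape and prefixes/suffixes. The class name `FeatureFunctions`, the method names, and the shape alphabet (X, x, d, everything else kept as-is) are all assumptions for illustration, not existing LF code and not necessarily identical to any of the Stanford wordshape variants.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; none of these names exist in the LF code base.
final class FeatureFunctions {
    private FeatureFunctions() {}

    // Map each character to a class symbol: upper -> X, lower -> x, digit -> d.
    // Any other character (punctuation etc.) is copied unchanged.
    static String wordShape(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isUpperCase(c)) {
                sb.append('X');
            } else if (Character.isLowerCase(c)) {
                sb.append('x');
            } else if (Character.isDigit(c)) {
                sb.append('d');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // All prefixes of length 1..maxLen (capped at the token length).
    static List<String> prefixes(String token, int maxLen) {
        List<String> out = new ArrayList<>();
        int limit = Math.min(maxLen, token.length());
        for (int n = 1; n <= limit; n++) {
            out.add(token.substring(0, n));
        }
        return out;
    }

    // All suffixes of length 1..maxLen (capped at the token length).
    static List<String> suffixes(String token, int maxLen) {
        List<String> out = new ArrayList<>();
        int limit = Math.min(maxLen, token.length());
        for (int n = 1; n <= limit; n++) {
            out.add(token.substring(token.length() - n));
        }
        return out;
    }
}
```

Since these only depend on the token string, client code could call them in a separate pre-processing step and store the results as features on the annotation.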
Since LF can now make use of list-, set-, and map-valued features, even character n-grams can be pre-calculated, though doing it this way will probably increase the memory required for each document considerably.
See also #48
Character n-grams should be calculated on the fly, specified by something like <CHARNGRAM><NFROM>2</NFROM><NTO>4</NTO><ADDSTARTSTOP/></CHARNGRAM>
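For reference, a minimal sketch of the on-the-fly calculation such a specification might drive. The parameters mirror NFROM, NTO and ADDSTARTSTOP; the class name `CharNgrams` and the choice of `^` and `$` as start/stop markers are assumptions, since the intended semantics of ADDSTARTSTOP are not spelled out here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the existing LF code.
final class CharNgrams {
    // Assumed boundary markers for ADDSTARTSTOP.
    static final String START = "^";
    static final String STOP = "$";

    private CharNgrams() {}

    // Returns all character n-grams of the token for n in [nFrom, nTo],
    // shortest first, left to right. If addStartStop is true, the token is
    // wrapped in the boundary markers before extraction.
    static List<String> charNgrams(String token, int nFrom, int nTo,
                                   boolean addStartStop) {
        if (nFrom < 1 || nTo < nFrom) {
            throw new IllegalArgumentException("need 1 <= nFrom <= nTo");
        }
        String s = addStartStop ? START + token + STOP : token;
        List<String> out = new ArrayList<>();
        for (int n = nFrom; n <= nTo; n++) {
            // No n-grams are produced when n exceeds the string length.
            for (int i = 0; i + n <= s.length(); i++) {
                out.add(s.substring(i, i + n));
            }
        }
        return out;
    }
}
```

With these assumptions, `charNgrams("cat", 2, 2, true)` yields `^c`, `ca`, `at`, `t$`. Generating the n-grams only at feature extraction time avoids storing them on every annotation.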