gateplugin-LearningFramework
Add START and STOP symbol representations to sequence learning problems.
This should be doable independently of the algorithm used, i.e. it should work for both classifiers and sequence learners.
cfa27377f4c48962ebf5a60a2951893073c55e9f adds START and STOP features for attribute lists if the within annotation type is specified. In this case an additional feature is generated if a list element starts where the within annotation starts or ends where the within annotation ends.
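The boundary check described above can be sketched roughly as follows. This is only an illustration, not the code from the commit: the class name, the method name, and the use of plain character offsets are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the actual LearningFramework code. */
public class BoundaryFeatures {

    /**
     * Returns the boundary markers that apply to one list element.
     * All four arguments are character offsets (assumed representation).
     */
    public static List<String> boundaryFeatures(long elemStart, long elemEnd,
                                                long withinStart, long withinEnd) {
        List<String> features = new ArrayList<>();
        // Element starts exactly where the within annotation starts.
        if (elemStart == withinStart) {
            features.add("START");
        }
        // Element ends exactly where the within annotation ends.
        if (elemEnd == withinEnd) {
            features.add("STOP");
        }
        return features;
    }
}
```

An element that spans the whole within annotation would get both markers under this sketch.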
8d5d0ecd6789975ed25728c4d1e30e0ded49da91 adds START and STOP features for normal attributes if a within type is specified.
Note that START/STOP features on instances are different from START/STOP elements for sequence tagging: for sequence tagging, the START/STOP symbols must be separate instances in the instance list so that the probabilities for moving from START to the first element and from the last element to STOP can be calculated. We still have to check whether the CRF implementation of Mallet already does this correctly anyway.
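To make the difference concrete, here is a minimal sketch of START/STOP as separate sequence elements. It does not use Mallet at all; the class, the symbol strings, and the label-only view of a sequence are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; independent of Mallet and of the LF code. */
public class SequencePadding {
    // Placeholder symbols, chosen only for this example.
    public static final String START = "<<START>>";
    public static final String STOP = "<<STOP>>";

    /** Wraps a label sequence in explicit START and STOP elements. */
    public static List<String> pad(List<String> labels) {
        List<String> padded = new ArrayList<>(labels.size() + 2);
        padded.add(START);
        padded.addAll(labels);
        padded.add(STOP);
        return padded;
    }

    /** Counts label-to-label transitions, keyed as "from->to". */
    public static Map<String, Integer> transitionCounts(List<String> padded) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < padded.size(); i++) {
            counts.merge(padded.get(i) + "->" + padded.get(i + 1), 1, Integer::sum);
        }
        return counts;
    }
}
```

Only with the padded sequence do transitions such as START to the first label and last label to STOP show up in the counts at all, which is what a START/STOP feature on an ordinary instance cannot provide.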
Current START/STOP feature names cannot be mapped back to any feature specification when trying to export to ARFF, for example. Of course not, there is none. We need to return a dummy feature specification for "invented" features. The method FeatureExtraction.lookupAttributeForFeatureName needs to know about the features created by the LF itself.
The ARFF problem has been fixed for now by handling the case separately where the reverse lookup of the specification returns nothing and the feature is a START/STOP feature. In that case, we just use the default numeric attribute, which works.
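The workaround amounts to something like the sketch below. Everything in it is assumed for illustration: the real reverse lookup is FeatureExtraction.lookupAttributeForFeatureName, whose signature is not shown here, so a plain map stands in for it, and the suffix test for START/STOP names is a guess based on the current naming schema.

```java
import java.util.Map;

/** Hypothetical sketch of the ARFF fallback; not the actual LF code. */
public class ArffTypeFallback {

    /** Simplified stand-in for an ARFF attribute type. */
    public enum AttrType { NUMERIC, NOMINAL }

    /** Assumption: LF-generated boundary features end in |START or |STOP. */
    public static boolean isBoundaryFeature(String featureName) {
        return featureName.endsWith("|START") || featureName.endsWith("|STOP");
    }

    /**
     * Picks the ARFF type for a feature. The map replaces the real
     * reverse lookup from feature name to attribute specification.
     */
    public static AttrType typeFor(String featureName, Map<String, AttrType> specs) {
        AttrType fromSpec = specs.get(featureName);
        if (fromSpec != null) {
            return fromSpec;
        }
        // No specification exists for features the LF invented itself:
        // fall back to the default numeric attribute.
        if (isBoundaryFeature(featureName)) {
            return AttrType.NUMERIC;
        }
        throw new IllegalArgumentException("No specification for feature: " + featureName);
    }
}
```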
Closing this for now on the assumption that at least START is handled correctly internally in the Mallet CRF. Reopen or open another bug if we find out that this is not the case.
Currently the STOP/START features are independent of the actual features from which they are generated. This means that the same feature can get created from several different attribute specifications (which also triggers a warning each time that the feature has already been set).
The current schema of the feature name is null|L-2|STOP
We should probably change this to something like null|L-2|origFeatureName|||STOP
where ||| stands for some separator that is different from the one used for value separation.
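A rough sketch of how the proposed naming could be built and read back follows. The class and method names are made up for the example, and ||| is used literally only because it is the placeholder in the proposal; a real separator would have to be one that cannot occur in feature names or values.

```java
/** Hypothetical sketch of the proposed naming schema; not LF code. */
public class BoundaryFeatureName {
    // Separator already used between name parts and values.
    static final String VALUE_SEP = "|";
    // Placeholder for the new, distinct separator from the proposal.
    static final String BOUNDARY_SEP = "|||";

    /** Builds e.g. null|L-2|origFeatureName|||STOP. */
    public static String build(String prefix, String origFeatureName, String marker) {
        return prefix + VALUE_SEP + origFeatureName + BOUNDARY_SEP + marker;
    }

    /**
     * Recovers the original feature name, or null if the name carries no
     * boundary marker. Assumes the original name contains no VALUE_SEP.
     */
    public static String originalFeature(String featureName) {
        int idx = featureName.lastIndexOf(BOUNDARY_SEP);
        if (idx < 0) {
            return null;
        }
        String head = featureName.substring(0, idx);
        return head.substring(head.lastIndexOf(VALUE_SEP) + 1);
    }
}
```

With the original feature name embedded, two different attribute specifications would no longer produce the same START/STOP feature, so the "already set" warning would not be triggered by them.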