gateplugin-LearningFramework

Add START and STOP symbol representations to sequence learning problems.

Open johann-petrak opened this issue 7 years ago • 7 comments

This should be doable independently of the algorithm used, i.e. it should be possible for both classifiers and sequence learners.

johann-petrak • Jul 24 '17 09:07

cfa27377f4c48962ebf5a60a2951893073c55e9f adds START and STOP features for attribute lists if the within annotation type is specified. In this case an additional feature is generated if a list element starts where the within annotation starts or ends where the within annotation ends.

johann-petrak • Aug 07 '17 19:08

8d5d0ecd6789975ed25728c4d1e30e0ded49da91 adds START and STOP features for normal attributes if a within type is specified.
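
To illustrate the rule used in both commits, here is a minimal sketch, not the actual Learning Framework code. The class name, the method name, the offset parameters and the way the prefix is joined to the marker are all assumptions made for the example; only the condition itself (element starts where the within annotation starts, or ends where it ends) comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Learning Framework API.
final class BoundaryFeatures {
    static final String START = "START";
    static final String STOP = "STOP";

    /**
     * Returns the extra feature names for one element, given its offsets
     * and the offsets of the enclosing "within" annotation.
     * The prefix (e.g. "null|L-2") is treated as an opaque string here.
     */
    static List<String> boundaryFeatures(String prefix,
                                         long elStart, long elEnd,
                                         long withinStart, long withinEnd) {
        List<String> names = new ArrayList<>();
        if (elStart == withinStart) {
            names.add(prefix + "|" + START); // element starts where the within annotation starts
        }
        if (elEnd == withinEnd) {
            names.add(prefix + "|" + STOP);  // element ends where the within annotation ends
        }
        return names;
    }

    public static void main(String[] args) {
        // Assumed example: a within annotation covering offsets 0..20
        // with three elements at 0..5, 6..12 and 13..20.
        System.out.println(boundaryFeatures("null|L-2", 0, 5, 0, 20));
        System.out.println(boundaryFeatures("null|L-2", 6, 12, 0, 20));
        System.out.println(boundaryFeatures("null|L-2", 13, 20, 0, 20));
    }
}
```

An element that covers the whole within annotation would get both features under this sketch.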

johann-petrak • Aug 08 '17 13:08

Note that START/STOP features on instances are different from START/STOP elements for sequence tagging: for sequence tagging, the START/STOP symbols must be separate instances in the instance list, so that the probabilities for moving from START to the first element and from the last element to STOP can be calculated. We still have to check whether the CRF implementation of Mallet already does this correctly anyway.
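
To make the difference concrete, here is a small sketch of why separate START/STOP elements are needed for the transitions. It only counts label transitions in padded sequences and says nothing about what Mallet's CRF does internally; the class, the method names and the "A->B" key format are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, unrelated to Mallet's actual CRF implementation.
final class PaddedTransitions {
    static final String START = "START";
    static final String STOP = "STOP";

    /** Wraps a label sequence in explicit START and STOP elements. */
    static List<String> pad(List<String> labels) {
        List<String> padded = new ArrayList<>(labels.size() + 2);
        padded.add(START);
        padded.addAll(labels);
        padded.add(STOP);
        return padded;
    }

    /** Counts transitions "from->to" over all padded sequences. */
    static Map<String, Integer> countTransitions(List<List<String>> sequences) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> labels : sequences) {
            List<String> padded = pad(labels);
            for (int i = 0; i + 1 < padded.size(); i++) {
                String key = padded.get(i) + "->" + padded.get(i + 1);
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> data = Arrays.asList(
                Arrays.asList("B", "I", "O"),
                Arrays.asList("O", "B"));
        // Without padding, the START->... and ...->STOP transitions would never be observed.
        System.out.println(countTransitions(data));
    }
}
```

A START/STOP feature on the first or last instance carries related information, but it does not create these extra transitions.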

johann-petrak • Aug 08 '17 13:08

Current START/STOP feature names cannot be mapped back to any feature specification when trying to export to ARFF, for example. Of course not, there is none. We need to return a dummy feature specification for "invented" features. The method FeatureExtraction.lookupAttributeForFeatureName needs to know about the features created by the LF itself.

johann-petrak • Aug 08 '17 20:08

The ARFF problem has been fixed for now by handling START/STOP features separately: when the reverse lookup of the specification returns nothing for such a feature, we just use the default numeric attribute, which works.
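
Roughly, the workaround has the following shape. This is a sketch under assumptions, not the code that was committed: the map stands in for the real reverse lookup, and the enum, the class and the suffix check are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the reverse lookup; not the Learning Framework API.
final class BoundaryAwareLookup {
    enum AttrKind { NUMERIC, NOMINAL }

    /** Assumption: boundary features are recognisable by their name suffix. */
    static boolean isBoundaryFeature(String featureName) {
        return featureName.endsWith("|START") || featureName.endsWith("|STOP");
    }

    /**
     * Looks up the attribute kind for a feature name. If no specification is
     * found and the feature is a START/STOP feature, fall back to NUMERIC.
     */
    static AttrKind kindFor(String featureName, Map<String, AttrKind> specs) {
        AttrKind kind = specs.get(featureName);
        if (kind != null) {
            return kind;
        }
        if (isBoundaryFeature(featureName)) {
            return AttrKind.NUMERIC; // default for features invented by the LF itself
        }
        throw new IllegalArgumentException("No specification for feature: " + featureName);
    }

    public static void main(String[] args) {
        Map<String, AttrKind> specs = new HashMap<>();
        specs.put("null|L-2|string", AttrKind.NOMINAL);
        System.out.println(kindFor("null|L-2|string", specs));
        System.out.println(kindFor("null|L-2|STOP", specs));
    }
}
```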

johann-petrak • Aug 10 '17 09:08

Closing this for now on the assumption that at least START is handled correctly internally in the Mallet CRF. Reopen or open another bug if we find out that this is not the case.

johann-petrak • Aug 10 '17 09:08

Currently the names of the STOP/START features are independent of the actual features from which they are generated. This means that the same feature can get created from several different attribute specifications (which also triggers a warning each time, saying that the feature has already been set). The current schema of the feature name is null|L-2|STOP. We should probably change this to something like null|L-2|origFeatureName|||STOP, where ||| stands for some separator that is different from the one used for value separation.
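
A possible shape for the new naming scheme, as a sketch only: the class and method names are hypothetical, the prefix is treated as an opaque string, and ||| is just the placeholder separator from the proposal above, not a decided value.

```java
// Hypothetical sketch of the proposed naming scheme; nothing here is implemented in the LF.
final class BoundaryFeatureNames {
    static final String NAME_SEP = "|";       // separator seen in the current names
    static final String BOUNDARY_SEP = "|||"; // placeholder separator from the proposal

    /** Builds e.g. null|L-2|origFeatureName|||STOP from its parts. */
    static String boundaryName(String prefix, String origFeatureName, String marker) {
        return prefix + NAME_SEP + origFeatureName + BOUNDARY_SEP + marker;
    }

    /** Returns the marker after the last boundary separator, or null if there is none. */
    static String markerOf(String featureName) {
        int idx = featureName.lastIndexOf(BOUNDARY_SEP);
        return idx < 0 ? null : featureName.substring(idx + BOUNDARY_SEP.length());
    }

    public static void main(String[] args) {
        // Two different source features no longer collapse into the same name.
        System.out.println(boundaryName("null|L-2", "string", "STOP"));
        System.out.println(boundaryName("null|L-2", "category", "STOP"));
        System.out.println(markerOf("null|L-2|string|||STOP"));
    }
}
```

With names built this way, each attribute specification would produce its own STOP/START feature, so the "already set" warning should no longer be triggered by this case.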

johann-petrak • Aug 10 '17 20:08