gateplugin-LearningFramework
Rethink API for sequence encoder and implement a few more
Currently the sequence encoding is really done by the feature extractor and the sequence encoder jointly, and the sequence encoder only sees the class annotations for each instance separately.
We should move the full functionality into the sequence encoder and make sure that every encoding strategy we want to support gets all the data it needs. That may include the class annotations from the previous instance, the labels generated for the previous instance, or even a completely different approach.
To figure this out, start implementing a number of commonly used sequence encoding strategies and think about how to deal with overlapping class annotations for the same or different classes.
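As a rough illustration of the idea above, here is a minimal Python sketch of an encoder that receives the whole sequence instead of one instance at a time. All names (`Instance`, `SequenceEncoder`, `BeginInsideEncoder`) are hypothetical and not part of the plugin's API; the sketch also assumes each instance carries a list of class names and simply uses the first one when several overlap.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Protocol, Sequence


@dataclass
class Instance:
    """Hypothetical stand-in for one instance (token) of a sequence."""
    token: str
    # Names of the class annotations covering this token (may be empty).
    classes: List[str] = field(default_factory=list)


class SequenceEncoder(Protocol):
    """Hypothetical interface: the encoder sees the full sequence."""

    def encode(self, instances: Sequence[Instance]) -> List[str]:
        ...


class BeginInsideEncoder:
    """Tags a token B-<class> if the previous instance did not carry
    the same class, I-<class> if it did, and O if it has no class."""

    def encode(self, instances: Sequence[Instance]) -> List[str]:
        tags: List[str] = []
        prev: Optional[Instance] = None
        for inst in instances:
            if not inst.classes:
                tags.append("O")
            else:
                # Assumption: with overlapping classes, use the first one.
                cls = inst.classes[0]
                starts = prev is None or cls not in prev.classes
                tags.append(("B-" if starts else "I-") + cls)
            prev = inst
        return tags


sentence = [
    Instance("John", ["PER"]),
    Instance("Smith", ["PER"]),
    Instance("visited"),
    Instance("Paris", ["LOC"]),
]
print(BeginInsideEncoder().encode(sentence))
```

Note that per-token class lists alone cannot tell two directly adjacent entities of the same class apart, which is one reason the encoder may need access to the annotations themselves rather than per-instance class information.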
Schemes to consider implementing:
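- BIO-2 (IOB2), or just "BIO": B marks the first token of every entity, I the remaining tokens
- BIO-1 (IOB1): I think this is the IO-like encoding where B is only used if the previous token belongs to another entity of the same class; otherwise I is used for the first token as well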
- BMEWO: E=end of entity, M=mid of entity, W=single token
- BMEWO+: (see http://alias-i.com/lingpipe/docs/api/com/aliasi/chunk/HmmChunker.html)
- BILOU: L=last, U=unit/single
- BIOES: I think this is identical to BILOU, E=End, S=Single, so L=E, S=U
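To make the differences between these schemes concrete, the following Python sketch turns non-overlapping entity spans into tags. It is only an illustration under several assumptions: the function name `encode_spans`, the span representation, and the `SCHEMES` table are all made up here; "BIO" is taken to mean B on the first token of every entity; BMEWO+ is left out; and overlapping spans are rejected rather than handled.

```python
from typing import Dict, List, Tuple

# Hypothetical span representation: (start token, end token exclusive, class).
Span = Tuple[int, int, str]

# Tag prefix used for each position inside an entity, per scheme.
SCHEMES: Dict[str, Dict[str, str]] = {
    "BIO":   {"begin": "B", "inside": "I", "last": "I", "single": "B"},
    "BILOU": {"begin": "B", "inside": "I", "last": "L", "single": "U"},
    "BIOES": {"begin": "B", "inside": "I", "last": "E", "single": "S"},
    "BMEWO": {"begin": "B", "inside": "M", "last": "E", "single": "W"},
}


def encode_spans(n_tokens: int, spans: List[Span], scheme: str = "BIO") -> List[str]:
    """Return one tag per token for the given non-overlapping spans."""
    prefixes = SCHEMES[scheme]
    tags = ["O"] * n_tokens
    for start, end, cls in sorted(spans):
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"invalid span: {(start, end, cls)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError("overlapping spans are not supported by this sketch")
        for i in range(start, end):
            if end - start == 1:
                role = "single"
            elif i == start:
                role = "begin"
            elif i == end - 1:
                role = "last"
            else:
                role = "inside"
            tags[i] = f"{prefixes[role]}-{cls}"
    return tags


# Tokens: John Smith visited New York City
spans = [(0, 2, "PER"), (3, 6, "LOC")]
for name in SCHEMES:
    print(name, encode_spans(6, spans, name))
```

With a table like this, BILOU and BIOES differ only in the letters used, which matches the observation above that the two are the same encoding.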
Ratinov and Roth (2009), "Design challenges and misconceptions in named entity recognition", CoNLL, has an evaluation of different approaches.
However, we should also check which scheme is best suited for encoding overlapping entity annotations, whether the overlap is arbitrary or constrained. Check out approaches and evaluations on the Genia corpus!