python-gatenlp
Implement conll-like IBO and other chunk formats
File format conversion: conll to our annotations:
- basic conll 2003 format: token / whitespace / IBO-code
- IBO-code is something like I-PER, B-LOC, O
- use heuristics/rules for whitespace insertion
  - insert a space between tokens, except before punctuation, after opening parentheses, and before closing parentheses
- also support other codes:
  - BIOSE (S=start/E=end)
  - BIOES/BILOU etc with S=single, E=ending, L=last
- support multiple code columns or multiple comma-separated codes per token
- check out alternate formats where multiple chunks of the same type can overlap (cf. the GENIA corpus)
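A rough sketch of the import direction described above, in plain Python without any gatenlp classes. The function names `read_conll`, `detokenize` and `decode_chunks`, and the exact punctuation sets, are assumptions made for illustration only; the code takes the first column as the token and the last column as the IBO code.

```python
# Hypothetical helpers, not part of gatenlp: a sketch of conll -> text + chunks.
NO_SPACE_BEFORE = set(".,;:!?)]}")  # punctuation and closing parentheses
NO_SPACE_AFTER = set("([{")         # opening parentheses


def read_conll(lines):
    """Yield sentences as lists of (token, code); blank lines separate sentences."""
    sentence = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split()
        # first column is the token, last column is the IBO code
        sentence.append((fields[0], fields[-1]))
    if sentence:
        yield sentence


def detokenize(tokens):
    """Return (text, offsets); offsets[i] is the (start, end) span of tokens[i]."""
    text, offsets, prev = "", [], None
    for tok in tokens:
        if prev is not None and tok not in NO_SPACE_BEFORE and prev not in NO_SPACE_AFTER:
            text += " "
        start = len(text)
        text += tok
        offsets.append((start, len(text)))
        prev = tok
    return text, offsets


def decode_chunks(codes):
    """Turn IBO codes into (first_token, end_token_exclusive, type) triples."""
    chunks, start, ctype = [], None, None
    for i, code in enumerate(codes):
        prefix, _, label = code.partition("-")
        # a new chunk starts at B-, or at I- when the type changes
        starts_new = prefix == "B" or (prefix == "I" and label != ctype)
        if start is not None and (prefix == "O" or starts_new):
            chunks.append((start, i, ctype))
            start, ctype = None, None
        if prefix in ("B", "I") and start is None:
            start, ctype = i, label
    if start is not None:
        chunks.append((start, len(codes), ctype))
    return chunks


sample = ["John I-PER", "Smith I-PER", "( O", "London B-LOC", ") O", "left O", ". O"]
sentence = next(read_conll(sample))
text, offsets = detokenize([tok for tok, _ in sentence])
chunks = decode_chunks([code for _, code in sentence])
# map token-index chunks to character spans in the reconstructed text
spans = [(offsets[s][0], offsets[e - 1][1], t) for s, e, t in chunks]
```

With this sample, `text` comes out as `John Smith (London) left.` and `spans` covers `John Smith` and `London`. Multi-character tokens such as `...` are not in the punctuation sets here, so a real implementation would probably need richer rules.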
See also: https://lingpipe-blog.com/2009/10/14/coding-chunkers-as-taggers-io-bio-bmewo-and-bmewo/
See also: https://github.com/GateNLP/corpusconversion-conll2003
Conversion from document annotations to IBO and back:
- initially only support token-aligned, non-overlapping annotations
- running ann2ibo sets a feature on each token with the ibo label(s) or sets several features, one for each type
- do this in two steps:
  1. create a list of labels (or a list of label lists, or several lists of labels) corresponding to a list of token annotations
  2. apply the labels from the list(s) to the annotations (optional: if this is used e.g. for training, only the first step is needed)
- running ibo2ann takes a list or lists of labels and applies them to a list of annotations
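The two directions could look roughly like the sketch below. Only the names `ann2ibo` and `ibo2ann` come from this issue; the `Ann` stand-in class, the `apply_labels` helper, the `"ibo"` feature name and all signatures are assumptions, and no gatenlp API is used. It handles a single label per token and always emits `B-` at the start of a chunk.

```python
from dataclasses import dataclass, field


@dataclass
class Ann:
    """Hypothetical stand-in for an annotation: offsets, type, features."""
    start: int
    end: int
    type: str
    features: dict = field(default_factory=dict)


def ann2ibo(tokens, chunks):
    """Step 1: one label per token, for token-aligned, non-overlapping chunks."""
    labels = ["O"] * len(tokens)
    for chunk in chunks:
        covered = [i for i, t in enumerate(tokens)
                   if t.start >= chunk.start and t.end <= chunk.end]
        for n, i in enumerate(covered):
            labels[i] = ("B-" if n == 0 else "I-") + chunk.type
    return labels


def apply_labels(tokens, labels, feature="ibo"):
    """Step 2 (optional): store each label as a feature on its token."""
    for tok, label in zip(tokens, labels):
        tok.features[feature] = label


def ibo2ann(tokens, labels):
    """Build chunk annotations from a label list and the token offsets."""
    chunks, current = [], None
    for tok, label in zip(tokens, labels):
        prefix, _, ctype = label.partition("-")
        if prefix == "O":
            current = None
        elif prefix == "B" or current is None or current.type != ctype:
            # start a new chunk at B-, or at I- with no open chunk of that type
            current = Ann(tok.start, tok.end, ctype)
            chunks.append(current)
        else:
            current.end = tok.end  # extend the open chunk
    return chunks


# Tokens for the text "John Smith likes London"
tokens = [Ann(0, 4, "Token"), Ann(5, 10, "Token"), Ann(11, 16, "Token"), Ann(17, 23, "Token")]
entities = [Ann(0, 10, "PER"), Ann(17, 23, "LOC")]

labels = ann2ibo(tokens, entities)   # step 1 only, e.g. for training
apply_labels(tokens, labels)         # step 2, optional
restored = ibo2ann(tokens, labels)   # round trip back to annotations
```

Keeping step 1 free of side effects means the label list can be fed to a trainer directly, while step 2 is a separate call for the case where the labels should end up on the tokens.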