python-gatenlp

Implement conll-like IBO and other chunk formats

Open johann-petrak opened this issue 4 years ago • 1 comments

File format conversion: conll to our annotations:

  • basic CoNLL-2003 format: token / whitespace / IBO code
    • the IBO code (more commonly written IOB or BIO) is a tag such as I-PER, B-LOC, or O
  • use heuristics/rules for whitespace insertion
    • insert a space between tokens, except before punctuation, after opening parentheses, and before closing parentheses
  • also support other codes:
    • BIOSE (S=start/E=end)
    • BIOES/BILOU etc with S=single, E=ending, L=last
  • support multiple code columns or multiple comma-separated codes per token
  • check out alternate formats where multiple chunks of the same type can overlap (cf. the GENIA corpus)
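
A minimal sketch of the basic case described above, in plain Python. It is not the gatenlp API: the function names `parse_conll`, `build_text`, and `decode_chunks` are hypothetical, the token is assumed to be the first column and the IBO code the last, and the sample sentence is made up to exercise the parenthesis rule.

```python
# Hypothetical helpers, not part of gatenlp.
OPENING = {"(", "[", "{"}
NO_SPACE_BEFORE = {".", ",", ";", ":", "!", "?", ")", "]", "}"}


def parse_conll(lines):
    """Return (token, code) pairs: first column is the token, last the code."""
    pairs = []
    for line in lines:
        cols = line.split()
        if not cols or cols[0] == "-DOCSTART-":
            continue  # skip sentence separators and document markers
        pairs.append((cols[0], cols[-1]))
    return pairs


def build_text(tokens):
    """Join tokens using the whitespace heuristic; also return token offsets."""
    text, offsets, prev = "", [], None
    for tok in tokens:
        if prev is not None and tok not in NO_SPACE_BEFORE and prev not in OPENING:
            text += " "
        offsets.append((len(text), len(text) + len(tok)))
        text += tok
        prev = tok
    return text, offsets


def decode_chunks(codes):
    """Turn I-/B-/O codes into (type, first_token, end_token_exclusive)."""
    chunks, start, ctype = [], None, None
    for i, code in enumerate(list(codes) + ["O"]):  # sentinel closes last chunk
        prefix, _, label = code.partition("-")
        # An open chunk ends on O, on B-, or when the type changes.
        if start is not None and (prefix in ("O", "B") or label != ctype):
            chunks.append((ctype, start, i))
            start, ctype = None, None
        if prefix in ("B", "I") and start is None:
            start, ctype = i, label
    return chunks


sample = [
    "Ekeus NNP I-NP I-PER",
    "( ( O O",
    "U.N. NNP I-NP I-ORG",
    ") ) O O",
    "heads VBZ I-VP O",
    "for IN I-PP O",
    "Baghdad NNP I-NP I-LOC",
    ". . O O",
]
pairs = parse_conll(sample)
text, offsets = build_text([tok for tok, _ in pairs])
print(text)                                      # Ekeus (U.N.) heads for Baghdad.
print(decode_chunks([code for _, code in pairs]))
# [('PER', 0, 1), ('ORG', 2, 3), ('LOC', 6, 7)]
```

The token offsets returned by `build_text` are what a converter would need to turn each chunk into a character-offset annotation. The punctuation and bracket sets are only a starting point; quotes and language-specific rules would need more care.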

See also: https://lingpipe-blog.com/2009/10/14/coding-chunkers-as-taggers-io-bio-bmewo-and-bmewo/

See also: https://github.com/GateNLP/corpusconversion-conll2003

johann-petrak Feb 21 '21 13:02

Conversion from document annotations to IBO and back:

  • initially support only token-aligned, non-overlapping annotations
  • running ann2ibo sets a feature on each token with the IBO label(s), or sets several features, one per annotation type
    • do this in two steps: (1) create a list of labels, a list of label lists, or several lists of labels, each corresponding to the list of token annotations; (2) apply the labels from the list(s) to the annotations (optional: if we use this e.g. for training, only the first step is needed)
  • running ibo2ann takes a list or lists of labels and applies them to a list of annotations
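
The two-step design can be sketched as follows, assuming plain tuples and dicts as stand-ins for gatenlp annotations: tokens are `(start, end)` offsets, chunks are `(start, end, type)`, and token features are ordinary dicts. The names `anns_to_labels`, `apply_labels`, and `labels_to_anns` are hypothetical and only illustrate what ann2ibo and ibo2ann might do for one label per token.

```python
# Hypothetical helpers, not part of gatenlp.
def anns_to_labels(tokens, chunks, scheme="BIO"):
    """Step 1: one label per token for token-aligned, non-overlapping chunks."""
    labels = ["O"] * len(tokens)
    for c_start, c_end, c_type in chunks:
        covered = [i for i, (s, e) in enumerate(tokens)
                   if s >= c_start and e <= c_end]
        for pos, i in enumerate(covered):
            first, last = pos == 0, pos == len(covered) - 1
            if scheme == "BIOES" and first and last:
                prefix = "S"  # single-token chunk
            elif scheme == "BIOES" and last:
                prefix = "E"  # last token of a longer chunk
            else:
                prefix = "B" if first else "I"
            labels[i] = f"{prefix}-{c_type}"
    return labels


def apply_labels(token_features, labels, feature="ibo"):
    """Step 2 (optional): store each label as a feature of its token."""
    for feats, label in zip(token_features, labels):
        feats[feature] = label


def labels_to_anns(tokens, labels):
    """Reverse direction: rebuild (start, end, type) from BIO or BIOES labels."""
    chunks, current = [], None  # current = [start, end, type]
    for (tok_start, tok_end), label in zip(tokens, labels):
        prefix, _, name = label.partition("-")
        if current is not None and prefix in ("I", "E") and name == current[2]:
            current[1] = tok_end  # extend the open chunk
        else:
            if current is not None:
                chunks.append(tuple(current))
            current = [tok_start, tok_end, name] if prefix != "O" else None
        if current is not None and prefix in ("E", "S"):
            chunks.append(tuple(current))  # E and S close the chunk at once
            current = None
    if current is not None:
        chunks.append(tuple(current))
    return chunks


# "John Smith visited New York."
tokens = [(0, 4), (5, 10), (11, 18), (19, 22), (23, 27), (27, 28)]
chunks = [(0, 10, "PER"), (19, 27, "LOC")]

labels = anns_to_labels(tokens, chunks)
print(labels)  # ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'O']
print(anns_to_labels(tokens, chunks, scheme="BIOES"))
# ['B-PER', 'E-PER', 'O', 'B-LOC', 'E-LOC', 'O']

features = [{} for _ in tokens]
apply_labels(features, labels)
print(labels_to_anns(tokens, labels) == chunks)  # True
```

Keeping step 1 separate from step 2 means the label list can be fed straight into training without touching the document. Several features per token, one per type, would amount to calling `anns_to_labels` once per type on a filtered chunk list.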

johann-petrak Jun 22 '22 17:06