Validate better during import
Is your feature request related to a problem? Please describe.
It is relatively easy to import data which breaks assumptions made in the code (e.g. during rendering).
Describe the solution you'd like
CAS Doctor should be applied during import with all default rules.
Additional rules may have to be added for the following constraints:
- A sentence needs to begin at the same position as the first token in the sentence.
- A sentence needs to end at the same position as the last token in the sentence.
- Sentences and tokens must not start or end with whitespace.
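The three proposed rules can be illustrated with a small standalone sketch. This is not CAS Doctor code: the `Span` type and the `check_segmentation` function are hypothetical names used only for illustration, and offsets are assumed to be character offsets with an exclusive end.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    begin: int  # inclusive character offset
    end: int    # exclusive character offset


def check_segmentation(text: str, sentences: List[Span], tokens: List[Span]) -> List[str]:
    """Return a list of problems; an empty list means all three rules hold."""
    problems = []

    # Sentences and tokens must not start or end with whitespace.
    for kind, spans in (("sentence", sentences), ("token", tokens)):
        for span in spans:
            covered = text[span.begin:span.end]
            if covered != covered.strip():
                problems.append(f"{kind} [{span.begin}-{span.end}] starts or ends with whitespace")

    for sentence in sentences:
        # Tokens fully contained in this sentence.
        inside = [t for t in tokens if sentence.begin <= t.begin and t.end <= sentence.end]
        if not inside:
            continue  # nothing to compare against
        first = min(inside, key=lambda t: t.begin)
        last = max(inside, key=lambda t: t.end)
        # A sentence must begin where its first token begins.
        if first.begin != sentence.begin:
            problems.append(f"sentence [{sentence.begin}-{sentence.end}] does not begin at its first token")
        # A sentence must end where its last token ends.
        if last.end != sentence.end:
            problems.append(f"sentence [{sentence.begin}-{sentence.end}] does not end at its last token")

    return problems
```

For the text `" Hello world."` with a sentence covering the leading space, the sketch reports both a whitespace problem and a misaligned sentence begin. That is the kind of input described in #2060.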
Additional context
See also: Index out of range using WebAnno 3.2 with leading space #2060
Does this apply when using character-based annotation? From my experiments, if all your span layers are character-based (or sentence-based?), you don't need tokens in an XMI file. Any validation that requires sentences to align with tokens would therefore reject files that currently appear to be valid. Or should I be creating character tokens?
I want to do an annotation run in character mode to extract all the necessary spans, so that I can test whether my tokenisation is valid (i.e. that none of my annotation boundaries fall inside tokens). I hope to use this as input to a Unigram model to generate a tokeniser and then use that tokeniser for full annotation.
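To make the check concrete, here is a minimal sketch of what I mean by "no annotation boundary falls inside a token". The function name `boundaries_inside_tokens` is made up for illustration, and offsets are assumed to be `(begin, end)` character offsets with an exclusive end, as exported from the character-mode annotations.

```python
from typing import List, Tuple

Offsets = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def boundaries_inside_tokens(
    annotations: List[Offsets], tokens: List[Offsets]
) -> List[Tuple[Offsets, int]]:
    """Return (annotation, offset) pairs where a boundary falls strictly inside a token."""
    violations = []
    for annotation in annotations:
        for offset in annotation:
            # A boundary exactly on a token edge is fine; only the interior counts.
            if any(t_begin < offset < t_end for t_begin, t_end in tokens):
                violations.append((annotation, offset))
    return violations


# "Hello world" tokenised as "Hello" (0-5) and "world" (6-11)
tokens = [(0, 5), (6, 11)]
annotations = [(0, 5), (3, 11)]
print(boundaries_inside_tokens(annotations, tokens))  # [((3, 11), 3)]
```

An empty result would mean the candidate tokenisation is compatible with every annotated span.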
During import, any missing tokens or sentences are created automatically.
Note that at the moment, if the imported data contains tokens but no sentences, sentences are created, and these may not align with the existing token boundaries. That needs fixing as well.
If sentences exist, tokens are created inside the sentences.
If neither tokens nor sentences exist, first sentences are created, then tokens.
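The order of these fallbacks can be sketched roughly as follows. This only illustrates the logic described above and is not the actual import code: `ensure_segmentation` is a hypothetical name, and the naive splitters (one sentence per line, whitespace-separated tokens) are stand-ins for whatever segmenter is really used.

```python
import re
from typing import List, Tuple

Offsets = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def naive_sentences(text: str) -> List[Offsets]:
    """Stand-in splitter: one sentence per non-blank line, trimmed of whitespace."""
    sentences = []
    for match in re.finditer(r"[^\n]+", text):
        line = match.group()
        begin = match.start() + (len(line) - len(line.lstrip()))
        end = match.end() - (len(line) - len(line.rstrip()))
        if begin < end:
            sentences.append((begin, end))
    return sentences


def naive_tokens(text: str, sentence: Offsets) -> List[Offsets]:
    """Stand-in splitter: runs of non-whitespace inside one sentence."""
    begin, end = sentence
    return [(begin + m.start(), begin + m.end()) for m in re.finditer(r"\S+", text[begin:end])]


def ensure_segmentation(
    text: str, sentences: List[Offsets], tokens: List[Offsets]
) -> Tuple[List[Offsets], List[Offsets]]:
    # Missing sentences are created first. If tokens already exist, nothing here
    # forces the new sentences to align with them (the known problem noted above).
    if not sentences:
        sentences = naive_sentences(text)
    # Missing tokens are then created inside the (existing or new) sentences.
    if not tokens:
        tokens = [tok for s in sentences for tok in naive_tokens(text, s)]
    return sentences, tokens
```

Because the tokens are derived from the sentences in the last step, they cannot cross sentence boundaries. The reverse does not hold when only tokens were imported.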
If you set a layer to "character" granularity, you can anchor the annotations anywhere in the text. In the brat view, it is not possible to annotate only the space between tokens. The begin / end of an annotation in brat mode must always be at the edge of or within a token - same for WebAnno TSV. Other visualization modes and XMI may not be subject to such restrictions - but it is a good idea to remain compatible with WebAnno TSV and the brat view.
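If you want to pre-check your own data against this restriction, something along these lines could work. It is only a sketch of one reading of the rule for non-empty spans, not what brat or the TSV writer actually executes, and `anchorable_on_tokens` is a hypothetical name.

```python
from typing import List, Tuple

Offsets = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def anchorable_on_tokens(annotation: Offsets, tokens: List[Offsets]) -> bool:
    """True if both boundaries touch a token and the span covers some token text."""
    begin, end = annotation

    def touches_token(offset: int) -> bool:
        # On the edge of a token or inside it.
        return any(t_begin <= offset <= t_end for t_begin, t_end in tokens)

    # Reject spans that cover only the gap between two tokens.
    covers_token_text = any(begin < t_end and t_begin < end for t_begin, t_end in tokens)
    return touches_token(begin) and touches_token(end) and covers_token_text


# "Hello world" tokenised as "Hello" (0-5) and "world" (6-11)
tokens = [(0, 5), (6, 11)]
print(anchorable_on_tokens((2, 8), tokens))  # True: both boundaries lie within tokens
print(anchorable_on_tokens((5, 6), tokens))  # False: covers only the space
```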