jigg
Consistent error handling
Here is a proposal for how to keep track of errors in the output XML when errors are detected.
Example:
<chunks annotators="cabocha" errors="cabocha">
<error by="cabocha">error message</error>
</chunks>
That is, an error message is wrapped in an <error> element, which records the annotator that caused the error.
This design may handle the situation where multiple annotators annotate the same XML element and only one of them fails in annotation:
<tokens annotators="ssplit tokenize pos" errors="pos">
<token id="0" offsetBegin="0" offsetEnd="1">I</token>
...
<error by="pos">error message</error>
</tokens>
The errors attribute on each element may be redundant, but it seems useful for checking for errors. I'm not sure.
When an error is detected at a higher level in the pipeline (e.g., tokenize), it seems natural that the lower-level annotators (e.g., pos) annotate nothing and just ignore that sentence (or the document, if it contains sentences with errors).
Or should the output keep an <error> tag for each annotator? This seems somewhat redundant.
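The first option (stop at the first failure and let lower-level annotators do nothing) can be sketched as follows. This is a minimal Python illustration only, not jigg's implementation; the `AnnotationError` class, the `run_pipeline` helper, and the `(name, function)` pipeline shape are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


class AnnotationError(Exception):
    """Hypothetical stand-in for an error raised by an annotator."""


def run_pipeline(sentence, annotators):
    """Run (name, function) pairs in order and stop at the first failure.

    Each function is assumed to mutate the sentence element in place.
    """
    for name, annotate in annotators:
        try:
            annotate(sentence)
        except AnnotationError as e:
            # Record the failure once, then skip all later annotators.
            error = ET.SubElement(sentence, "error", {"by": name})
            error.text = str(e)
            break
    return sentence


def tokenize(sentence):
    # Simulate a failure at a higher level of the pipeline.
    raise AnnotationError("error message")


def pos(sentence):
    # Only reached if every earlier annotator succeeded.
    ET.SubElement(sentence, "pos-ran")


sentence = ET.Element("sentence", {"id": "s0"})
run_pipeline(sentence, [("tokenize", tokenize), ("pos", pos)])
print(ET.tostring(sentence, encoding="unicode"))
```

With this behavior the sentence carries a single <error> from tokenize and nothing from pos, which avoids the redundancy of one <error> per annotator.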
One problem with this approach is that, e.g., <tokens> then has children other than <token>.
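To make both points concrete, here is a small Python sketch of how a consumer might read the first proposal's output. It only illustrates the XML layout shown above (with the elided "..." line dropped); the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

xml_text = """
<tokens annotators="ssplit tokenize pos" errors="pos">
  <token id="0" offsetBegin="0" offsetEnd="1">I</token>
  <error by="pos">error message</error>
</tokens>
"""

tokens = ET.fromstring(xml_text)

# The errors attribute lists the failed annotators, so a consumer can
# detect a failure without scanning the children.
failed = tokens.get("errors", "").split()

# Consumers have to select <token> explicitly; iterating over all
# children would also yield the <error> element.
token_forms = [t.text for t in tokens.findall("token")]
all_child_tags = [child.tag for child in tokens]

print(failed)
print(token_forms)
print(all_child_tags)
```

Any code that assumes every child of <tokens> is a <token> would have to be changed to filter by tag, which is the cost of this layout.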
Here is another proposal:
<sentence id="s0">
<tokens annotators="ssplit tokenize pos" errors="e0">
...
</tokens>
<errors>
<error id="e0" by="pos">...</error>
</errors>
</sentence>
Another merit of this approach is that the same error message can be referenced from different elements, e.g., the chunks, dependencies, etc. produced by knp.
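A rough Python sketch of how such id references could be resolved is shown below. The <chunks> and <dependencies> elements carrying errors="e0" are an assumed layout for illustration, and `errors_for` is a hypothetical helper, not part of jigg.

```python
import xml.etree.ElementTree as ET

xml_text = """
<sentence id="s0">
  <chunks annotators="knp" errors="e0"/>
  <dependencies annotators="knp" errors="e0"/>
  <errors>
    <error id="e0" by="knp">error message</error>
  </errors>
</sentence>
"""

sentence = ET.fromstring(xml_text)

# Index every <error> by its id so that references can be resolved.
errors_by_id = {e.get("id"): e for e in sentence.iter("error")}


def errors_for(element):
    """Return the <error> elements referenced by an element's errors attribute."""
    ids = element.get("errors", "").split()
    return [errors_by_id[i] for i in ids if i in errors_by_id]


for child in sentence:
    for err in errors_for(child):
        print(child.tag, err.get("by"), err.text)
```

Both <chunks> and <dependencies> resolve to the same <error> element, so the message is stored only once.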
This is the final design now accepted in 038c85007174a68e49a188b53469a3876ed01bca.
<sentence id="s0">
<tokens .../>
<error annotator="knp">...</error>
</sentence>
We record neither an error id nor links between <error> and the elements on which the error occurs.
Basically, each annotator is agnostic about the <error> tag; it is SentenceAnnotator or DocumentAnnotator that adds <error> to a problematic sentence or document.
In the current implementation, only an AnnotationError thrown in an annotator is caught and converted to an <error> tag. Should this be changed to catch all errors during annotation?
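The catch-and-convert step can be sketched in Python as below. This is not jigg's code (jigg is written in Scala); the `AnnotationError` class here is a Python stand-in, and `annotate_sentence` is a hypothetical wrapper playing the role described for SentenceAnnotator.

```python
import xml.etree.ElementTree as ET


class AnnotationError(Exception):
    """Hypothetical Python stand-in for the AnnotationError named above."""


def annotate_sentence(sentence, annotator_name, new_annotation):
    """Apply an annotator to one sentence, converting AnnotationError to <error>.

    Only AnnotationError is caught; any other exception propagates, which
    mirrors the current behavior described in the text.
    """
    try:
        new_annotation(sentence)
    except AnnotationError as e:
        error = ET.SubElement(sentence, "error", {"annotator": annotator_name})
        error.text = str(e)
    return sentence


def failing_knp(sentence):
    # Simulated annotator failure of the kind that gets recorded.
    raise AnnotationError("error message")


def crashing(sentence):
    # A different error type, which the wrapper does not catch.
    raise ValueError("not an AnnotationError")


sentence = ET.Element("sentence", {"id": "s0"})
ET.SubElement(sentence, "tokens")
annotate_sentence(sentence, "knp", failing_knp)
print(ET.tostring(sentence, encoding="unicode"))
```

Catching all errors would amount to widening the except clause, at the risk of hiding genuine bugs behind <error> tags.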
Here is a concrete example, which occurs when * is given to knp and juman does not convert half-width characters (-juman.normalize false).
<root>
<document id="d0">
<sentences>
<sentence id="s0">
*
<tokens annotators="juman" normalized="false">
<token id="s0_tok0" form="*" characterOffsetBegin="0" characterOffsetEnd="1" yomi="*" lemma="*" pos="未定義語" posId="15" pos1="その他" pos1Id="1" cType="*" cTypeId="0" cForm="*" cFormId="0" misc="NIL"/>
</tokens>
<error annotator="knp">jigg.pipeline.ProcessError: ;; Invalid input <* * * 未定義語 15 その他 1 * 0 * 0 NIL > ! # S-ID:2 KNP:4.12-CF1.1 DATE:2016/03/16 SCORE:0.00000 ERROR:Cannot make mrph EOS</error>
</sentence>
</sentences>
</document>
</root>
The error message of KNP is recorded in the text of <error>.
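Since the final design attaches <error> directly to the sentence, a consumer can list failures with a simple scan. The following Python sketch assumes a reduced version of the output above: the tokens are emptied, the error message is shortened to "...", and the second sentence s1 is added purely for illustration.

```python
import xml.etree.ElementTree as ET

xml_text = """
<root>
  <document id="d0">
    <sentences>
      <sentence id="s0">
        <tokens annotators="juman" normalized="false"/>
        <error annotator="knp">jigg.pipeline.ProcessError: ...</error>
      </sentence>
      <sentence id="s1">
        <tokens annotators="juman" normalized="false"/>
      </sentence>
    </sentences>
  </document>
</root>
"""

root = ET.fromstring(xml_text)

# Collect (sentence id, annotator, message) for every recorded error.
failures = [
    (sentence.get("id"), error.get("annotator"), error.text)
    for sentence in root.iter("sentence")
    for error in sentence.findall("error")
]

for sentence_id, annotator, message in failures:
    print(sentence_id, annotator, message)
```

Sentences without an <error> child, such as s1 here, simply do not appear in the result.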
TODO: check whether error handling works correctly for CoreNLP. One issue is that all (sub)annotators in CoreNLP are currently DocumentAnnotator, which means that if an error (e.g., a parse error) occurs in one sentence, the analysis of the whole document probably fails. Unexpected behavior may also occur if some error (e.g., from a too-long sentence?) is handled internally by a CoreNLP annotator.