CoreNLP
invertible is incorrect for HTML tags
I am not sure if it is the intended behaviour, but it seems odd to me.
With a basic set of options
-annotators tokenize,cleanxml,ssplit,pos,lemma
parse the sentence
This is a <b>test</b> sentence.
the output is
{
"sentences": [
{
"index": 0,
"tokens": [
{
"index": 1,
"word": "This",
"originalText": "This",
"lemma": "this",
"characterOffsetBegin": 0,
"characterOffsetEnd": 4,
"pos": "DT",
"before": "",
"after": " "
},
{
"index": 2,
"word": "is",
"originalText": "is",
"lemma": "be",
"characterOffsetBegin": 5,
"characterOffsetEnd": 7,
"pos": "VBZ",
"before": " ",
"after": " "
},
{
"index": 3,
"word": "a",
"originalText": "a",
"lemma": "a",
"characterOffsetBegin": 8,
"characterOffsetEnd": 9,
"pos": "DT",
"before": " ",
"after": "  <b>"
},
{
"index": 4,
"word": "test",
"originalText": "test",
"lemma": "test",
"characterOffsetBegin": 13,
"characterOffsetEnd": 17,
"pos": "NN",
"before": " <b>",
"after": "</b>"
},
{
"index": 5,
"word": "sentence",
"originalText": "sentence",
"lemma": "sentence",
"characterOffsetBegin": 22,
"characterOffsetEnd": 30,
"pos": "NN",
"before": "</b> ",
"after": ""
},
{
"index": 6,
"word": ".",
"originalText": ".",
"lemma": ".",
"characterOffsetBegin": 30,
"characterOffsetEnd": 31,
"pos": ".",
"before": "",
"after": ""
}
]
}
],
"sections": [
]
}
For index #3, the after element is "  <b>" (two spaces followed by the tag, 5 characters). The token ends at offset 9 and the next token begins at offset 13, which means the after element should be 4 characters, not 5.
Similarly, for index #5, the before element should be 5 characters, not 4 to match the character offsets.
Tested with version 4.3.1.
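The offset arithmetic above can be checked mechanically. The following is a minimal Python sketch, not part of CoreNLP: `find_offset_mismatches` is a hypothetical helper, and it assumes that the character offsets index into the original input and that `before`/`after` are meant to hold exactly the text between adjacent tokens. The token data is copied from tokens 3 to 5 of the output above.

```python
def find_offset_mismatches(tokens):
    """Compare len(after)/len(before) with the gap between adjacent tokens.

    Returns a list of (token index, field, actual length, expected length).
    """
    mismatches = []
    for left, right in zip(tokens, tokens[1:]):
        # Number of characters between the end of one token and the start of the next.
        gap = right["characterOffsetBegin"] - left["characterOffsetEnd"]
        if len(left["after"]) != gap:
            mismatches.append((left["index"], "after", len(left["after"]), gap))
        if len(right["before"]) != gap:
            mismatches.append((right["index"], "before", len(right["before"]), gap))
    return mismatches


# Tokens 3-5 as reported in this issue (only the fields the check needs).
tokens = [
    {"index": 3, "characterOffsetBegin": 8, "characterOffsetEnd": 9,
     "before": " ", "after": " " * 2 + "<b>"},  # two spaces, then the tag
    {"index": 4, "characterOffsetBegin": 13, "characterOffsetEnd": 17,
     "before": " <b>", "after": "</b>"},
    {"index": 5, "characterOffsetBegin": 22, "characterOffsetEnd": 30,
     "before": "</b> ", "after": ""},
]

for mismatch in find_offset_mismatches(tokens):
    print(mismatch)
```

Under these assumptions the check flags only `after` fields; every `before` field is consistent with the offsets.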
I actually disagree regarding #5: the before text is "</b> ", which is 5 characters, and that is exactly what lies between test and sentence.
The AfterAnnotation is definitely wrong, though. I just opened a PR for that issue and will merge it if the tests pass.
I pointed to the wrong index. "</b> " is correct as before in index #5, but the after in index #4 is incorrect: it is "</b>" (4 characters), while the offsets 17 and 22 imply 5 characters.
https://github.com/stanfordnlp/CoreNLP/pull/1214
Thanks, that makes more sense. I can fix that too. Thanks for finding these!
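For anyone who wants to verify the behaviour on their own input, here is a rough Python sketch of the invertibility property discussed in this thread: concatenating the first token's `before` with every token's `originalText` and `after` should reproduce the input exactly. `reconstruct` and `make_tokens` are hypothetical helpers, and the "expected" `after` values are my reading of the offsets in this thread, not output taken from the fixed version.

```python
def reconstruct(tokens):
    """Rebuild the input text from originalText and after annotations."""
    if not tokens:
        return ""
    parts = [tokens[0]["before"]]
    for token in tokens:
        parts.append(token["originalText"])
        parts.append(token["after"])
    return "".join(parts)


def make_tokens(pairs):
    """Build minimal token dicts from (originalText, after) pairs."""
    tokens = [{"originalText": text, "after": after, "before": ""}
              for text, after in pairs]
    # Each token's before mirrors the previous token's after.
    for prev, cur in zip(tokens, tokens[1:]):
        cur["before"] = prev["after"]
    return tokens


text = "This is a <b>test</b> sentence."

# after values as reported in this issue (two spaces before <b>, no space after </b>).
reported = make_tokens([
    ("This", " "), ("is", " "), ("a", " " * 2 + "<b>"),
    ("test", "</b>"), ("sentence", ""), (".", ""),
])

# after values implied by the character offsets (an assumption, see above).
expected = make_tokens([
    ("This", " "), ("is", " "), ("a", " <b>"),
    ("test", "</b> "), ("sentence", ""), (".", ""),
])

print(reconstruct(reported) == text)
print(reconstruct(expected) == text)
```

The reported values fail the round trip because one space moves from after the closing tag to before the opening tag, which is why the total length can still look plausible at a glance.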