invertible is incorrect for HTML tags

Open keto33 opened this issue 4 years ago • 4 comments

I am not sure if it is the intended behaviour, but it seems odd to me.

With a basic set of options

-annotators tokenize,cleanxml,ssplit,pos,lemma

parse the sentence

This is a <b>test</b> sentence.

the output is

{
  "sentences": [
    {
      "index": 0,
      "tokens": [
        {
          "index": 1,
          "word": "This",
          "originalText": "This",
          "lemma": "this",
          "characterOffsetBegin": 0,
          "characterOffsetEnd": 4,
          "pos": "DT",
          "before": "",
          "after": " "
        },
        {
          "index": 2,
          "word": "is",
          "originalText": "is",
          "lemma": "be",
          "characterOffsetBegin": 5,
          "characterOffsetEnd": 7,
          "pos": "VBZ",
          "before": " ",
          "after": " "
        },
        {
          "index": 3,
          "word": "a",
          "originalText": "a",
          "lemma": "a",
          "characterOffsetBegin": 8,
          "characterOffsetEnd": 9,
          "pos": "DT",
          "before": " ",
          "after": "  <b>"
        },
        {
          "index": 4,
          "word": "test",
          "originalText": "test",
          "lemma": "test",
          "characterOffsetBegin": 13,
          "characterOffsetEnd": 17,
          "pos": "NN",
          "before": " <b>",
          "after": "</b>"
        },
        {
          "index": 5,
          "word": "sentence",
          "originalText": "sentence",
          "lemma": "sentence",
          "characterOffsetBegin": 22,
          "characterOffsetEnd": 30,
          "pos": "NN",
          "before": "</b> ",
          "after": ""
        },
        {
          "index": 6,
          "word": ".",
          "originalText": ".",
          "lemma": ".",
          "characterOffsetBegin": 30,
          "characterOffsetEnd": 31,
          "pos": ".",
          "before": "",
          "after": ""
        }
      ]
    }
  ],
  "sections": [
  ]
}

For index #3, the after element is "  <b>" (two spaces before the tag), which is 5 characters. That token ends at character offset 9 and the next one begins at 13, which means the after element should be 4 characters, not 5.

Similarly, for index #5, the before element should be 5 characters, not 4 to match the character offsets.
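The offset arithmetic for index #3 can be checked with a few lines of Python. This is only a sketch: the values are copied by hand from the output above, and the variable names are made up for illustration rather than taken from CoreNLP.

```python
# Values copied from the JSON output above for tokens #3 ("a") and #4 ("test").
prev_end = 9              # characterOffsetEnd of token #3
next_begin = 13           # characterOffsetBegin of token #4
after_of_prev = "  <b>"   # "after" of token #3 (two spaces, as reported)
before_of_next = " <b>"   # "before" of token #4

# Number of characters that actually lie between the two tokens.
gap = next_begin - prev_end

print("gap between tokens:", gap)
print("len(after of #3): ", len(after_of_prev))
print("len(before of #4):", len(before_of_next))
```

The before value of token #4 has the same length as the gap, while the after value of token #3 is one character longer.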

Tested with version 4.3.1.

keto33 avatar Oct 22 '21 01:10 keto33

I actually disagree regarding #5: the before text is "</b> ", 5 characters, and that is exactly what lies between test and sentence.

The AfterAnnotation is definitely wrong, though. I just made a PR with a request for that issue and will merge it if the tests pass.

AngledLuffa avatar Oct 22 '21 21:10 AngledLuffa

I pointed to the wrong index. "</b> " is correct for before in index #5, but it is incorrect as after in index #4.
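One way to see which field is off is to rebuild the input sentence from the reported tokens, once using the before values and once using the after values. The sketch below assumes that "invertible" means either concatenation should reproduce the original text exactly; the helper names are hypothetical and the token values are copied by hand from the JSON output in the report.

```python
# (before, originalText, after) triples copied from the reported JSON output.
tokens = [
    ("", "This", " "),
    (" ", "is", " "),
    (" ", "a", "  <b>"),
    (" <b>", "test", "</b>"),
    ("</b> ", "sentence", ""),
    ("", ".", ""),
]
original = "This is a <b>test</b> sentence."


def rebuild_from_before(tokens):
    """Concatenate before + text for every token, then the last token's after."""
    text = "".join(before + word for before, word, _ in tokens)
    return text + tokens[-1][2]


def rebuild_from_after(tokens):
    """Start with the first token's before, then concatenate text + after."""
    return tokens[0][0] + "".join(word + after for _, word, after in tokens)


print(repr(rebuild_from_before(tokens)), rebuild_from_before(tokens) == original)
print(repr(rebuild_from_after(tokens)), rebuild_from_after(tokens) == original)
```

With the reported values, only the before-based reconstruction matches the input, which is consistent with the after values of tokens #3 and #4 being the incorrect ones.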

keto33 avatar Oct 22 '21 23:10 keto33

https://github.com/stanfordnlp/CoreNLP/pull/1214

AngledLuffa avatar Oct 23 '21 00:10 AngledLuffa

Thanks, that makes more sense. I can fix that too. Thanks for finding these!

AngledLuffa avatar Oct 23 '21 00:10 AngledLuffa