CoreNLP

Original character positions incorrect in Spanish annotation

Open hans opened this issue 9 years ago • 4 comments

@manning reported second-hand that character offset annotations seem to be wrong for Spanish NER output. Investigate.

hans, Sep 08 '16 19:09

Fixed in 2b0c701b.

hans, Sep 20 '16 04:09

As far as I can see this still isn't correct.

See the test testOffsetsSpacing() that I now have in SpanishTokenizerITest. It's checked in and passing, but the expected answers seem to be wrong for postverbal clitics, no? See the todo items.

manning, Dec 04 '16 18:12

https://github.com/stanfordnlp/CoreNLP/commit/528215fab335a318c4ce05bed004d29afc14afcc fixes the TODOs in the integration tests. I'll get a final integration test set up this week that verifies offset / spacing issues once and for all.
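
For illustration only, here is a rough sketch of the kind of invariant such a test could verify. It is not the CoreNLP test code: the class `OffsetInvariantSketch` and its `check` method are hypothetical names, and the sketch assumes each token carries a begin offset, an exclusive end offset, and the original text it was produced from. It checks that spans stay in range, do not overlap, and that slicing the input by each span gives back the recorded original text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of CoreNLP: checks token offsets against the input text.
final class OffsetInvariantSketch {

    /** Returns human-readable violations; an empty list means the offsets are consistent. */
    static List<String> check(String text, int[] begins, int[] ends, String[] originalTexts) {
        List<String> problems = new ArrayList<>();
        int previousEnd = 0;
        for (int i = 0; i < begins.length; i++) {
            // Spans are treated as [begin, end), matching String.substring.
            if (begins[i] < 0 || begins[i] > ends[i] || ends[i] > text.length()) {
                problems.add("token " + i + ": span out of range");
                continue;
            }
            if (begins[i] < previousEnd) {
                problems.add("token " + i + ": overlaps the previous token");
            }
            String slice = text.substring(begins[i], ends[i]);
            if (!slice.equals(originalTexts[i])) {
                problems.add("token " + i + ": span covers \"" + slice
                        + "\" but original text is \"" + originalTexts[i] + "\"");
            }
            previousEnd = ends[i];
        }
        return problems;
    }

    public static void main(String[] args) {
        // Offsets that line up with the text, so no violations are expected.
        System.out.println(check("y de el y",
                new int[] {0, 2, 5, 8}, new int[] {1, 4, 7, 9},
                new String[] {"y", "de", "el", "y"}));
    }
}
```

A check of this shape only passes for contractions if the tokenizer's policy is that every token's span covers exactly its original text, which is a design decision rather than a given.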

hans, Jan 09 '17 03:01

There's still one leftover comment in there, which I've made no effort to understand:

    // y de el y
    testOffsetsTextOriginalText("y del y", new int[] {0, 2, 3, 6}, new int[] {1, 3, 5, 7},
            new String[] { "y", "de", "el", "y"},
            new String[] { "y", "de", "el", "y"});  // todo [cdm 2017]: it's very unclear if this is what we actually want! Overlaps, concatenation doesn't work.
    // according to offsets, it should be "d" + "el"

What's the plan for this one or the token del in general?
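
To make the mismatch concrete, here is a small standalone sketch (plain Java, no CoreNLP; the class name `OffsetSliceDemo` is made up for illustration) that slices "y del y" by the begin and end offsets quoted in the test, treating each end offset as exclusive:

```java
import java.util.Arrays;

// Illustration only: slices the input by the offsets quoted in the test above.
final class OffsetSliceDemo {

    /** Returns text.substring(begins[i], ends[i]) for each token position i. */
    static String[] slices(String text, int[] begins, int[] ends) {
        String[] out = new String[begins.length];
        for (int i = 0; i < begins.length; i++) {
            out[i] = text.substring(begins[i], ends[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] s = slices("y del y", new int[] {0, 2, 3, 6}, new int[] {1, 3, 5, 7});
        System.out.println(Arrays.toString(s)); // prints [y, d, el, y]
    }
}
```

So the spans cover "d" and "el", while the expected original text for those two tokens is "de" and "el": the offsets and the original-text strings cannot both be taken literally.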

AngledLuffa, Aug 13 '22 17:08