CoreNLP
Original character positions incorrect in Spanish annotation
@manning reported second-hand that character offset annotations seem to be wrong for Spanish NER output. Investigate.
Fixed in 2b0c701b.
As far as I can see this still isn't correct.
See the test testOffsetsSpacing() that I now have in SpanishTokenizerITest. It's checked in and passes, but the expected answers seem to be wrong for postverbal clitics, no? See the todo items.
https://github.com/stanfordnlp/CoreNLP/commit/528215fab335a318c4ce05bed004d29afc14afcc fixes the TODOs in the integration tests. I'll get a final integration test set up this week that verifies offset / spacing issues once and for all.
There's still one leftover comment in there, which I've made no effort to understand:
// y de el y
testOffsetsTextOriginalText("y del y", new int[] {0, 2, 3, 6}, new int[] {1, 3, 5, 7},
    new String[] { "y", "de", "el", "y"},
    new String[] { "y", "de", "el", "y"}); // todo [cdm 2017]: it's very unclear if this is what we actually want! Overlaps, concatenation doesn't work.
// according to offsets, it should be "d" + "el"
What's the plan for this one or the token del in general?
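To make the mismatch in that leftover comment concrete, here is a minimal sketch that slices "y del y" using the begin/end offsets from the test above. It is plain Java and does not call CoreNLP; the class OffsetSliceDemo and the method sliceByOffsets are hypothetical names used only for illustration.

```java
import java.util.Arrays;

// Hypothetical helper, not part of CoreNLP: slices the original text by the
// begin/end character offsets that the test expects for each token.
class OffsetSliceDemo {

  static String[] sliceByOffsets(String text, int[] begins, int[] ends) {
    if (begins.length != ends.length) {
      throw new IllegalArgumentException("begins and ends must have the same length");
    }
    String[] slices = new String[begins.length];
    for (int i = 0; i < begins.length; i++) {
      // Begin is inclusive and end is exclusive, as in String.substring.
      slices[i] = text.substring(begins[i], ends[i]);
    }
    return slices;
  }

  public static void main(String[] args) {
    String text = "y del y";
    int[] begins = {0, 2, 3, 6};                  // offsets copied from the test
    int[] ends = {1, 3, 5, 7};
    String[] tokenWords = {"y", "de", "el", "y"}; // token text expected by the test

    String[] slices = sliceByOffsets(text, begins, ends);
    System.out.println("token words:   " + Arrays.toString(tokenWords));
    System.out.println("offset slices: " + Arrays.toString(slices));
  }
}
```

With these offsets the two pieces of del come out as "d" and "el". They concatenate back to "del", whereas the token words "de" and "el" concatenate to "deel", so the offsets and the token text cannot both be read as spans of the original string.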