spaCy
NER: Ensure zero-cost sequence with sentence split in entity
Description
If we use a sentence splitter as one of the annotating components during training, an entity can become split in the predicted Doc. Before this change, training would fail, because no zero-cost transition sequence could be found.
This fixes two scenarios:
- When the gold action is `B` and a split occurs after the current token, the `BEGIN` action is invalid. However, this was the only possible zero-cost action. This change makes `OUT` a zero-cost action in this case.
- When the gold action is `I` and a split occurs after the current token, the `IN` action is invalid, removing the only zero-cost action. This change makes `LAST` a zero-cost action, so that the entity can be properly closed.
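The two scenarios can be summarised with a small toy model. This is only an illustrative sketch in plain Python, not the parser's actual implementation: the function `zero_cost_action`, the `split_after` flag, and the `GOLD_TO_ACTION` table are hypothetical names, and `UNIT` is assumed here as the action name for the `U` tag.

```python
# Toy model of the zero-cost choice described above (not spaCy's real code).
# Assumption: each gold tag normally maps to one zero-cost action.
GOLD_TO_ACTION = {
    "B": "BEGIN",
    "I": "IN",
    "L": "LAST",
    "U": "UNIT",  # assumed action name for the U tag
    "O": "OUT",
}


def zero_cost_action(gold_tag, split_after):
    """Return the action treated as zero-cost for the current token.

    gold_tag: gold tag of the current token ("B", "I", "L", "U" or "O").
    split_after: True if a predicted sentence boundary follows this token.
    """
    if split_after:
        if gold_tag == "B":
            # BEGIN is invalid before a split, so give up on the entity.
            return "OUT"
        if gold_tag == "I":
            # IN is invalid before a split, so close the open entity.
            return "LAST"
    return GOLD_TO_ACTION[gold_tag]


print(zero_cost_action("B", split_after=True))
print(zero_cost_action("I", split_after=True))
print(zero_cost_action("I", split_after=False))
```

The point of the sketch is that every state still has at least one zero-cost action, which is what training needs in order to find a zero-cost sequence.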
Types of change
Bugfix
Checklist
- [x] I confirm that I have the right to submit this contribution under the project's MIT license.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
One thing that I am not very sure of: maybe in case 1, U should also be a zero-cost transition?
U would introduce a new incorrect prediction, so it should have cost 1 as well. The cost is false_negatives + false_positives. So if we can't recover a correct entity, we should give up on it and not return an entity that partially overlaps it. And once we've started an entity that's incorrect, we can end it whenever, so long as we're not making a gold entity unrecoverable.
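To make the cost argument concrete, here is a minimal sketch that counts false negatives plus false positives over entity spans. The helper `entity_cost` and the `(start, end, label)` tuple representation are assumptions for illustration only, not how the parser stores or scores entities.

```python
def entity_cost(gold, predicted):
    """Cost = false negatives + false positives over entity spans.

    Spans are hypothetical (start, end, label) tuples with an exclusive end.
    """
    gold, predicted = set(gold), set(predicted)
    false_negatives = len(gold - predicted)  # gold entities that were missed
    false_positives = len(predicted - gold)  # predicted entities not in gold
    return false_negatives + false_positives


# A three-token gold entity that a sentence split after token 0 makes unrecoverable.
gold = {(0, 3, "ORG")}

give_up = set()            # predict OUT: the entity is simply missed
partial = {(0, 1, "ORG")}  # predict U: a one-token entity overlapping the gold one

print(entity_cost(gold, give_up))
print(entity_cost(gold, partial))
```

Under these assumptions, giving up costs 1 (one false negative), while the partial `U` prediction costs 2, because it adds a false positive on top of the missed gold entity.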