spaCy

NER: Ensure zero-cost sequence with sentence split in entity

Open danieldk opened this issue 2 years ago • 2 comments

Description

If a sentence splitter is one of the annotating components during training, a predicted sentence boundary can fall inside a gold entity, splitting that entity in the predicted Doc. Before this change, training would fail in that situation, because no zero-cost transition sequence could be found.

This fixes two scenarios:

  1. When the gold action is B and a split occurs after the current token, the BEGIN action is invalid. However, this was the only possible zero-cost action. This change makes OUT a zero-cost action in this case.
  2. When the gold action is I and a split occurs after the current token, the IN action is invalid, removing the only zero-cost action. This change makes LAST a zero-cost action, so that the entity can be properly closed.
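The two scenarios can be illustrated with a toy oracle. This is only a sketch under stated assumptions: the function `zero_cost_actions`, its `split_after` flag, and the single-letter action labels are hypothetical names for illustration and do not reflect spaCy's actual transition-system code.

```python
from __future__ import annotations


def zero_cost_actions(gold_action: str, split_after: bool) -> set[str]:
    """Toy model of the zero-cost choice; NOT spaCy's real oracle.

    gold_action: the action the gold annotation asks for ("B", "I", "L", "U", "O").
    split_after: True if a predicted sentence boundary follows the current token.
    """
    if gold_action == "B":
        # BEGIN would open an entity that has to cross the boundary, so it
        # is invalid; OUT becomes the zero-cost fallback (scenario 1).
        return {"O"} if split_after else {"B"}
    if gold_action == "I":
        # IN would continue the entity across the boundary, so it is
        # invalid; LAST closes the entity at zero cost (scenario 2).
        return {"L"} if split_after else {"I"}
    # Other gold actions are unaffected in this simplified model.
    return {gold_action}


print(zero_cost_actions("B", split_after=True))
print(zero_cost_actions("I", split_after=True))
```

The point of the sketch is that each state keeps at least one zero-cost action, which is the property the training loop needs.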

Types of change

Bugfix

Checklist

  • [x] I confirm that I have the right to submit this contribution under the project's MIT license.
  • [x] I ran the tests, and all new and existing tests passed.
  • [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

danieldk, Mar 24 '23 14:03

One thing that I am not very sure of: maybe in case 1, U should also be a zero-cost transition?

danieldk, Mar 24 '23 14:03

> One thing that I am not very sure of: maybe in case 1, U should also be a zero-cost transition?

U would introduce a new incorrect prediction, so it should have cost 1 as well. The cost is false_negatives + false_positives. So if we can't recover a correct entity, we should give up on it rather than return an entity that partially overlaps it. And once we've started an entity that's incorrect, we can end it whenever we like, so long as we're not making a gold entity unrecoverable.

honnibal, Jun 27 '23 10:06
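The cost argument in the reply above can be checked with a small worked example. This is a sketch, assuming entities are compared as exact `(start, end, label)` spans; `span_cost` and the example spans are hypothetical and are not taken from spaCy's code.

```python
def span_cost(gold: set, predicted: set) -> int:
    """Cost as described in the thread: false negatives + false positives."""
    false_negatives = len(gold - predicted)   # gold entities we missed
    false_positives = len(predicted - gold)   # predicted entities not in gold
    return false_negatives + false_positives


# Assumed setup: one gold entity over tokens 0..1 (end-exclusive 2), with a
# predicted sentence boundary after token 0, so it can no longer be recovered.
gold = {(0, 2, "ORG")}

after_out = set()               # OUT on token 0: predict no entity
after_unit = {(0, 1, "ORG")}    # U on token 0: a partially overlapping entity

cost_out = span_cost(gold, after_out)
cost_unit = span_cost(gold, after_unit)
print(cost_out, cost_unit)
```

In this setup the missed gold entity is lost either way; choosing U adds a false positive on top of it, which is why U carries an extra cost of 1 relative to OUT.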