spaCy
spaCy copied to clipboard
Reset entity-related attributes when an entity is removed
Description
If you have entities on a Doc, and filter the list to remove some entities, and then set the list of entities, ent_id
and ent_kb_id
values are not reset. This means that you can get non-entity tokens with those values, for example, which seems undesirable.
This came up in #10555.
This works as intended but is in draft because
- We need to discuss if this is desired behavior
- Docs probably need an update
Types of change
bug fix?
Checklist
- [x] I confirm that I have the right to submit this contribution under the project's MIT license.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
Linking a related comment: https://github.com/explosion/spaCy/pull/9880#issuecomment-1020243254
Commit: https://github.com/explosion/spaCy/pull/9880/commits/6d1ce787932aa6687487ffd05fa709745ccff8c9#diff-7a0a537f9cec6a99a8c1ccc23167393b089cb17926b1432dac0e02bd33a4b3ee
I worry that this may be too breaking for v3. In general I do think it makes sense to consider updating Doc.set_ents
so that it goes through the whole doc to make all token.ent_
attributes consistent (make ent_*
consistent for all tokens within each provided span, clear all features in O
cases, etc.).
But even just setting token.ent_id
within entity spans more consistently in the span ruler PR broke our own code in the entity ruler. Anyone who's doing this incrementally because you couldn't set all the features before with doc.ents
may have code that breaks.
Following internal discussion, it was pointed out that with some major changes to the way Span
s work in v4 this approach alone won't be sufficient for ensuring consistency. I'm going to think about this more, but it might end up best to close this before working on a more complete approach.
See the proposal in #11328
Closing in favour of https://github.com/explosion/spaCy/pull/11328 which targets the v4
branch - let's continue discussion over there!