spaCy icon indicating copy to clipboard operation
spaCy copied to clipboard

Reset entity-related attributes when an entity is removed

Open polm opened this issue 2 years ago • 2 comments

Description

If you have entities on a Doc, and filter the list to remove some entities, and then set the list of entities, ent_id and ent_kb_id values are not reset. This means that you can get non-entity tokens with those values, for example, which seems undesirable.

This came up in #10555.

This works as intended but is in draft because

  1. We need to discuss if this is desired behavior
  2. Docs probably need an update

Types of change

bug fix?

Checklist

  • [x] I confirm that I have the right to submit this contribution under the project's MIT license.
  • [ ] I ran the tests, and all new and existing tests passed.
  • [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.

polm avatar Mar 27 '22 06:03 polm

Linking a related comment: https://github.com/explosion/spaCy/pull/9880#issuecomment-1020243254

Commit: https://github.com/explosion/spaCy/pull/9880/commits/6d1ce787932aa6687487ffd05fa709745ccff8c9#diff-7a0a537f9cec6a99a8c1ccc23167393b089cb17926b1432dac0e02bd33a4b3ee

adrianeboyd avatar Mar 28 '22 06:03 adrianeboyd

I worry that this may be too breaking for v3. In general I do think it makes sense to consider updating Doc.set_ents so that it goes through the whole doc to make all token.ent_ attributes consistent (make ent_* consistent for all tokens within each provided span, clear all features in O cases, etc.).

But even just setting token.ent_id within entity spans more consistently in the span ruler PR broke our own code in the entity ruler. Anyone who's doing this incrementally because you couldn't set all the features before with doc.ents may have code that breaks.

adrianeboyd avatar Apr 01 '22 07:04 adrianeboyd

Following internal discussion, it was pointed out that with some major changes to the way Spans work in v4 this approach alone won't be sufficient for ensuring consistency. I'm going to think about this more, but it might end up best to close this before working on a more complete approach.

polm avatar Aug 17 '22 09:08 polm

See the proposal in #11328

adrianeboyd avatar Aug 17 '22 09:08 adrianeboyd

Closing in favour of https://github.com/explosion/spaCy/pull/11328 which targets the v4 branch - let's continue discussion over there!

svlandeg avatar Aug 17 '22 14:08 svlandeg