spaCy
Removing (token level) entity information from doc.
How to reproduce the behaviour
The EntityRuler is run as in the example from the docs: https://spacy.io/usage/rule-based-matching#entityruler-ent-ids
Suppose a pipeline has a series of components that run after some entities have been added. These components may want to examine the token-level entity attributes to decide where to add or remove entities. However, it is difficult to keep the token-level entity attributes consistent with the doc.ents tuple. The main questions, when removing an entity from a document, are:
- Is there a standard way to remove all the token-level entity information (to make the token-level information consistent with doc.ents)?
- It seems that all token-level attributes have a setter except the IOB attributes. Is it possible to change this?
import spacy

TOK_ATTRS = [
    "ent_id", "ent_id_",
    "ent_kb_id", "ent_kb_id_",
    "ent_type", "ent_type_",
    "ent_iob", "ent_iob_",
]
TOK_VALS = [
    0, "",
    0, "",
    0, "",
    2, "O",
]

def show_toks(doc):
    for tok in doc:
        to_print = ["{}={}".format(attr, repr(getattr(tok, attr))) for attr in TOK_ATTRS]
        print("tok={}, {}".format(tok, ', '.join(to_print)))

def remove_ent(doc):
    new_ents = []
    for ent in doc.ents:
        # do not include this ent in new_ents
        # set token level attributes back to their defaults
        if ent.label_ == "GPE":
            for tok in doc[ent.start: ent.end]:
                for attr, val in zip(TOK_ATTRS, TOK_VALS):
                    setattr(tok, attr, val)
        # keep entity
        else:
            new_ents.append(ent)
    doc.ents = tuple(new_ents)

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "ORG", "pattern": "Apple", "id": "apple-id"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "sf-id"}]
ruler.add_patterns(patterns)
doc = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])
show_toks(doc)
print()
remove_ent(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
show_toks(doc)
Your Environment
- spaCy version: 3.2.3
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.8.12
There aren't setters for the IOB attrs precisely because keeping them consistent with doc.ents is tricky. The recommended way to deal with this is to modify the list of entities and set doc.ents, which should handle consistency of the token attributes behind the scenes. However, as your code sample reveals, while ent_iob is handled correctly, it looks like some other attributes are not reset if you remove an entity.
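A minimal sketch of that recommended approach, for illustration only. The helper name `drop_ents_with_label` is hypothetical, and the entities are set by hand with `Span` rather than through the EntityRuler to keep the example small:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple is opening its first big office in San Francisco.")

# Assumption: entities are assigned manually here instead of by the EntityRuler.
# Token 0 is "Apple"; tokens 8-9 are "San Francisco".
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 8, 10, label="GPE")]


def drop_ents_with_label(doc, label):
    """Hypothetical helper: rebuild doc.ents without entities of the given label."""
    doc.ents = [ent for ent in doc.ents if ent.label_ != label]
    return doc


drop_ents_with_label(doc, "GPE")

remaining = [(ent.text, ent.label_) for ent in doc.ents]
san_iob = doc[8].ent_iob_    # IOB tag of "San" after the entity is removed
san_type = doc[8].ent_type_  # entity type of "San" after the entity is removed
print(remaining, san_iob, repr(san_type))
```

Only `doc.ents` is reassigned here; no token-level attribute is written directly, so the IOB tags and entity types are updated by spaCy itself.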
As written, your code doesn't run, but if I modify it to just remove the GPE entities from the list and set doc.ents again, this is the output:
[('Apple', 'ORG'), ('San Francisco', 'GPE')]
tok=Apple, ent_id=3197271685619048373, ent_id_='apple-id', ent_kb_id=0, ent_kb_id_='', ent_type=383, ent_type_='ORG', ent_iob=3, ent_iob_='B'
tok=is, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=opening, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=its, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=first, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=big, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=office, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=in, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=San, ent_id=14866658854433846679, ent_id_='sf-id', ent_kb_id=0, ent_kb_id_='', ent_type=384, ent_type_='GPE', ent_iob=3, ent_iob_='B'
tok=Francisco, ent_id=14866658854433846679, ent_id_='sf-id', ent_kb_id=0, ent_kb_id_='', ent_type=384, ent_type_='GPE', ent_iob=1, ent_iob_='I'
tok=., ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
[('Apple', 'ORG')]
tok=Apple, ent_id=3197271685619048373, ent_id_='apple-id', ent_kb_id=0, ent_kb_id_='', ent_type=383, ent_type_='ORG', ent_iob=3, ent_iob_='B'
tok=is, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=opening, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=its, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=first, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=big, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=office, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=in, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=San, ent_id=14866658854433846679, ent_id_='sf-id', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=Francisco, ent_id=14866658854433846679, ent_id_='sf-id', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=., ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
Specifically, it looks like ent_id retains its value even after the entity is gone.
I think this is an oversight in the entity setting code. Thanks for pointing it out!
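Until that is addressed, one possible user-side workaround is to clear the leftover attributes by hand after reassigning doc.ents. The sketch below is an assumption about how that could look, not an official recommendation. `remove_ents_and_clear_ids` is a hypothetical helper, and the "sf-id" value is set manually to mimic what the EntityRuler assigns:

```python
import spacy
from spacy.tokens import Span


def remove_ents_and_clear_ids(doc, label):
    """Hypothetical helper: drop entities with `label`, then reset the
    ent_id and ent_kb_id of the tokens those entities covered."""
    removed = [ent for ent in doc.ents if ent.label_ == label]
    # Reassigning doc.ents takes care of ent_iob and ent_type.
    doc.ents = [ent for ent in doc.ents if ent.label_ != label]
    # ent_id and ent_kb_id have setters, so they can be cleared directly.
    for ent in removed:
        for tok in doc[ent.start:ent.end]:
            tok.ent_id_ = ""
            tok.ent_kb_id_ = ""
    return doc


nlp = spacy.blank("en")
doc = nlp("Apple is opening its first big office in San Francisco.")
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 8, 10, label="GPE")]
for tok in doc[8:10]:
    tok.ent_id_ = "sf-id"  # assumption: stands in for the id set by the EntityRuler

remove_ents_and_clear_ids(doc, "GPE")
cleared_ids = [tok.ent_id_ for tok in doc[8:10]]
print(cleared_ids)
```

This only touches attributes that have setters; the IOB values are still managed through doc.ents.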
Edited: sorry, this was intended to be a comment on the PR instead of the issue.
I worry that this may be too breaking for v3. In general I do think it makes sense to consider updating Doc.set_ents so that it goes through the whole doc to make all token.ent_* attributes consistent (make ent_* consistent for all tokens within each provided span, clear all features in O cases, etc.).
But even just setting token.ent_id within entity spans more consistently in the span ruler PR broke our own code in the entity ruler. Anyone who has been setting these attributes incrementally, because doc.ents could not previously set all of them, may have code that breaks.
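To see whether existing code would be affected by such a change, one could scan a doc for tokens that sit outside every entity but still carry entity features. The following is a rough sketch under that assumption. `find_stale_tokens` is a hypothetical name and is not part of spaCy:

```python
import spacy
from spacy.tokens import Span


def find_stale_tokens(doc):
    """Hypothetical check: return indices of tokens tagged O (or untagged)
    that still have a non-empty ent_type, ent_id or ent_kb_id."""
    stale = []
    for tok in doc:
        outside = tok.ent_iob_ in ("O", "")
        if outside and (tok.ent_type or tok.ent_id or tok.ent_kb_id):
            stale.append(tok.i)
    return stale


nlp = spacy.blank("en")
doc = nlp("Apple is opening its first big office in San Francisco.")
doc.ents = [Span(doc, 0, 1, label="ORG")]

before = find_stale_tokens(doc)
doc[8].ent_id_ = "sf-id"  # simulate a leftover id on "San", which is outside any entity
after = find_stale_tokens(doc)
print(before, after)
```

The check is read-only, so it can be run between pipeline components without changing the doc.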
I think that this has been resolved by #11328, which will be in spaCy v4.
This issue has been automatically closed because it was answered and there was no follow-up discussion.