Removing (token level) entity information from doc.

Open galtay opened this issue 3 years ago • 2 comments

How to reproduce the behaviour

The EntityRuler is run as in the example from the docs: https://spacy.io/usage/rule-based-matching#entityruler-ent-ids

Suppose a pipeline has a series of components that run after some entities have been added. These components may want to examine the token-level entity attributes to decide where to add or remove entities. However, it is difficult to keep the token-level entity attributes consistent with the doc.ents tuple. The main questions, when removing an entity from a document, are:

  • Is there a standard way to remove all the token-level entity information (to make the token-level information consistent with doc.ents)?
  • It seems that all token-level attributes have a setter except the IOB attributes (ent_iob, ent_iob_). Is it possible to change this?

import spacy

TOK_ATTRS = [
    "ent_id", "ent_id_",
    "ent_kb_id", "ent_kb_id_",
    "ent_type", "ent_type_",
    "ent_iob", "ent_iob_",
]

TOK_VALS = [
    0, "",
    0, "",
    0, "",
    2, "O",
]



def show_toks(doc):
    for tok in doc:
        to_print = ["{}={}".format(attr, repr(getattr(tok, attr))) for attr in TOK_ATTRS]
        print("tok={}, {}".format(tok, ', '.join(to_print)))


def remove_ent(doc):
    new_ents = []
    for ent in doc.ents:

        # do not include this ent in new_ents
        # set token level attributes back to their defaults
        if ent.label_ == "GPE":
            for tok in doc[ent.start: ent.end]:
                for attr, val in zip(TOK_ATTRS, TOK_VALS):
                    setattr(tok, attr, val)

        # keep entity
        else:
            new_ents.append(ent)

    doc.ents = tuple(new_ents)


nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "ORG", "pattern": "Apple", "id": "apple-id"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "sf-id"}]
ruler.add_patterns(patterns)

doc = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])
show_toks(doc)
print()
remove_ent(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
show_toks(doc)

Your Environment

  • spaCy version: 3.2.3
  • Platform: macOS-10.16-x86_64-i386-64bit
  • Python version: 3.8.12

galtay avatar Mar 26 '22 16:03 galtay

There aren't setters for the IOB attrs precisely because keeping them consistent with doc.ents is tricky. The recommended way to deal with this is to modify the list of entities and set doc.ents, which should handle consistency of the token attributes behind the scenes. However, as your code sample reveals, while ent_iob is handled correctly, it looks like some other attributes are not reset if you remove an entity.
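A minimal sketch of the recommended approach, assuming the same blank English pipeline and ruler patterns as in the report above: filter doc.ents and assign the result back instead of touching the token attributes directly. The helper name drop_label is hypothetical and not part of spaCy.

```python
import spacy


def drop_label(doc, label):
    """Remove every entity with the given label by reassigning doc.ents.

    Hypothetical helper for illustration; it is not a spaCy API.
    """
    doc.ents = [ent for ent in doc.ents if ent.label_ != label]
    return doc


nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple", "id": "apple-id"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "sf-id"},
])

doc = drop_label(nlp("Apple is opening its first big office in San Francisco."), "GPE")

# Only the ORG entity should remain in doc.ents.
remaining = [(ent.text, ent.label_) for ent in doc.ents]
print(remaining)

# The doc.ents setter resets ent_iob_ and ent_type_ on the removed span.
print([(tok.text, tok.ent_iob_, tok.ent_type_) for tok in doc[8:10]])
```

This only covers the attributes that the setter manages; whether the other token-level attributes are reset as well depends on the spaCy version.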

As-is, your code doesn't run, but if I modify it to just remove GPE entities from the list and set the list again, this is the output:

[('Apple', 'ORG'), ('San Francisco', 'GPE')]
tok=Apple, ent_id=3197271685619048373, ent_id_='apple-id', ent_kb_id=0, ent_kb_id_='', ent_type=383, ent_type_='ORG', ent_iob=3, ent_iob_='B'
tok=is, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=opening, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=its, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=first, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=big, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=office, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=in, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=San, ent_id=14866658854433846679, ent_id_='sf-id', ent_kb_id=0, ent_kb_id_='', ent_type=384, ent_type_='GPE', ent_iob=3, ent_iob_='B'
tok=Francisco, ent_id=14866658854433846679, ent_id_='sf-id', ent_kb_id=0, ent_kb_id_='', ent_type=384, ent_type_='GPE', ent_iob=1, ent_iob_='I'
tok=., ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'

[('Apple', 'ORG')]
tok=Apple, ent_id=3197271685619048373, ent_id_='apple-id', ent_kb_id=0, ent_kb_id_='', ent_type=383, ent_type_='ORG', ent_iob=3, ent_iob_='B'
tok=is, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=opening, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=its, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=first, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=big, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=office, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=in, ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=San, ent_id=14866658854433846679, ent_id_='sf-id', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=Francisco, ent_id=14866658854433846679, ent_id_='sf-id', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'
tok=., ent_id=0, ent_id_='', ent_kb_id=0, ent_kb_id_='', ent_type=0, ent_type_='', ent_iob=2, ent_iob_='O'

Specifically, it looks like ent_id (and with it ent_id_) retains its value even if the entity is gone.

I think this is an oversight in the entity setting code. Thanks for pointing it out!
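Until the setter handles this, one possible workaround is to clear the leftover attributes by hand on the tokens of each removed entity. This is only a sketch. It relies on the observation from the report that ent_id and ent_kb_id have setters while the IOB attributes do not, and remove_ents_by_label is a hypothetical helper name.

```python
import spacy


def remove_ents_by_label(doc, label):
    """Drop entities with `label` and clear their leftover token attributes.

    Hypothetical helper for illustration; it is not a spaCy API.
    """
    keep, dropped = [], []
    for ent in doc.ents:
        (dropped if ent.label_ == label else keep).append(ent)

    # Record the token offsets before the annotation changes.
    offsets = [(ent.start, ent.end) for ent in dropped]

    # Let the doc.ents setter reset ent_iob and ent_type.
    doc.ents = keep

    # Manually reset the attributes that may have been left behind.
    for start, end in offsets:
        for tok in doc[start:end]:
            tok.ent_id = 0
            tok.ent_kb_id = 0
    return doc


nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple", "id": "apple-id"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "sf-id"},
])

doc = remove_ents_by_label(
    nlp("Apple is opening its first big office in San Francisco."), "GPE"
)
kept = [(ent.text, ent.label_) for ent in doc.ents]
print(kept)
print([(tok.text, tok.ent_id_, tok.ent_iob_) for tok in doc[8:10]])
```

The IOB attributes are never written directly here; they are left to the doc.ents setter, which is the part that already behaves correctly.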

polm avatar Mar 27 '22 05:03 polm

Edited: sorry, this was intended to be a comment on the PR instead of the issue.

I worry that this may be too breaking for v3. In general, I do think it makes sense to consider updating Doc.set_ents so that it goes through the whole doc to make all token.ent_* attributes consistent (make ent_* consistent for all tokens within each provided span, clear all features in O cases, etc.).

But even just setting token.ent_id within entity spans more consistently in the span ruler PR broke our own code in the entity ruler. Anyone who's doing this incrementally because you couldn't set all the features before with doc.ents may have code that breaks.
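For context, a small sketch of how Doc.set_ents can be called today, assuming the v3 signature with the outside and default keyword arguments. It marks only the GPE spans as outside any entity and leaves every other token unmodified. As discussed above, in v3 this may still leave ent_id behind on the affected tokens.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple", "id": "apple-id"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "sf-id"},
])

doc = nlp("Apple is opening its first big office in San Francisco.")

# Collect the spans to remove before changing the annotation.
gpe_spans = [ent for ent in doc.ents if ent.label_ == "GPE"]

# Mark only these spans as outside; all other tokens keep their annotation.
doc.set_ents([], outside=gpe_spans, default="unmodified")

ents_after = [(ent.text, ent.label_) for ent in doc.ents]
print(ents_after)
```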

adrianeboyd avatar Apr 01 '22 06:04 adrianeboyd

I think that this has been resolved by #11328, which will be in spaCy v4.

adrianeboyd avatar Oct 28 '22 12:10 adrianeboyd

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions[bot] avatar Nov 05 '22 00:11 github-actions[bot]