Doc span group spans aren't adjusted for retokenization

Open kinghuang opened this issue 2 years ago • 4 comments

When a Doc object is retokenized, the entity spans in Doc.ents reflect the new token alignment, but the spans in Doc.spans retain the original doc's token indexes, leading to unexpected spans.

How to reproduce the behaviour

Here is a modification of the example for Retokenizer.split that stores NewYork as an entity and in a span group before retokenization.

Split Example Code:

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

doc = nlp("I live in NewYork")
ny = Span(doc, 3, 4, label="CITY")
doc.ents = [ny]
doc.spans["spans"] = [ny]

with doc.retokenize() as retokenizer:
    heads = [(doc[3], 1), doc[2]]
    attrs = {"POS": ["PROPN", "PROPN"],
             "DEP": ["pobj", "compound"]}
    retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)

Output:

>>> print(doc.ents)
(NewYork,)
>>> print(doc.spans["spans"])
[New]

Notice how the entity span is still NewYork, but the span group span is now just New.

In the case of a merge, spans can be lost altogether by exceeding the range of the retokenized doc. Here is a modification of the example for Retokenizer.merge.

Merge Example Code:

doc = nlp("I like David Bowie")
db = Span(doc, 2, 4, label="PERSON")
doc.ents = [db]
doc.spans["spans"] = [db]

with doc.retokenize() as retokenizer:
    attrs = {"LEMMA": "David Bowie"}
    retokenizer.merge(doc[2:4], attrs=attrs)

print(doc.ents)
print(doc.spans["spans"])

Output:

>>> print(doc.ents)
(David Bowie,)
>>> print(doc.spans["spans"])
[]

The range of the original David Bowie span is no longer valid and disappears from the span group.

Your Environment

  • Operating System: macOS Ventura 13.1 (22C65)
  • Python Version Used: 3.11.0
  • spaCy Version Used: 3.4.4
  • Environment Information:

kinghuang avatar Dec 24 '22 18:12 kinghuang

Thanks for bringing this up. It does seem a little weirdly inconsistent, but it is actually expected behaviour due to a few details of the way Spans and Entities are defined. In particular, let me pull out this line from the Doc.retokenize docs:

All views of the Doc (Span and Token) created before the retokenization are invalidated, although they may accidentally continue to work.

Basically, Spans are in some sense imaginary, in that they are only views of the Doc, and not real slices of it. This is why a Span extension can't have two different values for the same Span, for example. (This model is inconsistent in a few places, which is one reason it will probably change a little in v4.) So it is completely expected that Spans break when you retokenize.

The question is, why do Entities work? It's because they're also views on token attributes like Token.ent_type_, and when doing retokenization we track those (along with other token attributes). It also works this way because doc.ents is rebuilt from those token attributes each time you access it rather than stored as a list, so you don't end up with stale Span objects.
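To make the difference concrete, here is a toy model in plain Python. It is not spaCy's implementation: Tok, derive_ents and split are hypothetical names, and the model ignores IOB tags, so adjacent entities with the same label would be merged. It only shows why a view derived from per-token labels survives a split while stored token indexes do not.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    text: str
    ent_type: str = ""  # per-token label, loosely analogous to Token.ent_type_


def derive_ents(tokens):
    """Rebuild (start, end, label) ranges from per-token labels on every call."""
    ents, start = [], None
    # A label-free sentinel token closes an entity that runs to the end.
    for i, tok in enumerate(tokens + [Tok("")]):
        if start is not None and tok.ent_type != tokens[start].ent_type:
            ents.append((start, i, tokens[start].ent_type))
            start = None
        if start is None and tok.ent_type:
            start = i
    return ents


def split(tokens, index, pieces):
    """Replace one token with several, copying its label to each piece."""
    label = tokens[index].ent_type
    return tokens[:index] + [Tok(p, label) for p in pieces] + tokens[index + 1:]


tokens = [Tok("I"), Tok("live"), Tok("in"), Tok("NewYork", "CITY")]
stored_span = (3, 4)  # token indexes saved before the split

tokens = split(tokens, 3, ["New", "York"])

ents = derive_ents(tokens)
derived_text = ["".join(t.text for t in tokens[s:e]) for s, e, _ in ents]
stored_text = "".join(t.text for t in tokens[stored_span[0]:stored_span[1]])

print(derived_text)  # ['NewYork']
print(stored_text)   # New
```

The derived entity follows the labels onto both new tokens, while the stored (3, 4) range now covers only the first piece, which mirrors the output in the report above.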

We could do something similar for Spans, but while the retokenize operations are well-defined for Entities, I'm not sure that's the case for Spans - what's right for one type of Span might be wrong for another. We can take a look at it, though - for spancat-type Spans it's easy to see that it would be helpful if they worked the same way Entities do.

polm avatar Dec 26 '22 09:12 polm

Thanks for pointing out that detail in the Doc.retokenize docs. I didn't notice it was called out there!

I understand why it works this way for ents vs. spans internally. The behaviour caught me off guard when I retokenized based on a span created by a SpanRuler, which then "shifted" spans in other span groups since they're now invalid. I just assumed the spans would be adjusted somehow like the ents.
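Until something like that exists, one possible workaround is to record character offsets before retokenizing and rebuild the span groups afterwards with Doc.char_span. This is only a sketch: the variable names saved and rebuilt are illustrative, it keeps just the label of each span, and any span whose offsets no longer fall on token boundaries is silently dropped because char_span returns None in that case.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("I like David Bowie")
doc.spans["spans"] = [Span(doc, 2, 4, label="PERSON")]

# Character offsets are unaffected by retokenization, so save those
# instead of relying on the stored token indexes.
saved = {
    key: [(span.start_char, span.end_char, span.label_) for span in group]
    for key, group in doc.spans.items()
}

with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[2:4])

# Rebuild every span group against the new tokenization.
for key, offsets in saved.items():
    rebuilt = [doc.char_span(start, end, label=label) for start, end, label in offsets]
    doc.spans[key] = [span for span in rebuilt if span is not None]

print([(span.text, span.label_) for span in doc.spans["spans"]])
```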

kinghuang avatar Dec 27 '22 05:12 kinghuang

My initial feeling is that Doc.spans is a first-class member of the doc and the retokenization should also apply here. I'm not sure how complicated the implementation would be at this point, and there will be some tricky cases where you can't preserve the span boundaries.

adrianeboyd avatar Jan 09 '23 10:01 adrianeboyd

Spans are trickier than ents because ents have some concept of head/phrase that can be used to decide how to adjust the entity boundaries.

My initial sketch of how this could work:

  • if all the spans' char offsets still line up to token boundaries, adjust the token offsets and be happy
  • if the char offsets don't line up, refuse to retokenize by default
  • add an alignment_mode-style option similar to Doc.char_span to expand/contract spans in cases where the token boundaries don't line up and apply those modifications to all the spans in Doc.spans
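The three points above can be sketched in plain Python, independent of spaCy. The function name realign and the representation of a tokenization as (start_char, end_char) pairs are assumptions made for illustration; the mode names are borrowed from Doc.char_span. A None result under "strict" marks the case where the retokenizer would refuse by default.

```python
def realign(span, old_tokens, new_tokens, alignment_mode="strict"):
    """Map a (start, end) token span from old_tokens to new_tokens.

    Tokens are (start_char, end_char) pairs. Returns the new (start, end)
    token span, or None if it cannot be represented under the given mode.
    """
    start_char = old_tokens[span[0]][0]
    end_char = old_tokens[span[1] - 1][1]
    if alignment_mode == "strict":
        starts = [s for s, _ in new_tokens]
        ends = [e for _, e in new_tokens]
        # Both char offsets must still sit exactly on token boundaries.
        if start_char in starts and end_char in ends:
            return starts.index(start_char), ends.index(end_char) + 1
        return None
    if alignment_mode == "expand":
        # Keep every new token that overlaps the original char range.
        hits = [i for i, (s, e) in enumerate(new_tokens)
                if s < end_char and e > start_char]
    elif alignment_mode == "contract":
        # Keep only new tokens that lie fully inside the original char range.
        hits = [i for i, (s, e) in enumerate(new_tokens)
                if s >= start_char and e <= end_char]
    else:
        raise ValueError(f"unknown alignment_mode: {alignment_mode}")
    return (hits[0], hits[-1] + 1) if hits else None


# "I like David Bowie" before and after merging "David Bowie"
before_merge = [(0, 1), (2, 6), (7, 12), (13, 18)]
after_merge = [(0, 1), (2, 6), (7, 18)]

# "I live in NewYork" before and after splitting "NewYork"
before_split = [(0, 1), (2, 6), (7, 9), (10, 17)]
after_split = [(0, 1), (2, 6), (7, 9), (10, 13), (13, 17)]

print(realign((2, 4), before_merge, after_merge))            # (2, 3)
print(realign((3, 4), before_split, after_split))            # (3, 5)
print(realign((2, 3), before_merge, after_merge))            # None
print(realign((2, 3), before_merge, after_merge, "expand"))  # (2, 3)
```

The third call is the tricky case: a span over "David" alone has no exact counterpart once "David Bowie" is a single token, so it either blocks the merge, grows to cover the merged token ("expand"), or disappears ("contract").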

In general spans (any spans, not just Doc.spans) need some concept of being "dirty" if there are modifications in the doc in the background, but this is tricky to handle efficiently. In v2, spans weren't writable and did try to automatically adjust their internal offsets if the doc was modified in the background. I'll need to look into the details a bit more.
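One cheap way to detect a "dirty" span, shown here only as a toy sketch, is a revision counter on the doc that each span snapshots at creation. ToyDoc and ToySpan are hypothetical classes, not spaCy's, and this only detects staleness; it does not adjust offsets the way v2 tried to.

```python
class ToyDoc:
    def __init__(self, words):
        self.words = list(words)
        self.revision = 0  # bumped on every structural change

    def merge(self, start, end):
        """Merge tokens [start, end) into one and mark older views as stale."""
        self.words[start:end] = [" ".join(self.words[start:end])]
        self.revision += 1


class ToySpan:
    def __init__(self, doc, start, end):
        self.doc, self.start, self.end = doc, start, end
        self.revision = doc.revision  # snapshot taken at creation time

    @property
    def is_dirty(self):
        # A single integer comparison, so the check costs almost nothing.
        return self.revision != self.doc.revision

    @property
    def text(self):
        if self.is_dirty:
            raise RuntimeError("span was created before the doc was modified")
        return " ".join(self.doc.words[self.start:self.end])


toy_doc = ToyDoc(["I", "like", "David", "Bowie"])
old_span = ToySpan(toy_doc, 2, 4)
text_before = old_span.text  # "David Bowie"

toy_doc.merge(2, 4)
fresh_span = ToySpan(toy_doc, 2, 3)

print(old_span.is_dirty)  # True
print(fresh_span.text)    # David Bowie
```

With a check like this a stale span fails loudly instead of silently pointing at the wrong tokens, at the cost of one extra integer per span.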

adrianeboyd avatar Feb 10 '23 09:02 adrianeboyd