spaCy
Setting an extension attribute on one span changes it on the other
Problem
I am working with a two-level NER taxonomy: I store the first level in the Span.label_ attribute and the second in an extension, Span._.type. My annotations come from software that allows spans to overlap, and I am writing a script that reconciles overlapping annotations. After banging my head against this for a while, I realized that spaCy behaves quite oddly with extensions. The behavior of .label_ suggests that two spans over the same token range are separate objects, but the extensions behave as if both spans were the same object. I find this quite odd.
How to reproduce the behaviour
import spacy
from spacy.tokens import Span

# Register a custom extension attribute for the second taxonomy level.
Span.set_extension("type", default=None)

nlp = spacy.load("en_core_web_md")
text = "lives with husband"
doc = nlp(text)

# First annotation over the token "husband".
span1 = doc[2:3]
span1.label_ = "social_support"
span1._.type = "has_support"

# Second annotation over the same token range.
span2 = doc[2:3]
span2.label_ = "marital_status"
span2._.type = "married"
Now, I would expect these to be two separate span objects, each with its own label and extension attributes, but this holds only halfway:
print(span1.label_, span1._.type)
# social_support married
#                ^^^^^^^ modifying the second span changed the first one!
print(span2.label_, span2._.type)
# marital_status married
Info about spaCy
- spaCy version: 3.2.0
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.9.7
- Pipelines: en_core_web_md (3.2.0), en_core_web_sm (3.2.0)
Yes, the custom extensions currently only use the span start/end and not any other attributes to distinguish spans. There's a related PR in progress #9708, but some of the serialization details are tricky in terms of backwards compatibility.
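To make the explanation above concrete, here is a toy model in plain Python. It is a sketch, not spaCy's actual implementation: `ToyDoc`, `ToySpan`, `set_ext`, `get_ext`, and the `include_label` switch are all hypothetical names made up for illustration. The only thing taken from the reply is the idea that extension values are looked up by span start/end, while the label lives on the span object itself.

```python
# Toy model of the behaviour described above. NOT spaCy code:
# ToyDoc, ToySpan, set_ext, get_ext and include_label are hypothetical.


class ToyDoc:
    def __init__(self):
        # One shared store per document for all span extension values.
        self.user_data = {}


class ToySpan:
    def __init__(self, doc, start, end, label=""):
        self.doc = doc
        self.start = start
        self.end = end
        self.label = label  # kept on the span object, so it never collides

    def _key(self, name, include_label=False):
        # Default key uses only the extension name and the span boundaries.
        if include_label:
            return (name, self.start, self.end, self.label)
        return (name, self.start, self.end)

    def set_ext(self, name, value, include_label=False):
        self.doc.user_data[self._key(name, include_label)] = value

    def get_ext(self, name, include_label=False):
        return self.doc.user_data.get(self._key(name, include_label))


doc = ToyDoc()
span1 = ToySpan(doc, 2, 3, label="social_support")
span2 = ToySpan(doc, 2, 3, label="marital_status")

# Keyed by start/end only: the second write overwrites the first.
span1.set_ext("type", "has_support")
span2.set_ext("type", "married")
collided = span1.get_ext("type")

# Keyed by start/end plus label: each span keeps its own value.
span1.set_ext("type", "has_support", include_label=True)
span2.set_ext("type", "married", include_label=True)
separate = (
    span1.get_ext("type", include_label=True),
    span2.get_ext("type", include_label=True),
)

print(collided)
print(separate)
```

In this model the two spans are distinct Python objects with distinct labels, yet they share one storage slot for `type` as long as the key ignores everything except the boundaries. Adding the label to the key is just one way to tell them apart in the sketch; it is not a claim about how the linked PRs resolve the issue.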
The newer PR didn't get linked to this issue at the time, but we decided to move these changes to v4 and #11429 fixes this bug.