spaCy
Lemmas for contractions have changed with spaCy 3.0
How to reproduce the behaviour
With spaCy 3.0.0:
import spacy
nlp = spacy.load("en_core_web_trf") # same result with en_core_web_lg
doc = nlp("Can't go to school")
print([(token.text, token.lemma_) for token in doc])
prints the output:
[('Ca', 'ca'), ("n't", "n't"), ('go', 'go'), ('to', 'to'), ('school', 'school')]
However, with spaCy 2.3.5, similar code:
import spacy
nlp = spacy.load("en_core_web_lg")
doc = nlp("Can't go to school")
print([(token.text, token.lemma_) for token in doc])
prints output:
[('Ca', 'can'), ("n't", 'not'), ('go', 'go'), ('to', 'to'), ('school', 'school')]
Note how the lemmas for the two tokens of Can't ('Ca' and "n't") differ between the two spaCy versions above.
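To pin down exactly which tokens changed, the two printed lists can be compared pair by pair. This is a minimal sketch in plain Python that uses the outputs quoted above as literals; the names `v2_output`, `v3_output` and `lemma_diff` are only illustrative.

```python
# Outputs copied from the two runs above.
v3_output = [("Ca", "ca"), ("n't", "n't"), ("go", "go"), ("to", "to"), ("school", "school")]
v2_output = [("Ca", "can"), ("n't", "not"), ("go", "go"), ("to", "to"), ("school", "school")]


def lemma_diff(old, new):
    """Return (text, old_lemma, new_lemma) for every token whose lemma changed."""
    changed = []
    for (old_text, old_lemma), (new_text, new_lemma) in zip(old, new):
        # Both runs tokenize the sentence identically, so the texts should line up.
        assert old_text == new_text, "tokenization differs between runs"
        if old_lemma != new_lemma:
            changed.append((old_text, old_lemma, new_lemma))
    return changed


print(lemma_diff(v2_output, v3_output))
```

Only the two tokens of the contraction are reported; the other lemmas are unchanged.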
Your Environment
- Operating System: OSX
- Python Version Used: 3.7.6
- spaCy Version Used: 3.0.0
- Environment Information:
Info about spaCy
- spaCy version: 3.0.0
- Platform: Darwin-19.5.0-x86_64-i386-64bit
- Python version: 3.7.6
- Pipelines: en_core_web_lg (3.0.0), en_core_web_sm (3.0.0), en_core_web_trf (3.0.0)
That's a good point! We took the lemma exceptions out of the tokenizer (so the tokenizer is only dealing with tokenization) without moving them to a new component. We can plan to add the lemma exceptions back in the next time we release new models.
In the meantime, you can add them as attribute_ruler patterns with something like:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp.get_pipe("attribute_ruler").add([[{"LOWER": "ca"}, {"LOWER": "n't"}]], {"LEMMA": "can"})
nlp.get_pipe("attribute_ruler").add([[{"LOWER": "ca"}, {"LOWER": "n't"}]], {"LEMMA": "not"}, index=1)
See: https://spacy.io/api/attributeruler
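The same two-call pattern can be repeated for other contractions. The sketch below is only an illustration: it uses a blank English pipeline so that it runs without a downloaded model, and the `CONTRACTIONS` table is a hypothetical, incomplete list rather than the set of exceptions spaCy ships. With a pretrained pipeline you would fetch the existing component with nlp.get_pipe("attribute_ruler") as shown above instead of adding a new one.

```python
import spacy

# Blank pipeline: tokenizer only, so no model download is needed for this sketch.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")

# Hypothetical table: (first token, second token, lemma of first, lemma of second).
CONTRACTIONS = [
    ("ca", "n't", "can", "not"),
    ("wo", "n't", "will", "not"),
    ("do", "n't", "do", "not"),
]

for first, second, first_lemma, second_lemma in CONTRACTIONS:
    pattern = [[{"LOWER": first}, {"LOWER": second}]]
    # index selects which token of the matched span receives the attributes.
    ruler.add(patterns=pattern, attrs={"LEMMA": first_lemma}, index=0)
    ruler.add(patterns=pattern, attrs={"LEMMA": second_lemma}, index=1)

doc = nlp("Can't go to school")
lemmas = [(token.text, token.lemma_) for token in doc]
print(lemmas)
```

Because the blank pipeline has no lemmatizer, only the tokens matched by a rule get a lemma here; the remaining tokens keep an empty lemma.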
Thanks for your response! The attribute_ruler is a great feature and good to be aware of. It would also help a lot with migrating from older spaCy versions to 3.0 if the lemma exceptions were added back.
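I believe these attributes should be set by default, because in v2 the tokenizer exceptions looked like this:
# v2-style tokenizer exceptions; in v3 LEMMA and TAG can no longer be set here
from spacy.symbols import ORTH, LEMMA, NORM, TAG
TOKENIZER_EXCEPTIONS = {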
# do
"don't": [
{ORTH: "do", LEMMA: "do"},
{ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
"doesn't": [
{ORTH: "does", LEMMA: "do"},
{ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
"didn't": [
{ORTH: "did", LEMMA: "do"},
{ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
# can
"can't": [
{ORTH: "ca", LEMMA: "can"},
{ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
"couldn't": [
{ORTH: "could", LEMMA: "can"},
{ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
# have
"I've": [
{ORTH: "I", LEMMA: "I"},
{ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VERB"}],
"haven't": [
{ORTH: "have", LEMMA: "have"},
{ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
"hasn't": [
{ORTH: "has", LEMMA: "have"},
{ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
"hadn't": [
{ORTH: "had", LEMMA: "have"},
{ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
# will/shall will be replaced by will
"I'll": [
{ORTH: "I", LEMMA: "I"},
{ORTH: "'ll", LEMMA: "will", NORM: "will", TAG: "VERB"}],
"he'll": [
{ORTH: "he", LEMMA: "he"},
{ORTH: "'ll", LEMMA: "will", NORM: "will", TAG: "VERB"}],
"she'll": [
{ORTH: "she", LEMMA: "she"},
{ORTH: "'ll", LEMMA: "will", NORM: "will", TAG: "VERB"}],
"it'll": [
{ORTH: "it", LEMMA: "it"},
{ORTH: "'ll", LEMMA: "will", NORM: "will", TAG: "VERB"}],
"won't": [
{ORTH: "wo", LEMMA: "will"},
{ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
"wouldn't": [
{ORTH: "would", LEMMA: "will"},
{ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
# be
"I'm": [
{ORTH: "I", LEMMA: "I"},
{ORTH: "'m", LEMMA: "be", NORM: "am", TAG: "VERB"}]
}
I assume these need to be set in 3.0 using the attribute_ruler and not tokenizer exceptions, because last I checked the tokenizer exceptions in 3.0 only allowed ORTH and NORM (not LEMMA).
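For anyone else migrating, here is a rough sketch of how that split looks in v3 as I understand it: the tokenizer special case carries only ORTH and NORM, and the lemma is set afterwards by the attribute_ruler. The word "gimme" and its lemma are made up for illustration, and a blank English pipeline is used so the snippet does not depend on a downloaded model.

```python
import spacy
from spacy.symbols import NORM, ORTH

nlp = spacy.blank("en")

# Tokenizer special case: only ORTH and NORM are set here.
# "gimme" is an illustrative example, not one of the exceptions from this thread.
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}])

# The lemma is assigned later in the pipeline by the attribute_ruler.
ruler = nlp.add_pipe("attribute_ruler")
ruler.add(patterns=[[{"ORTH": "gim"}, {"ORTH": "me"}]], attrs={"LEMMA": "give"}, index=0)

doc = nlp("gimme that")
rows = [(token.text, token.norm_, token.lemma_) for token in doc]
print(rows)
```

The design point is that tokenization and normalization stay in the tokenizer, while anything that looks like annotation (lemma, tag) moves to a pipeline component.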
I am new to spaCy 3 (and have no idea why both the attribute_ruler and tokenizer exceptions exist at all). I just ran into this strange behavior and am looking for a way to get clean lemmas.
I think this issue needs to be re-opened.
Yes, the attribute_ruler is the right place to add these exceptions in v3. We will need to add these exceptions to the attribute_ruler when we configure the pretrained pipelines for the next version of the models, which would be v3.0.1 (model version, not spaCy version).
Okay, I am reopening this as it might help track it to a concrete resolution. If there are not too many rules, is it possible to get all the attribute_ruler rules that would bring the 3.0.x behavior back to 2.3.x in terms of contractions?
We'll plan to have some updates in the v3.1 models. This isn't 100% of the v2.3 lemma exceptions, but covers the most common contractions. You can load the patterns like this:
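import spacy
import srsly
nlp = spacy.load("en_core_web_sm")  # assumption: any v3.0 English pipeline
patterns = srsly.read_json("ar_patterns.json")
# Replace the pipeline's attribute_ruler with one that uses the downloaded patterns
nlp.remove_pipe("attribute_ruler")
ar = nlp.add_pipe("attribute_ruler", before="lemmatizer")
ar.add_patterns(patterns)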
Updated patterns (rename to ar_patterns.json): ar_patterns.json.txt
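For reference, add_patterns expects a list of dicts with "patterns", "attrs" and an optional "index" key. The sketch below writes and reloads a tiny file in that layout; the two entries are hypothetical and are not copied from the attached file, and a blank pipeline is used so the snippet runs without a pretrained model.

```python
import tempfile
from pathlib import Path

import spacy
import srsly

# Hypothetical entries in the layout that add_patterns accepts.
patterns = [
    {"patterns": [[{"LOWER": "ca"}, {"LOWER": "n't"}]], "attrs": {"LEMMA": "can"}, "index": 0},
    {"patterns": [[{"LOWER": "ca"}, {"LOWER": "n't"}]], "attrs": {"LEMMA": "not"}, "index": 1},
]

# Round-trip through a JSON file, as with the downloaded ar_patterns.json.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "ar_patterns.json"
    srsly.write_json(path, patterns)
    loaded = srsly.read_json(path)

nlp = spacy.blank("en")
ar = nlp.add_pipe("attribute_ruler")
ar.add_patterns(loaded)

doc = nlp("I can't go")
lemmas = [token.lemma_ for token in doc]
print(lemmas)
```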
Sorry for not following up on this sooner, but with version 3.4.3 of spaCy and 3.4.1 of en_core_web_sm this seems to be addressed. We probably handled it around 3.1 but forgot to update this issue.
import spacy
nlp = spacy.load("en_core_web_sm")
texts = ["can't go", "won't go", "haven't gone", "I'm OK"]
for text in texts:
    for tok in nlp(text):
        print(tok.text, tok.lemma_, sep="\t")
    print("-----")
output:
ca can
n't not
go go
-----
wo will
n't not
go go
-----
have have
n't not
gone go
-----
I I
'm be
OK ok
-----
This issue has been automatically closed because it was answered and there was no follow-up discussion.