
Lemmas for contractions have changed with spaCy 3.0

Open shubhomoydas opened this issue 4 years ago • 8 comments

How to reproduce the behaviour

With spaCy 3.0.0:

nlp = spacy.load("en_core_web_trf")  # same result with en_core_web_lg
doc = nlp("Can't go to school")
print([(token.text, token.lemma_) for token in doc])

prints the output:

[('Ca', 'ca'), ("n't", "n't"), ('go', 'go'), ('to', 'to'), ('school', 'school')]

However, with spaCy 2.3.5, the same code:

nlp = spacy.load("en_core_web_lg")
doc = nlp("Can't go to school")
print([(token.text, token.lemma_) for token in doc])

prints the output:

[('Ca', 'can'), ("n't", 'not'), ('go', 'go'), ('to', 'to'), ('school', 'school')]

Observe the difference in the lemmas for "Can't" between the two spaCy versions above.
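For context on where the v2.3 lemmas came from: v2 applied a hardcoded exception table when splitting contractions, mapping each sub-token directly to its lemma. A minimal pure-Python sketch of that lookup (the table entries here are illustrative, not spaCy's actual data):

```python
# Minimal sketch of v2-style lemma exceptions: each contraction maps to
# (surface form, lemma) pairs for the tokens it is split into.
LEMMA_EXCEPTIONS = {
    "can't": [("Ca", "can"), ("n't", "not")],
    "won't": [("Wo", "will"), ("n't", "not")],
}

def lemmatize_with_exceptions(word):
    """Return (text, lemma) pairs, consulting the exception table first."""
    entry = LEMMA_EXCEPTIONS.get(word.lower())
    if entry is not None:
        return entry
    return [(word, word.lower())]  # fallback: naive lowercase "lemma"

print(lemmatize_with_exceptions("Can't"))  # → [('Ca', 'can'), ("n't", 'not')]
```

In v3.0.0 that table was removed from the tokenizer without being moved elsewhere, which is why "ca"/"n't" fall through to the fallback behavior above.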

Your Environment

  • Operating System: OSX
  • Python Version Used: 3.7.6
  • spaCy Version Used: 3.0.0
  • Environment Information:

Info about spaCy

  • spaCy version: 3.0.0
  • Platform: Darwin-19.5.0-x86_64-i386-64bit
  • Python version: 3.7.6
  • Pipelines: en_core_web_lg (3.0.0), en_core_web_sm (3.0.0), en_core_web_trf (3.0.0)

shubhomoydas avatar Feb 10 '21 20:02 shubhomoydas

That's a good point! We took the lemma exceptions out of the tokenizer (so the tokenizer only deals with tokenization) without moving them to a new component. We plan to add the lemma exceptions back the next time we release new models.

In the meantime, you can add them as attribute_ruler patterns with something like:

import spacy
nlp = spacy.load("en_core_web_sm")
# index selects which token in the matched span receives the attributes (default 0)
nlp.get_pipe("attribute_ruler").add([[{"LOWER": "ca"}, {"LOWER": "n't"}]], {"LEMMA": "can"})
nlp.get_pipe("attribute_ruler").add([[{"LOWER": "ca"}, {"LOWER": "n't"}]], {"LEMMA": "not"}, index=1)

See: https://spacy.io/api/attributeruler
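To make the `index` argument above concrete: it picks which token inside a matched pattern gets the attributes, 0-based within the match. A pure-Python sketch of that matching logic (illustrative only, not spaCy's implementation — tokens are plain dicts here):

```python
def apply_rule(tokens, pattern, attrs, index=0):
    """Scan tokens for a window matching `pattern` by lowercase text and
    set `attrs` on the token at position `index` within the match."""
    n = len(pattern)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(t["text"].lower() == p["LOWER"] for t, p in zip(window, pattern)):
            window[index].update(attrs)

tokens = [{"text": "Ca"}, {"text": "n't"}, {"text": "go"}]
pattern = [{"LOWER": "ca"}, {"LOWER": "n't"}]
apply_rule(tokens, pattern, {"LEMMA": "can"}, index=0)  # first token of match
apply_rule(tokens, pattern, {"LEMMA": "not"}, index=1)  # second token of match
print([(t["text"], t.get("LEMMA")) for t in tokens])
# → [('Ca', 'can'), ("n't", 'not'), ('go', None)]
```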

adrianeboyd avatar Feb 15 '21 12:02 adrianeboyd

Thanks for your response! The attribute_ruler is a great feature and good to be aware of. It would also help a lot when migrating from older spaCy versions to 3.0 if the lemma exceptions were added back.

shubhomoydas avatar Feb 15 '21 18:02 shubhomoydas

I believe these attributes should be set by default, because in v2 the tokenizer exceptions included them:

TOKENIZER_EXCEPTIONS = {
# do
    "don't": [
        {ORTH: "do", LEMMA: "do"},
        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
    "doesn't": [
        {ORTH: "does", LEMMA: "do"},
        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
    "didn't": [
        {ORTH: "did", LEMMA: "do"},
        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
# can
    "can't": [
        {ORTH: "ca", LEMMA: "can"},
        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
    "couldn't": [
        {ORTH: "could", LEMMA: "can"},
        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
# have
    "I've'": [
        {ORTH: "I", LEMMA: "I"},
        {ORTH: "'ve'", LEMMA: "have", NORM: "have", TAG: "VERB"}],
    "haven't": [
        {ORTH: "have", LEMMA: "have"},
        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
    "hasn't": [
        {ORTH: "has", LEMMA: "have"},
        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
    "hadn't": [
        {ORTH: "had", LEMMA: "have"},
        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
# will ('ll, won't, and wouldn't all lemmatize to will)
    "I'll": [
        {ORTH: "I", LEMMA: "I"},
        {ORTH: "'ll", LEMMA: "will", NORM: "will", TAG: "VERB"}],
    "he'll": [
        {ORTH: "he", LEMMA: "he"},
        {ORTH: "'ll", LEMMA: "will", NORM: "will", TAG: "VERB"}],
    "she'll": [
        {ORTH: "she", LEMMA: "she"},
        {ORTH: "'ll", LEMMA: "will", NORM: "will", TAG: "VERB"}],
    "it'll": [
        {ORTH: "it", LEMMA: "it"},
        {ORTH: "'ll", LEMMA: "will", NORM: "will", TAG: "VERB"}],
    "won't": [
        {ORTH: "wo", LEMMA: "will"},
        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
    "wouldn't": [
        {ORTH: "would", LEMMA: "will"},
        {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}],
# be
    "I'm": [
        {ORTH: "I", LEMMA: "I"},
        {ORTH: "'m", LEMMA: "be", NORM: "am", TAG: "VERB"}]
}
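Entries like those above can be converted mechanically into attribute_ruler-style rules. A hedged sketch of such a conversion (the function name and rule shape are illustrative; the real pipelines ship their own pattern files):

```python
def exceptions_to_ar_patterns(exceptions):
    """Turn v2-style tokenizer exceptions into attribute_ruler-style
    rules: one (pattern, attrs, index) triple per sub-token lemma."""
    rules = []
    for subtokensns in ():
        pass  # placeholder removed below
    for subtokens in exceptions.values():
        pattern = [{"LOWER": tok["ORTH"].lower()} for tok in subtokens]
        for i, tok in enumerate(subtokens):
            if "LEMMA" in tok:
                rules.append((pattern, {"LEMMA": tok["LEMMA"]}, i))
    return rules

exceptions = {
    "can't": [
        {"ORTH": "ca", "LEMMA": "can"},
        {"ORTH": "n't", "LEMMA": "not", "NORM": "not", "TAG": "RB"},
    ],
}
print(exceptions_to_ar_patterns(exceptions))
# → two rules: LEMMA "can" on token 0, LEMMA "not" on token 1
```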

veonua avatar Feb 16 '21 08:02 veonua

I assume these need to be set in 3.0 using the attribute_ruler and not tokenizer exceptions, because last I checked, tokenizer exceptions in 3.0 only allow ORTH and NORM (not LEMMA).

shubhomoydas avatar Feb 16 '21 09:02 shubhomoydas

I am new to spaCy 3 (and have no idea why both the attribute_ruler and tokenizer exceptions exist at all). I just ran into this strange behavior and am looking for a way to get pretty lemmas.

I think this issue needs to be re-opened.

veonua avatar Feb 16 '21 09:02 veonua

Yes, the attribute_ruler is the right place to add these exceptions in v3. We will need to add these exceptions to the attribute_ruler when we configure the pretrained pipelines for the next version of the models, which would be v3.0.1 (model version, not spaCy version).

adrianeboyd avatar Feb 16 '21 09:02 adrianeboyd

Okay, I am reopening this so it can be tracked to a concrete closure. If the rules are not too many, is it possible to get all the attribute_ruler rules that would bring 3.0.x behavior back in line with 2.3.x for contractions?

shubhomoydas avatar Feb 16 '21 09:02 shubhomoydas

We'll plan to have some updates in the v3.1 models. This isn't 100% of the v2.3 lemma exceptions, but covers the most common contractions. You can load the patterns like this:

import spacy
import srsly

nlp = spacy.load("en_core_web_sm")  # or any pipeline with a lemmatizer

# replace the default attribute_ruler with one that includes the patterns
patterns = srsly.read_json("ar_patterns.json")
nlp.remove_pipe("attribute_ruler")
ar = nlp.add_pipe("attribute_ruler", before="lemmatizer")
ar.add_patterns(patterns)

Updated patterns (rename to ar_patterns.json): ar_patterns.json.txt
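For readers who cannot open the attachment: a pattern file for `add_patterns` is a JSON list of objects with `patterns`, `attrs`, and an optional `index` key. A guessed minimal example of what entries in such a file might look like (illustrative content, not the actual attached file):

```python
import json

# Two entries restoring the v2 lemmas for "can't": LEMMA "can" on the
# first matched token (index 0) and LEMMA "not" on the second (index 1).
patterns = [
    {"patterns": [[{"LOWER": "ca"}, {"LOWER": "n't"}]],
     "attrs": {"LEMMA": "can"}, "index": 0},
    {"patterns": [[{"LOWER": "ca"}, {"LOWER": "n't"}]],
     "attrs": {"LEMMA": "not"}, "index": 1},
]
print(json.dumps(patterns[0], indent=2))
```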

adrianeboyd avatar Mar 10 '21 10:03 adrianeboyd

Sorry for not following up on this sooner, but with version 3.4.3 of spaCy and 3.4.1 of en_core_web_sm this seems to be addressed. We probably handled it around v3.1 but forgot to update this issue.

import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["can't go", "won't go", "haven't gone", "I'm OK"]
for text in texts:
    for tok in nlp(text):
        print(tok.text, tok.lemma_, sep="\t")
    print("-----")

output:

ca	can
n't	not
go	go
-----
wo	will
n't	not
go	go
-----
have	have
n't	not
gone	go
-----
I	I
'm	be
OK	ok
-----

polm avatar Dec 06 '22 11:12 polm

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions[bot] avatar Dec 14 '22 00:12 github-actions[bot]

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions[bot] avatar Jan 14 '23 00:01 github-actions[bot]