Lemmatization issues [Italian][Spanish][French]
Hello,
As a follow-up on #11298 and #11347, I would like to report some lemmatization problems with the spaCy 3.6 models for Italian, Spanish, and French. We did not have these issues with version 3.2.
How to reproduce the behaviour
Here are some examples:
| Language | Text | Returned Lemma | Expected Lemma |
|---|---|---|---|
| it | efficiente e cortesissima. | corteso | cortese |
| it | Voglio **disabbonarmi | Voglio | volere |
| it | Voglio disabbonaremi | disabbonare | disabbonare |
| it | Bella | Bella | bello |
| it | Perde il colore | Perdere | perdere |
| it | Filtrare | Filtrare | filtrare |
| it | Non si restringono al lavaggio | restringono | restringere |
| it | Cassiera gentile | Cassiera | cassiera |
| it | Trovo sempre un sacco | Trovo | trovare |
| it | prodotto ottimo | produrre | prodotto |
| it | Buongiorno Ho ricevuto un set di calzini diverso da quello da me selezionato nell'ordine Grazie [name] | divergere | diverso |
| it | buono bebe | Bebe | bébé |
| it | Richiedo la fatturazione elettronica | Richiedo | richiedere |
| it | Cercavo una felpina | Cercavo | cercare |
| it | Cercavo una felpina | felpino | felpina |
| it | Manca lo short blu del set codice 461680 | Manca | mancare |
| it | Soddisfatta ma nonostante 2 lavaggi perde ancora pelucchi | soddisfattare | soddisfatto |
| it | Ben fatto ma troppo grande | Ben | bene |
| it | Rapiditá nella consegna | Rapidità | rapidità |
| it | (allego screenshot) | allego | allegare |
| it | Prezzi competitivi Spedizione nei tempi previsti Acquistate per il black friday... | Spedizione | spedizione |
| it | Ottima merce | Ottimo | ottimo |
| it | quando mi rimborserete | rimborsareta | rimborsare |
| it | non carica la pagina | carico | caricare |
| it | È possibile avere un contatto o un riferimento del corriere? | Corriere | corriere |
| it | Quando potrò effettuare il mio acquisto? | potrò | potere |
| it | Quando lo consegnerete? | consegnareta | consegnare |
| it | Ok...casomai rifaccio l'ordine | rifacciareta | rifare |
| it | Compro sempre ordinando on-line e ritirando in negozio, | ritirira | ritirare |
| it | Un po’ corte di manica | corte | corto |
| it | Buonasera, mi è arrivato il pacco contenente tutto tranne il jeans blu con codice di vendita [number] | contenente | contenere |
| it | che ora chiudete | chiudetere | chiudere |
| it | non riesco a tracciarlo | tracciare lo | tracciare |
| es | Problema con reparto | Problema | problema |
| es | L'app esta caÃda no puedes realizar la compra | L'app | app |
| es | Fallo en el uso de la aplicación | Fallo | fallo |
| es | Contenta | contentar | contento |
| es | solicito BAJA de la suscripción a las newsletters | solicito | solicitar |
| es | "Desde hace 17 años compro en Kiabi" | compro | comprar |
| es | Correcto | Correcto | correcto |
| es | na cola se clientes en espera | clientser | cliente |
| es | Hola a todos: que horarios tenéis | tenéis | tener |
| es | Rapidez en pedidos | rapidez | Rapidez |
| es | Mala prenda | mala | mal / malo |
| es | Estupendo | Estupendo | estupendo |
| es | no me lo han enviado pero si cobrado. | cobrado | cobrar |
| fr | Bonjour, Aurez-vous la parure | Aurez | avoir |
| fr | via le formulaire sur Internet | Internet | internet |
| fr | Jolie modèle | Jolie | joli |
I guess the issue with tokens at the beginning of sentences (which are wrongly detected as PROPN) has already been mentioned many times.
Your Environment
- Python Version Used: 3.10
- spaCy Version Used: 3.6.1
Thanks for the examples, they'll be helpful when looking at how to improve the lemmatizers in the future!
Also, in French, "domicile" is lemmatized to "domicil", which is not correct, while "domiciles" (plural) correctly becomes "domicile".
A sanity check could be added: double lemmatization should not change the result.
```python
import spacy

NLP_FR = spacy.load("fr_core_news_md")

# "domicile" (singular) should stay "domicile"
print(NLP_FR("domicile")[0].lemma_)

# "domiciles" (plural) should become "domicile"
print(NLP_FR("domiciles")[0].lemma_)

# Double lemmatization should not change the result
print(NLP_FR(NLP_FR(NLP_FR("domiciles")[0].lemma_)[0].lemma_)[0].lemma_)
```
In version 3.7.2:
The French lemmatizer in the v3.7 trained pipelines is a rule-based lemmatizer that depends on the part-of-speech tags from the statistical tagger to choose which rules to apply. In these pipelines, the tags come from the morphologizer component.
Here it looks like it's tagging "domicil" as ADJ, so incorrect rules are applied.
The statistical components like the tagger and morphologizer aren't really intended for processing individual words out of context. Even just a smidgen of context improves the results:
```python
import spacy

nlp = spacy.load("fr_core_news_md")
assert nlp("le domicile")[1].lemma_ == "domicile"
```
If you want to double-check that the rules are working as intended (since sometimes it may be a problem with the rules or exceptions and not the POS tag), you can test just the lemmatizer component by providing the POS tag by hand:
```python
import spacy

nlp = spacy.load("fr_core_news_md")
doc = nlp.make_doc("domicile")  # just tokenization, no pipeline components
doc[0].pos_ = "NOUN"
assert nlp.get_pipe("lemmatizer")(doc)[0].lemma_ == "domicile"
```
Thanks for helping. It also explains why lemmatization does not work when disabling other modules: there is no error, but nothing happens. To save resources, since I only need the lemmatizer, I tried this:
```python
NLP_FR = spacy.load("fr_core_news_md", disable=["morphologizer", "parser", "senter", "ner", "attribute_ruler"])
```
There is no error when calling `.lemma_`, but it does nothing. It would be better to throw an error if possible.
In this example, the word "domicil" does not exist at all in the French dictionary.
Based on the data I have to handle, there are around 1185 items where double lemmatization provides different (better) results.
Version 3.2.0 kept "domicile" correct when it was submitted for lemmatization.
Would it be possible to train it with this project: http://www.lexique.org/? Their data is very good for French; they just have no model providing vectors.
We wouldn't use the lexique data in our pipelines due to the non-commercial clause in the CC BY-NC license, but if the license works for your use case and you'd prefer to use it, it's pretty easy to create a lookup table that you can use with the lookup mode of the built-in spaCy `Lemmatizer`.
We have an example of a French lookup lemmatizer table here:
https://github.com/explosion/spacy-lookups-data/blob/1d90ebc5fdc6ccd0f9b2447e47172986938a7ab5/spacy_lookups_data/data/fr_lemma_lookup.json
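A custom table can be plugged into a lookup-mode lemmatizer on a blank pipeline via spaCy's `Lookups` API. This is a sketch with a couple of hand-made entries; a real table could be built from lexique.org data, license permitting:

```python
import spacy
from spacy.lookups import Lookups

# Blank French pipeline: tokenizer only, no statistical components.
nlp = spacy.blank("fr")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})

# Tiny hand-made lookup table (the "lemma_lookup" name is what the
# lookup mode expects).
lookups = Lookups()
lookups.add_table("lemma_lookup", {"domiciles": "domicile", "chevaux": "cheval"})
lemmatizer.initialize(lookups=lookups)

print(nlp("domiciles")[0].lemma_)  # domicile
```

Words missing from the table simply come back unchanged, which avoids surprises like truncated unknown words.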
@adrianeboyd ok, thanks for the explanation.
I guess spaCy removes the last character when it encounters an unknown word to lemmatize. From my point of view, this mostly hurts:
```python
>>> NLP_FR("xxxxx")[0].lemma_
'xxxx'
>>> NLP_FR(NLP_FR("xxxxx")[0].lemma_)[0].lemma_
'xxx'
```
Overall it sounds like a lookup lemmatizer, which doesn't depend on context, might be a better fit for these kinds of examples. You can see how to switch from the rule-based lemmatizer to the lookup lemmatizer: https://spacy.io/models#design-modify
You can also provide your own lookup table instead of using the default one from spacy-lookups-data.
> I guess spaCy removes the last character when it encounters an unknown word to lemmatize. From my point of view, this mostly hurts.
This is not what is going on. Not that there can't be problems with the lemmatizer rules and tables, but I'd be very surprised if simply removing any final character were one of the existing rules for any of the rule-based lemmatizers provided in the trained pipelines.
You can take a closer look at the rules for French, which are here under `fr_lemma_*` (all the files except the `_lookup.json` one are used by the rule-based lemmatizer):
https://github.com/explosion/spacy-lookups-data/tree/1d90ebc5fdc6ccd0f9b2447e47172986938a7ab5/spacy_lookups_data/data
along with the language-specific lemmatizer implementation under `spacy/lang/fr/lemmatizer.py`.
These are suffix rewrite rules, and I think this is the rule that it's applying for the final x in nouns:
https://github.com/explosion/spacy-lookups-data/blob/1d90ebc5fdc6ccd0f9b2447e47172986938a7ab5/spacy_lookups_data/data/fr_lemma_rules.json#L62
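To illustrate why an unknown word loses its final "x", here is a toy re-implementation of suffix-rewrite application. This is not spaCy's actual code — the real lemmatizer also consults index and exception tables before accepting a candidate — and the rules below are a made-up subset for illustration:

```python
# Toy suffix-rewrite rules, loosely modeled on fr_lemma_rules.json.
RULES = {
    "noun": [
        ["aux", "al"],  # chevaux -> cheval
        ["x", ""],      # the rule that strips a final x
        ["s", ""],      # plural -s
    ],
}

def apply_rules(word: str, pos: str) -> str:
    """Apply the first suffix rule whose left-hand side matches the word."""
    for old, new in RULES.get(pos, []):
        if word.endswith(old):
            return word[: len(word) - len(old)] + new
    return word

print(apply_rules("chevaux", "noun"))  # cheval
print(apply_rules("xxxxx", "noun"))    # xxxx -- an unknown word loses its x
```

With no index to reject bad candidates, a string like "xxxxx" matches the bare `x` rule on every pass, which is exactly the repeated truncation seen above.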
@adrianeboyd thanks. Using the rules returns the same results as before with 3.2, which are much better in our case. Also, with the rules, the "x" in "xxxxx" is no longer removed.