French lemmatizer/POS tagger issues, lemmatizer suggestion
Hello,
As a follow-up to #11298, I would like to share some examples for which the French tagger and lemmatizer fail to return the correct results (spaCy v3.4.1 and fr_core_news_lg v3.4.0).
I'm using the "rule" mode for the French lemmatizer, but it fails to give the right lemma for some "simple" examples (e.g., for "je vous confirme mon numéro" it returns "confirme" instead of "confirmer").
So I tried to figure out how to solve this by debugging its behavior in `rule_lemmatize`, and I came across this:
https://github.com/explosion/spaCy/blob/3e4cf1bbe1745a55ede0dece31353aebc3f82729/spacy/lang/fr/lemmatizer.py#L74-L77
If we don't find a rule for the token, we check the lookup table.
However, while debugging, I think the check `string in lookup_table.keys()` will never be satisfied:
`string = token.text`, and `lookup_table.keys()` is an `odict_keys` view of integer hashes: `odict_keys([11901859001352538922, 11991225512756778597, ...])`
If I'm not mistaken, what is actually meant to be tested is `token.norm in lookup_table.keys()`.
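A quick sketch of the difference, assuming I'm reading `spacy.lookups.Table` correctly (it hashes string keys on insertion, so `.keys()` exposes raw hashes rather than strings):

```python
from spacy.lookups import Table

# Table hashes string keys internally when storing them.
table = Table(name="lemma_lookup")
table.set("confirme", "confirmer")

print("confirme" in table.keys())  # False: keys() contains integer hashes
print("confirme" in table)         # True: Table.__contains__ hashes the string first
print(table.get("confirme"))       # "confirmer": Table.get also hashes the key
```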
If that's the case, I think we can safely refactor the code as:
```python
if not forms:
    forms.extend(oov_forms)
    forms.append(self.lookup_lemmatize(token)[0])
```
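In other words, the lookup fallback would run unconditionally whenever no rule produced a form, instead of being gated on a membership check that compares a raw string against hashed keys.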
[UPDATE]: after some more checking, I found that `string in lookup_table.keys()` will invoke `__getitem__` of `Table`. But I still don't understand why we don't simply use `self.lookup_lemmatize(token)[0]` as I suggested.
Even if the token is not in the lookup table, `self.lookup_lemmatize(token)[0]` will do the same thing as line 77.
This would solve the lemmatization problem for examples like "je vous confirme mon numéro..." ("confirme" would be lemmatized to "confirmer" instead of "confirme").
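For context, here is roughly what `lookup_lemmatize` does as far as I can tell (paraphrased from memory, not the verbatim source; see `spacy/pipeline/lemmatizer.py` for the authoritative version):

```python
# Rough paraphrase of Lemmatizer.lookup_lemmatize as I read it in v3.4.x.
# Table.get() hashes the string key itself, so the membership problem above
# does not apply here, and a missing token falls back to its surface form.
def lookup_lemmatize(self, token):
    lookup_table = self.lookups.get_table("lemma_lookup", {})
    result = lookup_table.get(token.text, token.text)
    if isinstance(result, str):
        result = [result]
    return result
```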
Here are some other examples for which we have tagging and lemmatization issues:
| Text | Word analysed | Lemma detected | Lemma expected | POS detected | POS expected | Comments |
|---|---|---|---|---|---|---|
| je remarque qu'elle n'est toujours pas parti.... | parti | parti | partir | ADJ | VERB | "Parti" is a past participle; it should be analyzed as a verb, not as an adjective |
| alors que je l'avais annulé via le formulaire sur Internet. | Internet | Internet | internet | PROPN | NOUN | |
| Je vous confirme que mon numéro de téléphone est bien le [phone_number]. | confirme | confirme | confirmer | VERB | VERB | Problem "POS tagging" + lemma => same problem with the verb "informe" |
| Comme vous me l’avez demandé par mail je vous communique mon numéro | communique | communique | communiquer | VERB | VERB | Problem "POS tagging" + lemma |
| Je vous redonne mon numéro | redonne | redonne | redonner | VERB | VERB | Problem "POS tagging" + lemma |
| le numéro de téléphone de votre service client communiqué en bas de vos mails | communiqué | communiquer | communiquer | ADJ | VERB | Problem "POS tagging" |
| Avez-vous un service client téléphone? | téléphone | téléphon | téléphone | ADJ | NOUN | Problem "POS tagging" + lemma |
| MERCI DE SUPPRIMER DEFINITIVEMENT MON COMPTE ET TOUTES LES DONNEES PERSONNELLES ASSOCIEES A CELUI-CI. | SUPPRIMER | supprimer | supprimer | NOUN | VERB | Problem "POS tag" |
| MERCI DE SUPPRIMER DEFINITIVEMENT MON COMPTE ET TOUTES LES DONNEES PERSONNELLES ASSOCIEES A CELUI-CI. | DEFINITIVEMENT | definitivemer | définitivement | VERB | ADV | Problem "POS tag" + lemma |
| MERCI DE SUPPRIMER DEFINITIVEMENT MON COMPTE ET TOUTES LES DONNEES PERSONNELLES ASSOCIEES A CELUI-CI. | DONNEES | donnee | donnée | NOUN | NOUN | Problem "lemma" |
| MERCI DE SUPPRIMER DEFINITIVEMENT MON COMPTE ET TOUTES LES DONNEES PERSONNELLES ASSOCIEES A CELUI-CI. | ASSOCIEES | associee | associer | NOUN | VERB | Problem "POS tag" + lemma |
| JE SOUHAITE IMPERATIVEMENT ETRE SUPPRIMEE DE VOS LISTES DE DIFFUSION. | SOUHAITE | souhait | souhaiter | ADJ | VERB | Problem "POS tag" + lemma |
| JE SOUHAITE IMPERATIVEMENT ETRE SUPPRIMEE DE VOS LISTES DE DIFFUSION. | IMPERATIVEMENT | imperativement | impérativement | NOUN | ADV | Problem "POS tag" + lemma |
| JE SOUHAITE IMPERATIVEMENT ETRE SUPPRIMEE DE VOS LISTES DE DIFFUSION. | SUPPRIMEE | supprime | supprimer | ADJ | VERB | Problem "POS tag" + lemma |
| JE SOUHAITE IMPERATIVEMENT ETRE SUPPRIMEE DE VOS LISTES DE DIFFUSION. | DIFFUSION | DIFFUSION | diffusion | PROPN | NOUN | Problem "POS tag" + lemma |
How to reproduce the behaviour
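A minimal reproduction along these lines shows the lemma issue (assuming `fr_core_news_lg` v3.4.0 is installed and the lemmatizer runs in its default "rule" mode):

```python
import spacy

# Assumes fr_core_news_lg v3.4.0 with the lemmatizer in its default "rule" mode.
nlp = spacy.load("fr_core_news_lg")

doc = nlp("je vous confirme mon numéro")
for token in doc:
    print(token.text, token.pos_, token.lemma_)
# Observed: "confirme" keeps the lemma "confirme" instead of the expected "confirmer".
```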
Your Environment
- Operating System:
- Python Version Used: 3.8
- spaCy Version Used: 3.4
Thanks for pointing this out! This does look like a bug in the lemmatizer, and it seems to also affect Catalan. I've opened a PR with a fix at #11382.
About the refactoring: having a separate fallback in the French lemmatizer itself may be there to avoid relying on the implementation of the lookup tables, in case they're ever swapped out, so we need to check that.
Regarding the POS tag issues, I haven't looked at that in detail yet, but keeping in mind #3052, is this a degradation from a previous version, or is it plausible this is just normal errors?
> Regarding the POS tag issues, I haven't looked at that in detail yet, but keeping in mind #3052, is this a degradation from a previous version, or is it plausible this is just normal errors?
It's actually a degradation from a previous version, but I haven't been able to pin down exactly what changed between v3.2 and v3.4.
Thanks for the clarification, let us know if you have more details.
To clarify the status of this issue, the fix for the lemmatizer issue has been merged, so that should be resolved, and we just need to look into the POS tag issue.
Marking this as resolved. Feel free to open a new issue if the POS problems persist!