French lemmatizer/Pos tagger issues, lemmatizer suggestion

Open databill86 opened this issue 3 years ago • 4 comments

Hello,

As a follow-up to #11298, I would like to share some examples for which the French tagger and lemmatizer fail to return the correct results (spaCy v3.4.1 and fr_core_news_lg v3.4.0).

I'm using the "rule" mode for the French lemmatizer, but it fails to give the right lemma for some "simple" examples (e.g., for je vous confirme mon numéro, it returns "confirme" instead of "confirmer"). So I tried to figure out how to solve this by debugging its behavior in rule_lemmatize, and I came across this:

https://github.com/explosion/spaCy/blob/3e4cf1bbe1745a55ede0dece31353aebc3f82729/spacy/lang/fr/lemmatizer.py#L74-L77

If we don't find a rule for the token we check in the lookup table. However, while debugging, I think that string in lookup_table.keys() will never be satisfied:

  • string = token.text, and
  • lookup_table.keys() is an odict_keys view of integer hashes: odict_keys([11901859001352538922, 11991225512756778597, ...])

If I'm not mistaken, I think what is meant to be tested is: token.norm in lookup_table.keys().
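The mismatch described above can be sketched with a toy table. This is only an illustration under stated assumptions: HashedTable and string_id are hypothetical stand-ins, not spaCy's Table class or its hash function. The only property borrowed from the issue is that string keys are stored as integer hashes in an OrderedDict.

```python
from collections import OrderedDict
import hashlib


def string_id(key):
    """Hypothetical stand-in for the string-to-hash step (not spaCy's hash)."""
    if isinstance(key, int):
        return key
    digest = hashlib.sha256(key.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashedTable(OrderedDict):
    """Toy table that stores string keys as integer hashes (assumption)."""

    def __setitem__(self, key, value):
        OrderedDict.__setitem__(self, string_id(key), value)

    def __getitem__(self, key):
        return OrderedDict.__getitem__(self, string_id(key))

    def __contains__(self, key):
        return OrderedDict.__contains__(self, string_id(key))


table = HashedTable()
table["confirme"] = "confirmer"

# Membership on the keys view compares the raw string with the stored ints.
in_keys_view = "confirme" in table.keys()
# Membership on the table itself hashes the string first.
in_table = "confirme" in table
print(in_keys_view, in_table)
```

On CPython the keys-view check comes out False and the check on the table itself comes out True, which is the behaviour the two bullet points above point at.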

If that's the case, I think we can safely refactor the code as:

        if not forms:
            forms.extend(oov_forms)        
            forms.append(self.lookup_lemmatize(token)[0])
        

[UPDATE]: after some more checking, I found that string in lookup_table.keys() will invoke __getitem__ of Table. But I still don't get why we don't simply use self.lookup_lemmatize(token)[0] as I suggested.

Even if the token is not in the lookup table, self.lookup_lemmatize(token)[0] will do the same thing as line 77.

This will solve the lemmatization problems for examples like: "je vous confirme mon numéro..." (confirme will be lemmatized to "confirmer" instead of "confirme").
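A minimal sketch of the proposed fallback, outside of spaCy: the function names, the plain dict used as a lookup table, and the toy data are all assumptions made for illustration. It also assumes, as stated above, that the lookup step returns the token text unchanged when the table has no entry.

```python
def lookup_lemmatize(text, lookup_table):
    """Return the table entry for text, or text itself when it is missing."""
    return [lookup_table.get(text, text)]


def fallback_forms(text, forms, oov_forms, lookup_table):
    """Apply the proposed fallback when the rules produced no forms."""
    forms = list(forms)
    if not forms:
        forms.extend(oov_forms)
        forms.append(lookup_lemmatize(text, lookup_table)[0])
    return forms


toy_table = {"confirme": "confirmer"}  # illustrative data only
known = fallback_forms("confirme", [], [], toy_table)
unknown = fallback_forms("xyzzy", [], [], toy_table)
print(known, unknown)
```

With this shape, a word that is in the table gets its lookup lemma and a word that is not simply falls back to its own text, so no separate membership test is needed.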

Here are some other examples for which we have tagging and lemmatization issues:

| Text | Word analysed | Lemma detected | Lemma expected | POS detected | POS expected | Comments |
|---|---|---|---|---|---|---|
| je remarque qu'elle n'est toujours pas parti.... | parti | parti | partir | ADJ | VERB | "Parti" is a past participle it should be analyzed as a verb and not as an adjective |
| alors que je l'avais annulé via le formulaire sur Internet. | Internet | Internet | internet | PROPN | NOUN | |
| Je vous confirme que mon numéro de téléphone est bien le [phone_number]. | confirme | confirme | confirmer | VERB | VERB | Problem "pos tagging" + LEMMA => same problem with the VERB "informe" |
| Comme vous me l’avez demandé par mail je vous communique mon numéro | communique | communique | communiquer | VERB | VERB | Problem "pos tagging" + LEMMA |
| Je vous redonne mon numéro | redonne | redonne | redonner | VERB | VERB | Problem "pos tagging" + LEMMA |
| le numéro de téléphone de votre service client communiqué en bas de vos mails | communiqué | communiquer | communiquer | ADJ | VERB | Problem "pos tagging" |
| Avez-vous un service client téléphone? | téléphone | téléphon | téléphone | ADJ | NOUN | Problem "pos tagging" + LEMMA |
| MERCI DE SUPPRIMER DEFINITIVEMENT MON COMPTE ET TOUTES LES DONNEES PERSONNELLES ASSOCIEES A CELUI-CI. | SUPPRIMER | supprimer | supprimer | NOUN | VERB | problem "pos tag" |
| MERCI DE SUPPRIMER DEFINITIVEMENT MON COMPTE ET TOUTES LES DONNEES PERSONNELLES ASSOCIEES A CELUI-CI. | DEFINITIVEMENT | definitivemer | définitivement | VERB | ADV | problem "pos tag" + lemma |
| MERCI DE SUPPRIMER DEFINITIVEMENT MON COMPTE ET TOUTES LES DONNEES PERSONNELLES ASSOCIEES A CELUI-CI. | DONNEES | donnee | donnée | NOUN | NOUN | problem "lemma" + lemma |
| MERCI DE SUPPRIMER DEFINITIVEMENT MON COMPTE ET TOUTES LES DONNEES PERSONNELLES ASSOCIEES A CELUI-CI. | ASSOCIEES | associee | associer | NOUN | VERB | problem "pos tag" + lemma |
| JE SOUHAITE IMPERATIVEMENT ETRE SUPPRIMEE DE VOS LISTES DE DIFFUSION. | SOUHAITE | souhait | souhaiter | ADJ | VERB | problem "pos tag" + lemma |
| JE SOUHAITE IMPERATIVEMENT ETRE SUPPRIMEE DE VOS LISTES DE DIFFUSION. | IMPERATIVEMENT | imperativement | impérativement | NOUN | ADV | problem "pos tag" + lemma |
| JE SOUHAITE IMPERATIVEMENT ETRE SUPPRIMEE DE VOS LISTES DE DIFFUSION. | SUPPRIMEE | supprime | supprimer | ADJ | VERB | problem "pos tag" + lemma |
| JE SOUHAITE IMPERATIVEMENT ETRE SUPPRIMEE DE VOS LISTES DE DIFFUSION. | DIFFUSION | DIFFUSION | diffusion | PROPN | NOUN | problem "pos tag" + lemma |

How to reproduce the behaviour

Your Environment

  • Operating System:
  • Python Version Used: 3.8
  • spaCy Version Used: 3.4

databill86, Aug 19 '22 16:08

Thanks for pointing this out, this does look like a bug in the lemmatizer, and seems to also affect Catalan. I've opened a PR with a fix at #11382.

About the refactoring - having a separate fallback in the French lemmatizer itself may be to avoid relying on the implementation of the lookup tables in case they're swapped out or something, so we need to check that.

polm, Aug 26 '22 07:08

Regarding the POS tag issues, I haven't looked at that in detail yet, but keeping in mind #3052, is this a degradation from a previous version, or is it plausible this is just normal errors?

polm, Aug 26 '22 07:08

> Regarding the POS tag issues, I haven't looked at that in detail yet, but keeping in mind #3052, is this a degradation from a previous version, or is it plausible this is just normal errors?

It's a degradation from the previous version, but I couldn't find what exactly changed between v3.2 and v3.4.

databill86, Aug 26 '22 08:08

Thanks for the clarification, let us know if you have more details.

To clarify the status of this issue, the fix for the lemmatizer issue has been merged, so that should be resolved, and we just need to look into the POS tag issue.

polm, Aug 30 '22 09:08

Marking this as resolved. Feel free to open a new issue if the POS problems persist!

rmitsch, Feb 06 '23 13:02

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions[bot], Feb 14 '23 00:02

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions[bot], Mar 17 '23 00:03