spaCy: Spanish lemmatizer returns nonexistent lemmas for accented 3rd-person singular forms of the present indicative

How to reproduce the behaviour

import spacy
nlp = spacy.load("es_core_news_lg")

def parse(sentence):
    doc = nlp(sentence)
    # doc[0] is the pronoun "él"; doc[1] is the verb under test.
    token = doc[1]
    print((token.text, token.lemma_, token.pos_, token.morph))

parse("él acentúa")
parse("él actúa")
parse("él amplía")
parse("él continúa")
parse("él desvía")
parse("él desvirtúa")
parse("él envía")
parse("él guía")
parse("él puntúa")
parse("él sitúa")

Current output:

('acentúa', 'acentúar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('actúa', 'actúar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('amplía', 'amplíar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('continúa', 'continúar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('desvía', 'desvíar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('desvirtúa', 'desvirtúar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('envía', 'envíar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('guía', 'guíar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('puntúa', 'puntúar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('sitúa', 'sitúar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)

Expected output:

('acentúa', 'acentuar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('actúa', 'actuar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('amplía', 'ampliar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('continúa', 'continuar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('desvía', 'desviar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('desvirtúa', 'desvirtuar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('envía', 'enviar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('guía', 'guiar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('puntúa', 'puntuar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('sitúa', 'situar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)

Note that many more verbs have the same problem. No Spanish infinitive ends in -íar or -úar, so I guess these lemmas should be replaced by their unaccented versions (-iar, -uar).
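As a stopgap, the replacement described above could be applied to the lemmatizer's output after the fact. The following is only a minimal sketch in plain Python: the helper name fix_accented_infinitive is hypothetical, it is not part of spaCy, and it assumes that no legitimate Spanish lemma ends in -íar or -úar.

```python
import unicodedata


def fix_accented_infinitive(lemma: str) -> str:
    """Hypothetical helper (not a spaCy API): map -íar/-úar endings to -iar/-uar."""
    # Normalize to NFC so that "í" and "ú" are single code points.
    lemma = unicodedata.normalize("NFC", lemma)
    if lemma.endswith("íar"):
        return lemma[:-3] + "iar"
    if lemma.endswith("úar"):
        return lemma[:-3] + "uar"
    return lemma


# Lemmas copied from the "Current output" section of this report.
bad_lemmas = ["acentúar", "actúar", "amplíar", "continúar", "desvíar"]
print([fix_accented_infinitive(lemma) for lemma in bad_lemmas])
# ['acentuar', 'actuar', 'ampliar', 'continuar', 'desviar']
```

This only patches the symptom on token.lemma_ strings; it does not change the rules that generate the wrong lemmas.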

Your Environment

spaCy version: 3.3.1
Platform: Linux-5.15.0-43-generic-x86_64-with-glibc2.29
Python version: 3.8.10
Pipelines: es_core_news_sm (3.3.0), es_dep_news_trf (3.3.0), es_core_news_lg (3.3.0), es_core_news_md (3.3.0)

Aug 04 '22 09:08 nimbusaeta

Thanks for the report! These forms are generated by rules in spacy-lookups-data. I think the relevant rules for these forms are the following:

https://github.com/explosion/spacy-lookups-data/blob/77ce8d98dda96f76a4fdbd48b1f8c03cf3ed9577/spacy_lookups_data/data/es_lemma_rules.json#L206-L259

These are suffix replacement rules that are applied one by one in order, so I think you could add rules towards the end of this list that generate the correct forms. Note that I haven't tested this in detail, so there may be some interactions between rules that I have overlooked.
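To make the idea of ordered suffix replacement concrete, here is a rough sketch in plain Python. It is not spaCy's implementation: the function apply_suffix_rules and the rule pairs are made up for illustration and are not copied from es_lemma_rules.json, and how spaCy chooses among several candidates is deliberately left out.

```python
from typing import List, Sequence, Tuple


def apply_suffix_rules(word: str, rules: Sequence[Tuple[str, str]]) -> List[str]:
    """Return one candidate lemma per matching (old, new) suffix rule, in rule order."""
    candidates = []
    for old, new in rules:
        if word.endswith(old):
            # Cut off the old suffix and attach the new one.
            candidates.append(word[: len(word) - len(old)] + new)
    return candidates


# Illustrative rule pairs only (assumption): a general "-a" rule followed by
# more specific rules for the accented endings discussed in this issue.
rules = [("a", "ar"), ("úa", "uar"), ("ía", "iar")]
print(apply_suffix_rules("actúa", rules))  # ['actúar', 'actuar']
print(apply_suffix_rules("envía", rules))  # ['envíar', 'enviar']
```

In this toy version the general rule still produces the accented candidate, which is why the position of a new rule in the list, and its interaction with the existing ones, needs checking against the real lemmatizer.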

You can also modify the rules in the pipeline directly, in the table returned by nlp.get_pipe("lemmatizer").lookups.get_table("lemma_rules"). Be aware that there's a lemmatizer cache, so modify the rules before processing any texts; otherwise you may get a cached lemma instead of seeing the new rules applied.

You are welcome to create a PR for spacy-lookups-data if you're interested, otherwise we'll put this on our to-do list.

Aug 08 '22 12:08 adrianeboyd

Thank you for the pointer! I had looked for the place where the rules were missing, without success. I'll add them and send a PR as soon as I can.

Aug 08 '22 15:08 nimbusaeta

Thanks @nimbusaeta, this will be great to have!

There are a few more verb forms that behave in the same way; it would be great if you could add rules for them:

present subjunctive, all three singular forms and third person plural, e.g. amplíe, actúe (rules from line 1286)
second person singular imperative, e.g. amplía, actúa (rules from line 4066)

Aug 30 '22 16:08 richardpaulhudson

PR updated! Happy to help :)

Aug 31 '22 18:08 nimbusaeta

Hi @nimbusaeta, thanks for your commit. I've also added the third-person plural present subjunctive forms. Could you please check that you agree these are linguistically correct? If you're happy with them, I'll approve the PR.

Sep 06 '22 08:09 richardpaulhudson

This issue has been automatically closed because it was answered and there was no follow-up discussion.

Sep 16 '22 07:09 github-actions[bot]

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Oct 17 '22 00:10 github-actions[bot]
