spaCy
Spanish lemmatizer returns nonexistent lemmas for accented third-person singular present indicative forms
How to reproduce the behaviour
import spacy
nlp = spacy.load("es_core_news_lg")
def parse(sentence):
    doc = nlp(sentence)
    token = doc[1]
    print((token.text, token.lemma_, token.pos_, token.morph))
parse("él acentúa")
parse("él actúa")
parse("él amplía")
parse("él continúa")
parse("él desvía")
parse("él desvirtúa")
parse("él envía")
parse("él guía")
parse("él puntúa")
parse("él sitúa")
Current output:
('acentúa', 'acentúar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('actúa', 'actúar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('amplía', 'amplíar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('continúa', 'continúar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('desvía', 'desvíar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('desvirtúa', 'desvirtúar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('envía', 'envíar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('guía', 'guíar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('puntúa', 'puntúar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('sitúa', 'sitúar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
Expected output:
('acentúa', 'acentuar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('actúa', 'actuar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('amplía', 'ampliar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('continúa', 'continuar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('desvía', 'desviar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('desvirtúa', 'desvirtuar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('envía', 'enviar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('guía', 'guiar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('puntúa', 'puntuar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('sitúa', 'situar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
Note that many more verbs have this same problem. No Spanish infinitive ends in -íar or -úar, so I guess these lemmas should be replaced by their unaccented versions (-iar and -uar).
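Until the rules are fixed, the replacement I'm guessing at could be applied as a post-processing step. This is only a sketch of a workaround: `fix_accented_infinitive` is a hypothetical helper of my own, not part of spaCy, and it assumes that any lemma ending in -íar or -úar is wrong.

```python
import re
import unicodedata

# Hypothetical helper, not part of spaCy. It assumes that a lemma ending
# in -íar or -úar is never a valid Spanish infinitive.
_ACCENTED_INFINITIVE = re.compile(r"([íú])ar$")
_UNACCENTED = {"í": "i", "ú": "u"}


def fix_accented_infinitive(lemma: str) -> str:
    """Rewrite a trailing -íar/-úar as -iar/-uar; leave other lemmas alone."""
    # Compose the string first so that "i" + combining acute matches "í".
    lemma = unicodedata.normalize("NFC", lemma)
    return _ACCENTED_INFINITIVE.sub(
        lambda match: _UNACCENTED[match.group(1)] + "ar", lemma
    )


for bad_lemma in ["actúar", "amplíar", "cantar"]:
    print(bad_lemma, "->", fix_accented_infinitive(bad_lemma))
```

It would be called on `token.lemma_` after the pipeline has run, so it hides the symptom without touching the rules that produce it.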
Your Environment
- spaCy version: 3.3.1
- Platform: Linux-5.15.0-43-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Pipelines: es_core_news_sm (3.3.0), es_dep_news_trf (3.3.0), es_core_news_lg (3.3.0), es_core_news_md (3.3.0)
Thanks for the report! These forms are generated by rules in spacy-lookups-data. I think the relevant rules are these:
https://github.com/explosion/spacy-lookups-data/blob/77ce8d98dda96f76a4fdbd48b1f8c03cf3ed9577/spacy_lookups_data/data/es_lemma_rules.json#L206-L259
These are suffix replacement rules that are applied one by one in order, so I think you could add rules towards the end of this list that generate the correct forms. Note that I haven't tested this in detail, so there may be interactions between rules that I have overlooked.
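To make "suffix replacement rules applied in order" concrete, here is a simplified sketch in plain Python. It is not the lemmatizer's actual code: `apply_suffix_rules` and the three rules are made up for illustration, and the sketch does not model how the lemmatizer chooses among the candidates it generates.

```python
from typing import List, Sequence, Tuple

Rule = Tuple[str, str]  # (old_suffix, new_suffix)


def apply_suffix_rules(word: str, rules: Sequence[Rule]) -> List[str]:
    """Return one candidate lemma per matching rule, in rule order."""
    candidates: List[str] = []
    for old, new in rules:
        if word.endswith(old):
            # Drop the old suffix and attach the new one.
            candidate = word[: len(word) - len(old)] + new
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates


# Hypothetical rules, not copied from es_lemma_rules.json.
generic_only = [("a", "ar")]
with_accent_rules = generic_only + [("úa", "uar"), ("ía", "iar")]

print(apply_suffix_rules("actúa", generic_only))
print(apply_suffix_rules("actúa", with_accent_rules))
print(apply_suffix_rules("amplía", with_accent_rules))
```

With only the generic rule, the accent is carried over into the candidate, which matches the behaviour reported above. The more specific rules add the unaccented candidate.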
You can also modify the rules in the pipeline directly in the tables in nlp.get_pipe("lemmatizer").lookups.get_table("lemma_rules"). Be aware that there's a lemmatizer cache, so modify the rules before processing any texts to make sure you're not getting a cached lemma instead of seeing the new rules applied.
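A rough sketch of that in-pipeline approach follows. The helper `extend_rules` is hypothetical, and it runs here against a plain dict that stands in for the real table. The `"verb"` key and the rules themselves are assumptions for illustration, so check es_lemma_rules.json for the actual layout before adapting this.

```python
from typing import List, MutableMapping

RuleList = List[List[str]]


def extend_rules(
    table: MutableMapping[str, RuleList], key: str, new_rules: RuleList
) -> None:
    """Append rules that are not already present under table[key], keeping order."""
    existing = list(table.get(key, []))
    for rule in new_rules:
        if rule not in existing:
            existing.append(rule)
    table[key] = existing


# Stand-in for the real table; the key and rules are assumptions.
demo_table = {"verb": [["a", "ar"]]}
extend_rules(demo_table, "verb", [["úa", "uar"], ["ía", "iar"]])
print(demo_table["verb"])

# Untested outline of the same step against a loaded pipeline:
# import spacy
# nlp = spacy.load("es_core_news_lg")
# table = nlp.get_pipe("lemmatizer").lookups.get_table("lemma_rules")
# extend_rules(table, "verb", [["úa", "uar"], ["ía", "iar"]])  # key is an assumption
# doc = nlp("él actúa")  # process text only after the rules are modified
```

The ordering in the outline matters because of the cache: if any text is processed before the table is changed, a cached lemma may be returned instead of one produced by the new rules.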
You are welcome to create a PR for spacy-lookups-data if you're interested, otherwise we'll put this on our to-do list.
Thank you for the pointer! I looked for the place where the rules were missing without any success. I'll add them and send a PR as soon as I can.
Thanks @nimbusaeta, this will be great to have!
There are a few more verb forms that behave in the same way; it would be great if you could add rules for them:
- present subjunctive, all three singular forms and third person plural, e.g. amplíe, actúe (rules from line 1286)
- second person singular imperative, e.g. amplía, actúa (rules from line 4066)
PR updated! Happy to help :)
Hi @nimbusaeta, thanks for your commit. I've also added the third-person plural present subjunctive forms. Could you please check you agree these are linguistically correct, and if you're happy with them I'll approve the PR.
This issue has been automatically closed because it was answered and there was no follow-up discussion.