Spanish lemmatizer returns nonexistent lemmas for accented 3rd-person singular forms of present indicative

Open nimbusaeta opened this issue 3 years ago • 2 comments

How to reproduce the behaviour

import spacy
nlp = spacy.load("es_core_news_lg")

def parse(sentence):
    doc = nlp(sentence)
    token = doc[1]
    print((token.text, token.lemma_, token.pos_, token.morph))

parse("él acentúa")
parse("él actúa")
parse("él amplía")
parse("él continúa")
parse("él desvía")
parse("él desvirtúa")
parse("él envía")
parse("él guía")
parse("él puntúa")
parse("él sitúa")

Current output:

('acentúa', 'acentúar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('actúa', 'actúar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('amplía', 'amplíar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('continúa', 'continúar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('desvía', 'desvíar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('desvirtúa', 'desvirtúar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('envía', 'envíar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('guía', 'guíar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('puntúa', 'puntúar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('sitúa', 'sitúar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)

Expected output:

('acentúa', 'acentuar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('actúa', 'actuar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('amplía', 'ampliar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('continúa', 'continuar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('desvía', 'desviar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('desvirtúa', 'desvirtuar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('envía', 'enviar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('guía', 'guiar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('puntúa', 'puntuar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)
('sitúa', 'situar', 'VERB', Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)

Note that many more verbs have this same problem. No Spanish verb has an infinitive ending in -íar or -úar, so I guess these lemmas should be replaced by their unaccented versions.
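
The replacement suggested here can be sketched as a post-processing step on the lemma string. This is only an illustration, assuming that every lemma ending in -íar or -úar is wrong; `strip_infinitive_accent` is a hypothetical helper, not part of spaCy, and it works around the symptom rather than fixing the rules.

```python
# Hypothetical helper (not a spaCy API): map accented infinitive endings
# to their unaccented counterparts.
ACCENTED_ENDINGS = {"íar": "iar", "úar": "uar"}


def strip_infinitive_accent(lemma: str) -> str:
    """Return the lemma with a trailing -íar/-úar rewritten to -iar/-uar."""
    for accented, plain in ACCENTED_ENDINGS.items():
        if lemma.endswith(accented):
            return lemma[: -len(accented)] + plain
    # Anything else is left untouched.
    return lemma


for lemma in ["actúar", "amplíar", "cantar"]:
    print(lemma, "->", strip_infinitive_accent(lemma))
```

In a pipeline, such a function could be applied to `token.lemma_` after processing, but correcting the lemmatizer rules themselves would be the cleaner fix.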

Your Environment

  • spaCy version: 3.3.1
  • Platform: Linux-5.15.0-43-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Pipelines: es_core_news_sm (3.3.0), es_dep_news_trf (3.3.0), es_core_news_lg (3.3.0), es_core_news_md (3.3.0)

nimbusaeta avatar Aug 04 '22 09:08 nimbusaeta

Thanks for the report! These forms are generated by rules in spacy-lookups-data. I think these are the rules responsible:

https://github.com/explosion/spacy-lookups-data/blob/77ce8d98dda96f76a4fdbd48b1f8c03cf3ed9577/spacy_lookups_data/data/es_lemma_rules.json#L206-L259

These are suffix replacement rules that are applied one by one in order, so I think you could add rules towards the end of this list that generate the correct forms (but note that I haven't tested this in detail, there may be some interactions between rules that I have overlooked).

You can also modify the rules in the pipeline directly in the tables in nlp.get_pipe("lemmatizer").lookups.get_table("lemma_rules"). Be aware that there's a lemmatizer cache, so modify the rules before processing any texts to make sure you're not getting a cached lemma instead of seeing the new rules applied.
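
A rough sketch of editing such a table in place follows. A plain dict stands in for the object returned by `nlp.get_pipe("lemmatizer").lookups.get_table("lemma_rules")` so that the snippet runs without a model. The `"verb"` key, the `[suffix, replacement]` pair layout, the example rules and the `append_rules` helper are all assumptions made for illustration; none of them has been checked against es_lemma_rules.json.

```python
def append_rules(table, key, new_rules):
    """Append [suffix, replacement] pairs to table[key], skipping duplicates."""
    rules = list(table.get(key, []))
    for rule in new_rules:
        if rule not in rules:
            rules.append(rule)
    # Write the extended list back so the table itself is updated.
    table[key] = rules
    return rules


# Stand-in for the real lemma_rules table (assumed layout, see above).
lemma_rules = {"verb": [["a", "ar"]]}

# Illustrative rules only; the real fix may need different suffixes.
append_rules(lemma_rules, "verb", [["ía", "iar"], ["úa", "uar"]])
print(lemma_rules["verb"])
```

With a loaded pipeline, the same kind of update would be made on the real table before the first `nlp(...)` call, because of the lemmatizer cache mentioned above.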

You are welcome to create a PR for spacy-lookups-data if you're interested, otherwise we'll put this on our to-do list.

adrianeboyd avatar Aug 08 '22 12:08 adrianeboyd

Thank you for the pointer! I looked for the place where the rules were missing without any success. I'll add them and send a PR as soon as I can.

nimbusaeta avatar Aug 08 '22 15:08 nimbusaeta

Thanks @nimbusaeta, this will be great to have!

There are a few more verb forms that behave in the same way; it would be great if you could add rules for them:

  • present subjunctive, all three singular forms and third person plural, e.g. amplíe, actúe (rules from line 1286)
  • second person singular imperative, e.g. amplía, actúa (rules from line 4066)

richardpaulhudson avatar Aug 30 '22 16:08 richardpaulhudson

PR updated! Happy to help :)

nimbusaeta avatar Aug 31 '22 18:08 nimbusaeta

Hi @nimbusaeta, thanks for your commit. I've also added the third-person plural present subjunctive forms. Could you please check you agree these are linguistically correct, and if you're happy with them I'll approve the PR.

richardpaulhudson avatar Sep 06 '22 08:09 richardpaulhudson

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions[bot] avatar Sep 16 '22 07:09 github-actions[bot]

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions[bot] avatar Oct 17 '22 00:10 github-actions[bot]