stanza Wrong multi-word token expansion in Portuguese

Describe the bug In Portuguese, verbs and object pronouns can be joined with a hyphen: vê-la (see her) -> ver (verb) + ela (pronoun). I believe the mwt processor isn't handling these cases correctly.

To Reproduce Code:

import stanza
nlp = stanza.Pipeline(lang='pt', processors='tokenize,mwt,pos,lemma')

text = 'Deixe-me vê-la.'
doc = nlp(text)

print(*[f'word: {word.text}\tupos: {word.upos}\txpos: {word.xpos}\tfeats: {word.feats if word.feats else "_"}' for sent in doc.sentences for word in sent.words], sep='\n')

Result:

=======================
| Processor | Package |
-----------------------
| tokenize  | bosque  |
| mwt       | bosque  |
| pos       | bosque  |
| lemma     | bosque  |
=======================

2023-03-08 15:49:08 INFO: Use device: cpu
2023-03-08 15:49:08 INFO: Loading: tokenize
2023-03-08 15:49:08 INFO: Loading: mwt
2023-03-08 15:49:09 INFO: Loading: pos
2023-03-08 15:49:09 INFO: Loading: lemma
2023-03-08 15:49:09 INFO: Done loading processors!
word: Deixe     upos: VERB      xpos: None      feats: Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: me        upos: PRON      xpos: None      feats: Case=Acc|Number=Sing|Person=1|PronType=Prs
word: vê        upos: VERB      xpos: None      feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: lla       upos: NOUN      xpos: None      feats: Gender=Masc|Number=Sing
word: . upos: PUNCT     xpos: None      feats: _

Expected behavior The mwt processor expands "la" to "lla", which doesn't exist in Portuguese. I'm not 100% sure what the correct word would be, but I'm pretty sure it should be "ela". The same problem shows up with other verb + pronoun combinations, such as "vê-lo" (the masculine form) or "deixe-me" (with a 1st person pronoun).
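The print statement in the reproduction only shows the expanded words, so it can help to look at the surface token next to its expansion. Below is a minimal sketch of that; the helper name multiword_expansions is made up for illustration, and it only relies on the sentences, tokens, words and text attributes of Stanza's document objects.

```python
def multiword_expansions(doc):
    """Collect (surface token, [expanded words]) pairs for every token
    that the pipeline expanded into more than one word."""
    pairs = []
    for sent in doc.sentences:
        for token in sent.tokens:
            words = [word.text for word in token.words]
            if len(words) > 1:
                pairs.append((token.text, words))
    return pairs

# Usage with the pipeline from the report (requires the 'pt' models):
# import stanza
# nlp = stanza.Pipeline(lang='pt', processors='tokenize,mwt,pos,lemma')
# for surface, words in multiword_expansions(nlp('Deixe-me vê-la.')):
#     print(f'{surface} -> {words}')
```

Tokens that were not expanded are skipped, so the listing stays short on longer texts.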

Environment (please complete the following information):

OS: Ubuntu (WSL)
Python version: 3.8
Stanza version: 1.4.1

Mar 08 '23 16:03 leonorv

There are quite a few instances of -la in the training data which get split into la (not ela, fwiw). I'm not sure of a good solution off the top of my head, but I'll think about it.

Mar 09 '23 00:03 AngledLuffa

Yeah, there are instances where -la is split to la, but interestingly I've also found cases where -la is split to -la. For example:

import stanza
nlp = stanza.Pipeline(lang='pt', processors='tokenize,mwt,pos,lemma')

text = 'Deixe-me. Lembra-me. Dá-la. Comê-la. Lembrá-la. Ver-te. Dê-ma.'
doc = nlp(text)

print(*[f'word: {word.text}\tupos: {word.upos}\txpos: {word.xpos}\tfeats: {word.feats if word.feats else "_"}' for sent in doc.sentences for word in sent.words], sep='\n')

Outputs:

word: Deixe     upos: VERB      xpos: None      feats: Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: me        upos: PRON      xpos: None      feats: Case=Acc|Number=Sing|Person=1|PronType=Prs
word: . upos: PUNCT     xpos: None      feats: _
word: Lembra    upos: VERB      xpos: None      feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: me        upos: PRON      xpos: None      feats: Case=Acc|Number=Sing|Person=1|PronType=Prs
word: . upos: PUNCT     xpos: None      feats: _
word: Dá        upos: VERB      xpos: None      feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: -la       upos: PRON      xpos: None      feats: Case=Acc|Gender=Fem|Number=Sing|Person=3|PronType=Prs
word: . upos: PUNCT     xpos: None      feats: _
word: Comê      upos: VERB      xpos: None      feats: VerbForm=Inf
word: la        upos: PRON      xpos: None      feats: Case=Acc|Gender=Fem|Number=Sing|Person=3|PronType=Prs
word: . upos: PUNCT     xpos: None      feats: _
word: Lembrá    upos: VERB      xpos: None      feats: VerbForm=Inf
word: la        upos: PRON      xpos: None      feats: Case=Acc|Gender=Fem|Number=Sing|Person=3|PronType=Prs
word: . upos: PUNCT     xpos: None      feats: _
word: Ver       upos: VERB      xpos: None      feats: VerbForm=Inf
word: te        upos: PRON      xpos: None      feats: Case=Acc|Number=Sing|Person=1|PronType=Prs
word: . upos: PUNCT     xpos: None      feats: _
word: Dê        upos: VERB      xpos: None      feats: Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: -ma       upos: PROPN     xpos: None      feats: Gender=Masc|Number=Sing
word: . upos: PUNCT     xpos: None      feats: _

The 'Dê-ma' case might be tricky because it means 'Give her to me', and I'm not even sure how the tokens should be split.
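The two failure modes seen so far (letters changing, as in lla, and the hyphen staying on the pronoun, as in -la) could be caught with a rough check like the one below. This is only a sketch: check_clitic_expansion is a made-up helper, and it assumes the convention mentioned above where the pronoun keeps its surface form (la, not ela). Under that assumption it would also flag a split of 'Dê-ma' into me + a, so it is a heuristic for spotting suspicious output, not a definition of the correct split.

```python
def check_clitic_expansion(surface, words):
    """Return a list of problems found in one hyphenated token's expansion.

    Heuristic only: assumes the expanded words should spell out the
    surface token once hyphens are removed.
    """
    problems = []
    if "-" not in surface:
        return problems  # only hyphenated verb + pronoun tokens are checked
    if any(w.startswith("-") for w in words):
        problems.append("hyphen kept on a word")
    joined = "".join(w.replace("-", "") for w in words)
    if joined != surface.replace("-", ""):
        problems.append("letters changed")
    return problems


# Expansions taken from the outputs in this thread.
for surface, words in [
    ("vê-la", ["vê", "lla"]),
    ("Dá-la", ["Dá", "-la"]),
    ("Comê-la", ["Comê", "la"]),
]:
    print(surface, words, check_clitic_expansion(surface, words))
```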

Mar 09 '23 10:03 leonorv
