Wrong multi-word token expansion in Portuguese
Describe the bug
In Portuguese, verbs and pronouns can contract: vê-la (see her) -> ver (verb) + ela (pronoun). I believe the mwt processor isn't handling these cases correctly.
To Reproduce
Code:
import stanza
nlp = stanza.Pipeline(lang='pt', processors='tokenize,mwt,pos,lemma')
text = 'Deixe-me vê-la.'
doc = nlp(text)
print(*[f'word: {word.text}\tupos: {word.upos}\txpos: {word.xpos}\tfeats: {word.feats if word.feats else "_"}' for sent in doc.sentences for word in sent.words], sep='\n')
Result:
=======================
| Processor | Package |
-----------------------
| tokenize | bosque |
| mwt | bosque |
| pos | bosque |
| lemma | bosque |
=======================
2023-03-08 15:49:08 INFO: Use device: cpu
2023-03-08 15:49:08 INFO: Loading: tokenize
2023-03-08 15:49:08 INFO: Loading: mwt
2023-03-08 15:49:09 INFO: Loading: pos
2023-03-08 15:49:09 INFO: Loading: lemma
2023-03-08 15:49:09 INFO: Done loading processors!
word: Deixe upos: VERB xpos: None feats: Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: me upos: PRON xpos: None feats: Case=Acc|Number=Sing|Person=1|PronType=Prs
word: vê upos: VERB xpos: None feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: lla upos: NOUN xpos: None feats: Gender=Masc|Number=Sing
word: . upos: PUNCT xpos: None feats: _
Expected behavior
The mwt processor expands the token "la" to "lla", which doesn't exist in Portuguese. I'm not 100% sure what the correct token would be, but I'm pretty sure it should be "ela". The same problem occurs with other verb + pronoun contractions, such as "vê-lo" (the masculine form) or "deixe-me" (with a 1st person pronoun).
Environment (please complete the following information):
- OS: Ubuntu (WSL)
- Python version: 3.8
- Stanza version: 1.4.1
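To spot expansions like "lla" without reading the output by eye, a small check along these lines might help. It is only a sketch: `suspicious_expansions` and `ACCEPTED` are hypothetical names, and the heuristic (each word of a hyphenated token should equal the matching surface piece or a listed alternative) is an assumption, not a rule stanza applies. Since it is unclear whether the pronoun should come out as "la" or "ela", both are accepted here.

```python
def suspicious_expansions(pairs, accepted=None):
    """Flag (token, words) pairs whose words do not match the surface pieces.

    `pairs` holds (surface_token, list_of_word_texts) tuples.
    `accepted` maps a lowercase surface piece to extra word forms that
    should not be flagged (an assumption made for this sketch).
    """
    accepted = accepted or {}
    flagged = []
    for token, words in pairs:
        pieces = token.split("-")
        if len(pieces) < 2:
            continue  # not hyphenated, nothing to compare
        if len(pieces) != len(words):
            flagged.append((token, words))  # piece count and word count differ
            continue
        for piece, word in zip(pieces, words):
            allowed = {piece} | set(accepted.get(piece.lower(), ()))
            if word not in allowed:
                flagged.append((token, words))
                break
    return flagged


# Hypothetical allow-list: the clitic may surface as-is or as the full pronoun.
ACCEPTED = {"la": {"ela"}, "lo": {"ele"}}

# Pairs transcribed from the output above.
observed = [("Deixe-me", ["Deixe", "me"]), ("vê-la", ["vê", "lla"])]
print(suspicious_expansions(observed, ACCEPTED))
# → [('vê-la', ['vê', 'lla'])]
```

With a loaded pipeline, the pairs could be collected as `[(t.text, [w.text for w in t.words]) for s in doc.sentences for t in s.tokens]`. Hyphenated compounds that are not multi-word tokens would also be flagged, so the result is a list of candidates to inspect rather than a list of errors.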
There are quite a few instances of "-la" in the training data which get split into "la" (not "ela", fwiw). I'm not sure of a good solution off the top of my head, but I'll think about it.
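For anyone who wants to check the treebank themselves, the counting could be sketched roughly as below. This assumes the training data is in CoNLL-U format, where a multi-word token appears as a range line (ID such as `2-3`) followed by its word lines. `count_mwt_expansions` is a hypothetical helper, and the sample fragment is hand-written for illustration, not copied from the treebank.

```python
from collections import Counter


def count_mwt_expansions(conllu_text, suffix="-la"):
    """Count the last word of every multi-word token whose form ends in `suffix`."""
    rows = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if "." in cols[0]:
            continue  # skip empty nodes such as ID "3.1"
        rows.append(cols)

    counts = Counter()
    for i, cols in enumerate(rows):
        word_id, form = cols[0], cols[1]
        if "-" not in word_id or not form.lower().endswith(suffix):
            continue  # not a range line, or not the suffix we are looking for
        start, end = (int(x) for x in word_id.split("-"))
        n_words = end - start + 1
        words = [r[1] for r in rows[i + 1 : i + 1 + n_words]]
        counts[words[-1]] += 1  # how the clitic was written after the split
    return counts


# Hand-made illustrative fragment (not real treebank data).
sample = "\n".join([
    "# text = Quero vê-la.",
    "1\tQuero\t_\tVERB\t_\t_\t_\t_\t_\t_",
    "2-3\tvê-la\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tvê\t_\tVERB\t_\t_\t_\t_\t_\t_",
    "3\tla\t_\tPRON\t_\t_\t_\t_\t_\t_",
    "4\t.\t_\tPUNCT\t_\t_\t_\t_\t_\t_",
])
print(count_mwt_expansions(sample))
# → Counter({'la': 1})
```

Running it over the full training file with different suffixes ("-lo", "-me", and so on) would show whether the splits are annotated consistently.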
Yeah, there are instances where "-la" is split to "la", but interestingly I've also found cases where "-la" is split to "-la".
For example:
import stanza
nlp = stanza.Pipeline(lang='pt', processors='tokenize,mwt,pos,lemma')
text = 'Deixe-me. Lembra-me. Dá-la. Comê-la. Lembrá-la. Ver-te. Dê-ma.'
doc = nlp(text)
print(*[f'word: {word.text}\tupos: {word.upos}\txpos: {word.xpos}\tfeats: {word.feats if word.feats else "_"}' for sent in doc.sentences for word in sent.words], sep='\n')
Outputs:
word: Deixe upos: VERB xpos: None feats: Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: me upos: PRON xpos: None feats: Case=Acc|Number=Sing|Person=1|PronType=Prs
word: . upos: PUNCT xpos: None feats: _
word: Lembra upos: VERB xpos: None feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: me upos: PRON xpos: None feats: Case=Acc|Number=Sing|Person=1|PronType=Prs
word: . upos: PUNCT xpos: None feats: _
word: Dá upos: VERB xpos: None feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: -la upos: PRON xpos: None feats: Case=Acc|Gender=Fem|Number=Sing|Person=3|PronType=Prs
word: . upos: PUNCT xpos: None feats: _
word: Comê upos: VERB xpos: None feats: VerbForm=Inf
word: la upos: PRON xpos: None feats: Case=Acc|Gender=Fem|Number=Sing|Person=3|PronType=Prs
word: . upos: PUNCT xpos: None feats: _
word: Lembrá upos: VERB xpos: None feats: VerbForm=Inf
word: la upos: PRON xpos: None feats: Case=Acc|Gender=Fem|Number=Sing|Person=3|PronType=Prs
word: . upos: PUNCT xpos: None feats: _
word: Ver upos: VERB xpos: None feats: VerbForm=Inf
word: te upos: PRON xpos: None feats: Case=Acc|Number=Sing|Person=1|PronType=Prs
word: . upos: PUNCT xpos: None feats: _
word: Dê upos: VERB xpos: None feats: Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: -ma upos: PROPN xpos: None feats: Gender=Masc|Number=Sing
word: . upos: PUNCT xpos: None feats: _
The 'Dê-ma' case might be tricky because it means 'Give her to me', and I'm not even sure how the tokens should be split.
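One possible reading, offered only as a sketch: "ma" is the contraction of the pronouns "me" and "a", so "Dê-ma" could be split into three words. Whether the treebank annotates it that way is not confirmed here, and `CLITIC_CONTRACTIONS` and `split_verb_clitic` are hypothetical names covering only a handful of forms.

```python
# Contracted clitic pairs (indirect + direct object pronoun); a partial list.
CLITIC_CONTRACTIONS = {
    "mo": ("me", "o"), "ma": ("me", "a"),
    "mos": ("me", "os"), "mas": ("me", "as"),
    "to": ("te", "o"), "ta": ("te", "a"),
    "lho": ("lhe", "o"), "lha": ("lhe", "a"),
}


def split_verb_clitic(token):
    """Split 'verb-clitic' at the first hyphen, expanding known contractions."""
    verb, sep, clitic = token.partition("-")
    if not sep:
        return [token]  # no hyphen, leave the token alone
    parts = CLITIC_CONTRACTIONS.get(clitic.lower())
    if parts is None:
        return [verb, clitic]  # plain clitic such as "me" or "la"
    return [verb, *parts]


print(split_verb_clitic("Dê-ma"))     # → ['Dê', 'me', 'a']
print(split_verb_clitic("Deixe-me"))  # → ['Deixe', 'me']
```

This ignores mesoclisis and any spelling changes on the verb, so it only illustrates the shape of the split, not a full treatment.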