stanza
[QUESTION] Pronouns lemmatization in "combined" italian model
Hi everyone! I noticed that since v1.2.0 (https://github.com/stanfordnlp/stanza/releases) the English and Italian models combine different treebanks "and a custom dataset including MWT tokens". With the combined model, Italian pronouns, both inside MWTs and as single tokens, are now lemmatized differently than in the treebanks (e.g. "lo" > "lui", where ISDT, VIT etc. have "lo" > "lo"). As a result, when you try to train a new model on the UD treebanks, you get a lot of pronoun lemma discrepancies between the training corpus and the combined model. The question is: when you created the combined models, did you think about how to get around this problem?
Thank you in advance for your reply!
That is an excellent question, and the simple truth is we didn't put a ton of effort into unifying the lemmas. There was a lot more effort in EN, since we have more EN expertise here...
There are actually quite a few inconsistencies. If you have suggestions on how to resolve them, we can do so. Often, the original maintainers can update them if we post issues on the UD GitHub. Here are three different treatments of similar pronouns:
ISDT:
9-10 invitarla _ _ _ _ _ _ _ _
9 invitar invitare VERB V VerbForm=Inf 33 csubj 33:csubj _
10 la lo PRON PC Clitic=Yes|Gender=Fem|Number=Sing|Person=3|PronType=Prs 9 obj 9:obj _
VIT:
6-7 poterla _ _ _ _ _ _ _ _
6 poter potere AUX VA VerbForm=Inf 8 aux _ _
7 la la PRON PC Clitic=Yes|Gender=Fem|Number=Sing|Person=3|PronType=Prs 8 obj _ _
Handparsed MWT (https://github.com/stanfordnlp/handparsed-treebank/blob/master/italian-mwt/italian.mwt):
1-2 portarla _ _ _ _ _ _ _ _
1 portar portare VERB V Verbform=Inf 0 root _ _
2 la lui PRON PC Clitic=Yes|Person=3|PronType=Prs 1 expl _ _
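For anyone who wants to check their own training data for this kind of inconsistency, here is a minimal sketch in plain Python that groups clitic pronoun lemmas by surface form across the three snippets above. The function names (`clitic_lemmas`, `compare_lemmas`) are made up for illustration and are not part of stanza. It assumes the standard 10-column CoNLL-U layout and splits on whitespace because the excerpts in this thread are space-separated; for real `.conllu` files you would probably want to pass `sep="\t"`.

```python
from collections import defaultdict

def clitic_lemmas(conllu_text, sep=None):
    """Yield (form, lemma) for every PRON word whose FEATS include Clitic=Yes.

    sep=None splits on any whitespace (fine for the excerpts in this thread);
    use sep="\t" for real CoNLL-U files.
    """
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line between sentences, or a comment
        cols = line.split(sep)
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U row
        word_id, form, lemma, upos, feats = cols[0], cols[1], cols[2], cols[3], cols[5]
        if not word_id.isdigit():
            continue  # skip MWT ranges such as 9-10 and empty nodes such as 3.1
        if upos == "PRON" and "Clitic=Yes" in feats.split("|"):
            yield form.lower(), lemma

def compare_lemmas(treebanks, sep=None):
    """Map each clitic form to {lemma: [names of treebanks that use it]}."""
    table = defaultdict(lambda: defaultdict(list))
    for name, text in treebanks.items():
        for form, lemma in clitic_lemmas(text, sep=sep):
            table[form][lemma].append(name)
    return table

# The three excerpts quoted in this thread.
SAMPLES = {
    "ISDT": (
        "9-10 invitarla _ _ _ _ _ _ _ _\n"
        "9 invitar invitare VERB V VerbForm=Inf 33 csubj 33:csubj _\n"
        "10 la lo PRON PC Clitic=Yes|Gender=Fem|Number=Sing|Person=3|PronType=Prs 9 obj 9:obj _\n"
    ),
    "VIT": (
        "6-7 poterla _ _ _ _ _ _ _ _\n"
        "6 poter potere AUX VA VerbForm=Inf 8 aux _ _\n"
        "7 la la PRON PC Clitic=Yes|Gender=Fem|Number=Sing|Person=3|PronType=Prs 8 obj _ _\n"
    ),
    "handparsed-mwt": (
        "1-2 portarla _ _ _ _ _ _ _ _\n"
        "1 portar portare VERB V Verbform=Inf 0 root _ _\n"
        "2 la lui PRON PC Clitic=Yes|Person=3|PronType=Prs 1 expl _ _\n"
    ),
}

table = compare_lemmas(SAMPLES)
for form, lemmas in table.items():
    if len(lemmas) > 1:  # the same form has more than one lemma
        print(form, dict(lemmas))
```

Running the same comparison over the full train files of each treebank should give a rough list of the forms whose lemmas still disagree.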
I'm sorry, I usually get notifications by mail when someone replies here, but this time I didn't... I'll discuss with my research team and we will be very happy to make suggestions. Thanks!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
To answer this question, the original maintainers of the treebanks have been working on unifying the lemmas of the clitics. We'll include updated models shortly.
The clitics in VIT, ISDT, and the extra MWT dataset have been unified, thanks to help from the maintainers of those datasets. This will be in the upcoming UD 2.10 release, but the data is available from github already, and I updated our models as part of the 1.4.0 release.
I actually haven't checked if there were the same changes to TWITTIRO and PoSTWITA, and I haven't checked if this addresses everything, so I'll leave this open for now until those are verified.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Thanks for the nudge, stalebot. VIT, ISDT, and the fake MWT file have been updated to have standardized lemmas as of UD 2.10. There are a few oddities in TWITTIRO and PoSTWITA where I filed issues on those datasets.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Perhaps stalebot needs to be told data issues don't go stale. Or I could just keep treating it as "reminderbot" instead. PoSTWITA and TWITTIRO had minor changes applied, but I'm kinda lazy^H^H^H^H busy to update the models before the next UD release in November.
At long last, the Italian models in the dev branch use updated lemmas for the clitics from all of the treebanks and the artificial MWT dataset. That version can be accessed by using the dev branch of stanza, but the version available from 1.4.1 uses 99% of the same lemmas, so it probably isn't a big issue at this point.
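To see which lemmas an installed version actually produces, something like the following sketch could be used. It relies only on the usual stanza calls (`stanza.download`, `stanza.Pipeline`, and the `sentences` / `words` / `lemma` / `upos` attributes of the result); the helper names and the example sentence are made up for illustration, and the output will depend on the stanza and model version you have installed.

```python
def build_pipeline():
    """Build an Italian pipeline with MWT expansion and lemmatization.

    Requires `pip install stanza`; downloads the default Italian models
    on first use.
    """
    import stanza  # imported here so the helper below works without stanza
    stanza.download("it")
    return stanza.Pipeline("it", processors="tokenize,mwt,pos,lemma")

def pronoun_lemmas(doc):
    """Return (text, lemma) for every word tagged PRON in a stanza Document."""
    return [
        (word.text, word.lemma)
        for sentence in doc.sentences
        for word in sentence.words
        if word.upos == "PRON"
    ]

def main():
    nlp = build_pipeline()
    # Hypothetical example sentence containing a clitic inside an MWT.
    doc = nlp("Voglio invitarla a cena.")
    print(pronoun_lemmas(doc))
```

Calling `main()` prints the pronoun forms with the lemmas assigned by the installed models, which can then be compared against the lemmas in the treebank you plan to train on.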