[QUESTION] Pronoun lemmatization in "combined" Italian model

Open manuelfavaro90 opened this issue 3 years ago • 8 comments

Hi everyone! I noticed that since v1.2.0 (https://github.com/stanfordnlp/stanza/releases) the English and Italian models combine different treebanks "and a custom dataset including MWT tokens". With the combined model, Italian pronouns, both MWT and single tokens, are now lemmatized differently than in the treebanks (e.g. "lo" > "lui", where ISDT, VIT etc. have "lo" > "lo"). As a result, when you train a new model on the UD treebanks, you get a lot of pronoun lemma discrepancies between the training corpus and the combined model. The question is: when you created the combined models, did you think about how to get around this problem?

Thank you in advance for your reply!

manuelfavaro90 avatar Feb 11 '22 10:02 manuelfavaro90

That is an excellent question, and the simple truth is we didn't put a ton of effort into unifying the lemmas. There was a lot more effort in EN, since we have more EN expertise here...

There are actually quite a few inconsistencies. If you have suggestions on how to resolve them, we can do so. Often, the original maintainers can update them if we post issues on the UD github. Here are three different treatments of similar pronouns:

ISDT:

9-10    invitarla       _       _       _       _       _       _       _       _
9       invitar invitare        VERB    V       VerbForm=Inf    33      csubj   33:csubj        _
10      la      lo      PRON    PC      Clitic=Yes|Gender=Fem|Number=Sing|Person=3|PronType=Prs 9       obj     9:obj   _

VIT:

6-7     poterla _       _       _       _       _       _       _       _
6       poter   potere  AUX     VA      VerbForm=Inf    8       aux     _       _
7       la      la      PRON    PC      Clitic=Yes|Gender=Fem|Number=Sing|Person=3|PronType=Prs 8       obj     _       _

Handparsed MWT (https://github.com/stanfordnlp/handparsed-treebank/blob/master/italian-mwt/italian.mwt):

1-2     portarla        _       _       _       _       _       _       _       _
1       portar  portare VERB    V       Verbform=Inf    0       root    _       _
2       la      lui     PRON    PC      Clitic=Yes|Person=3|PronType=Prs        1       expl    _       _
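A minimal sketch of how such inconsistencies could be surfaced automatically, assuming standard 10-column CoNLL-U input. The function name `clitic_lemmas` and the sample constants are hypothetical and not part of Stanza or the UD tooling; the sample lines are copied from the three excerpts above.

```python
from collections import defaultdict

# Token lines copied from the three excerpts above. Real CoNLL-U files are
# tab-separated; whitespace is used here only for readability.
ISDT = """
9-10 invitarla _ _ _ _ _ _ _ _
9 invitar invitare VERB V VerbForm=Inf 33 csubj 33:csubj _
10 la lo PRON PC Clitic=Yes|Gender=Fem|Number=Sing|Person=3|PronType=Prs 9 obj 9:obj _
"""

VIT = """
6-7 poterla _ _ _ _ _ _ _ _
6 poter potere AUX VA VerbForm=Inf 8 aux _ _
7 la la PRON PC Clitic=Yes|Gender=Fem|Number=Sing|Person=3|PronType=Prs 8 obj _ _
"""

HANDPARSED = """
1-2 portarla _ _ _ _ _ _ _ _
1 portar portare VERB V Verbform=Inf 0 root _ _
2 la lui PRON PC Clitic=Yes|Person=3|PronType=Prs 1 expl _ _
"""


def clitic_lemmas(conllu_text):
    """Map each clitic pronoun form to the set of lemmas assigned to it."""
    lemmas = defaultdict(set)
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        # Prefer tabs (the CoNLL-U standard); fall back to any whitespace.
        cols = line.split("\t") if "\t" in line else line.split()
        if len(cols) != 10:
            continue  # not a well-formed token line
        tok_id, form, lemma, upos, feats = cols[0], cols[1], cols[2], cols[3], cols[5]
        if "-" in tok_id or "." in tok_id:
            continue  # skip MWT range lines (e.g. 9-10) and empty nodes
        if upos == "PRON" and "Clitic=Yes" in feats.split("|"):
            lemmas[form.lower()].add(lemma)
    return dict(lemmas)


for name, sample in (("ISDT", ISDT), ("VIT", VIT), ("handparsed MWT", HANDPARSED)):
    print(name, clitic_lemmas(sample))
```

Running the same function over each full treebank file and comparing the resulting sets per form would show where the datasets disagree, such as the three different lemmas for "la" here.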

AngledLuffa avatar Feb 12 '22 01:02 AngledLuffa

I'm sorry, I usually get notifications by mail when someone replies here, but this time I didn't... I'll discuss with my research team and we will be very happy to make suggestions. Thanks!

manuelfavaro90 avatar Feb 17 '22 16:02 manuelfavaro90

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 18 '22 20:04 stale[bot]

To answer this question, the original maintainers of the treebanks have been working on unifying the lemmas of the clitics. We'll include updated models shortly.

AngledLuffa avatar Apr 18 '22 21:04 AngledLuffa

The clitics in VIT, ISDT, and the extra MWT dataset have been unified, thanks to help from the maintainers of those datasets. This will be in the upcoming UD 2.10 release, but the data is available from github already, and I updated our models as part of the 1.4.0 release.

I actually haven't checked if there were the same changes to TWITTIRO and PoSTWITA, and I haven't checked if this addresses everything, so I'll leave this open for now until those are verified.

AngledLuffa avatar Apr 23 '22 06:04 AngledLuffa

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 22 '22 19:06 stale[bot]

Thanks for the nudge, stalebot. VIT, ISDT, and the fake MWT file have been updated to have standardized lemmas as of UD 2.10. There are a few oddities in TWITTIRO and PoSTWITA where I filed issues on those datasets.

AngledLuffa avatar Jun 22 '22 23:06 AngledLuffa

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 31 '22 11:08 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 30 '22 14:10 stale[bot]

Perhaps stalebot needs to be told data issues don't go stale. Or I could just keep treating it as "reminderbot" instead. PoSTWITA and TWITTIRO had minor changes applied, but I'm kinda lazy^H^H^H^H busy to update the models before the next UD release in November.

AngledLuffa avatar Oct 30 '22 18:10 AngledLuffa

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 30 '22 19:12 stale[bot]

At long last, the Italian models in the dev branch use updated lemmas for the clitics from all of the treebanks and the artificial MWT dataset. That version can be accessed by using the dev branch of stanza, but the version available from 1.4.1 uses 99% of the same lemmas, so it probably isn't a big issue at this point.

AngledLuffa avatar Jan 02 '23 17:01 AngledLuffa