
More combined models?

Open amir-zeldes opened this issue 1 year ago • 29 comments

I saw the great idea for combined models here:

https://stanfordnlp.github.io/stanza/combined_models.html

Is there a process to request more of these? Specifically I was thinking of Hebrew right now.

amir-zeldes avatar Aug 25 '22 19:08 amir-zeldes

For English, we did the best we could to unify the xpos tags and features. Same with Italian. That was a bit more difficult, since no one on the project speaks Italian, but the treebank maintainers were quite helpful with it.

If the Hebrew standards are the same across different treebanks, we can mix them together and see what happens. I assume you're talking about IAHLT and HTB?

AngledLuffa avatar Aug 25 '22 19:08 AngledLuffa

There are also Hebrew NER and constituency datasets which could be added, if you want to see more Hebrew coverage in Stanza in general. Sentiment dataset here:

https://github.com/OnlpLab/Hebrew-Sentiment-Data

Overall, there's a bunch of stuff we could add if we put some effort into improving our Hebrew pipeline.

AngledLuffa avatar Aug 25 '22 20:08 AngledLuffa

I took a look at the two treebanks.

The first thing to note is that they seem to follow similar MWT splitting guidelines, although I don't read Hebrew, so I don't know how similar they actually are. Would you double-check that?
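If it helps with that check, something like this sketch could pull the MWT splits out of each treebank's .conllu file for side-by-side eyeballing (the file name below is hypothetical):

```python
from itertools import islice

def mwt_splits(conllu_path):
    """Yield (surface_form, [word_pieces]) for each multi-word token in a CoNLL-U file."""
    pending = None  # (surface form, id of last word in the range, pieces collected so far)
    with open(conllu_path, encoding="utf-8") as fin:
        for line in fin:
            cols = line.rstrip("\n").split("\t")
            if len(cols) != 10:
                continue  # comments and blank lines between sentences
            if "-" in cols[0]:  # an MWT range line such as "3-5"
                last_id = int(cols[0].split("-")[1])
                pending = (cols[1], last_id, [])
            elif pending and cols[0].isdigit():
                pending[2].append(cols[1])
                if int(cols[0]) == pending[1]:
                    yield pending[0], pending[2]
                    pending = None

# eyeball a handful of splits from one treebank
for surface, pieces in islice(mwt_splits("he_htb-ud-train.conllu"), 10):
    print(surface, "->", " + ".join(pieces))
```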

I can see the xpos tags are the same as the upos tags in both treebanks, so no questions there. It appears that roughly the same fraction of words is featurized in both sets. However, there are some differences between the two feature inventories:

Feature differences:

In IAHLT, not HTB:

Aspect=Prog
Foreign=Yes
HebBinyan=NITPAEL
NumType=Card
NumType=Ord
Poss=Yes
Tense=Pres

In HTB, not IAHLT:

Case=Tem
HebExistential=Yes
Number=Dual,Plur
Person=1,2,3
Typo=Yes
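(For reference, a diff like the one above can be reproduced with a short script over the FEATS column of the two treebanks' .conllu files; a minimal sketch, with hypothetical file names:)

```python
def collect_feats(conllu_path):
    """Collect every Feature=Value pair used in the FEATS column of a CoNLL-U file."""
    feats = set()
    with open(conllu_path, encoding="utf-8") as fin:
        for line in fin:
            cols = line.rstrip("\n").split("\t")
            # skip comments, blank lines, MWT ranges, and empty nodes
            if len(cols) != 10 or not cols[0].isdigit():
                continue
            if cols[5] != "_":
                feats.update(cols[5].split("|"))
    return feats

iahlt = collect_feats("he_iahltwiki-ud-train.conllu")
htb = collect_feats("he_htb-ud-train.conllu")
print("In IAHLT, not HTB:", sorted(iahlt - htb))
print("In HTB, not IAHLT:", sorted(htb - iahlt))
```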

If this is something you can help unify, we'll be happy to make combined models.

AngledLuffa avatar Aug 25 '22 20:08 AngledLuffa

Actually, the segmentation guidelines of the old HTB don't match IAHLT, and as you noted there are feature differences. In fact, the HTB in the UD repo hasn't been valid since 2018, so it's in legacy status and doesn't match what's in IAHLT, which is newer. However, we do have a revised version of HTB which is valid (at least as of UD 2.9) and matches the standards in IAHLTwiki pretty closely. You can find it here:

https://github.com/IAHLT/UD_Hebrew

That should allow for good joint results, combined with https://github.com/universalDependencies/UD_Hebrew-IAHLTWiki

amir-zeldes avatar Aug 25 '22 21:08 amir-zeldes

That's interesting. Is there a possibility of upgrading the UD version of HTB to the revised version? I haven't been following developments for those treebanks at all.

AngledLuffa avatar Aug 25 '22 21:08 AngledLuffa

I think the earliest that could happen would be in November, based on the guidelines here:

http://quest.ms.mff.cuni.cz/udvalidator/cgi-bin/unidep/validation-report.pl

The preamble states that after 4 years in legacy status, an invalid legacy TB will be excluded from the release, so perhaps this would be a chance to propose switching to that fork.

amir-zeldes avatar Aug 26 '22 13:08 amir-zeldes

I should be able to make this happen tomorrow.

AngledLuffa avatar Aug 31 '22 05:08 AngledLuffa

Fantastic, just let me know once it's available - it's too late for something I needed for an author response, but it could still go into a camera-ready version... :)

amir-zeldes avatar Aug 31 '22 20:08 amir-zeldes

I would have made it a bit more of a priority if I'd known there was a clock. It takes a few hours to go from zero to models, and I was using my time (and my GPU time) to make some progress on the sentiment classifier and parser.

Anyway, I just need to finish up the depparse and it will be ready. (This has to happen after POS because depparse is trained on the tagger's predicted tags.) Are there any other models which would be useful, such as NER, constituency, or sentiment?
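For reference, a rough sketch of that ordering using stanza's training entry points (the combined treebank shortname below is a hypothetical placeholder):

```python
import subprocess

# depparse comes last because it is trained on the POS model's predicted tags;
# "UD_Hebrew-combined" is a hypothetical shortname for the merged treebank
for step in ("run_tokenizer", "run_mwt", "run_pos", "run_lemma", "run_depparse"):
    subprocess.run(
        ["python", "-m", f"stanza.utils.training.{step}", "UD_Hebrew-combined"],
        check=True,
    )
```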

AngledLuffa avatar Sep 01 '22 02:09 AngledLuffa

OK, if you use the stanza dev branch, it should be available now.
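For anyone trying it out, loading the new models should look something like this minimal sketch (assuming the combined model ships as the default Hebrew package):

```python
import stanza

# with the dev branch installed, e.g.
#   pip install git+https://github.com/stanfordnlp/stanza
stanza.download("he")  # fetch the current default Hebrew models
nlp = stanza.Pipeline("he", processors="tokenize,mwt,pos,lemma,depparse")
doc = nlp("...")  # replace with Hebrew text
print(doc)
```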

I'm gonna make a new release soon with it as the default - just need to retrain a few models first.

AngledLuffa avatar Sep 01 '22 05:09 AngledLuffa

Oh wow, thank you! And no worries about the clock, this was super fast! We can include numbers from this model in our upcoming paper now.

@ivrit @yifatbm @nlhowell @AvnerAlgom - note this means that there will be a joint wiki+HTB Hebrew Stanza model using the new tokenization out of the box

amir-zeldes avatar Sep 01 '22 14:09 amir-zeldes

This is part of stanza as of 1.4.1

Do you plan on taking over maintenance of the original HTB (or at least requesting it) when UD 2.11 comes out? Currently this is kind of ad hoc, since a couple of different items need to be downloaded before building the models.

AngledLuffa avatar Sep 14 '22 19:09 AngledLuffa

This would depend on how the community and previous maintainers feel, but if they don't have the resources to maintain the older fork then yes, I would be willing to maintain the newer one. Having it consistent with the new Wiki-based corpus would be a big plus for Hebrew NLP, and there are more corpora coming for Hebrew in the same scheme.

amir-zeldes avatar Sep 15 '22 15:09 amir-zeldes

Gotcha. Perhaps a meaty PR with the updates would at least get the other repo to use the same annotation scheme. Either way, it would make it easier to do things like the combined models.

AngledLuffa avatar Sep 15 '22 15:09 AngledLuffa

Agreed - I'll bring it up in the countdown to the Nov. release.

amir-zeldes avatar Sep 15 '22 19:09 amir-zeldes

@amir-zeldes did you wind up taking over the other Hebrew dataset, or otherwise having it updated to the better tagging scheme?

Also, do you have anything else that would be useful for your usage of Stanza? An NER model, for example, or perhaps a constituency parser built out of SPMRL?

AngledLuffa avatar Feb 28 '23 02:02 AngledLuffa

@AngledLuffa There is an updated HTB, but it's not perfect. I think Amir has it over at his repository. Also, we at IAHLT want to release a fully open version of the new Hebrew dataset with NE annotations. Currently only about 80% of it has undergone two rounds of QA. I'm happy to send it over for training, if that helps.

ivrit avatar Feb 28 '23 08:02 ivrit

I can wait for it to be released. This thread is one of the first times people have mentioned using the Hebrew models. Thanks!

AngledLuffa avatar Feb 28 '23 09:02 AngledLuffa

Hi @AngledLuffa, no, so far the HTB maintainers have opted to leave it as is, but if that changes I can let you know. My own fork, with tokenization etc. matching the Wiki corpus, is still available here and is valid according to the UD validator (or was until recently - it's a moving target...). I can also keep updating it, but maybe I should wait and see what happens with HTB in the next UD release. In any case, you should be able to get a good combined model going using that repo + the official UD IAHLTwiki - in fact, we have published scores for that using Stanza in this paper.

Adding NER support would be fantastic, but as Noam mentioned, the data is not yet publicly available. A portion of it covers the entire IAHLTwiki corpus, so my plan is to merge the NER annotations into that using the same format as English GUM (i.e. the CorefUD/Universal Anaphora conllu format).

amir-zeldes avatar Feb 28 '23 15:02 amir-zeldes

I don't suppose constituency trees are in the plans for the new resource? I may be one of the last people to care about that 🤷

> Adding NER support would be fantastic

There's other Hebrew NER data out there. For example: https://github.com/OnlpLab/NEMO-Corpus

AngledLuffa avatar Feb 28 '23 17:02 AngledLuffa

There is a Hebrew constituency TB over the same material as HTB, but it is very old, and I wouldn't count on the tokenization matching either of the UD versions.

The NEMO data is the same text as the old-tokenization HTB (it all comes from the same ~1990 Ha'aretz newspaper data), though I think it also includes a version which stretches the annotations to the nearest MWT, and it could probably be projected somehow onto either corpus. But the IAHLT standard is different and IMO much better, so I would wait for that (NEMO has some very strange practices, such as excluding 'of' PP modifiers: in an ORG or PER like "the State Department/Foreign Minister of Canada", it will always leave out "of Canada", making it much less useful for applications).

amir-zeldes avatar Feb 28 '23 21:02 amir-zeldes

Gotcha, thanks for the insight

AngledLuffa avatar Feb 28 '23 22:02 AngledLuffa

Did you ever make any progress unifying the different HE treebanks under the UD umbrella? It would make rebuilding the models simpler in the long run.

Also, LMK if there's a need for NER and a dataset I should use

AngledLuffa avatar Jun 22 '24 22:06 AngledLuffa

@AngledLuffa we still haven't synced the two UD treebanks, but we have a new (soon to be publicly available) Hebrew UD+NER dataset. I'll send it to you via email until Amir pushes it in the next UD release.

ivrit avatar Jun 26 '24 16:06 ivrit

Great, thanks! What should I do regarding this dataset and the existing UD datasets? Currently the default Hebrew models for Stanza are built from "UD_Hebrew-IAHLTwiki" and the github.com:IAHLT/UD_Hebrew.git repo. If I add the new data you just sent me, does that overlap either of those sources, or should I use all three?

AngledLuffa avatar Jun 26 '24 17:06 AngledLuffa

OK, so in the compressed file I sent you there are an additional 4.7k annotated sentences, basically taken from here, only we (read: Amir) did some extra cleaning and separated them into train/dev/test splits. So it would be nice to see whether a combined model yields considerably better results.

ivrit avatar Jun 26 '24 17:06 ivrit

Sounds good. So basically all three should be disjoint, and I should train with all three and report the results on the various dev & test sets?

And eventually the new data will be integrated with UD, but the git repo I just linked to is not expected to be part of UD any time soon?

AngledLuffa avatar Jun 26 '24 18:06 AngledLuffa

Wait... I went back to take a look, and your message said "superset". So one of the other two datasets is also part of what you sent me?

AngledLuffa avatar Jun 27 '24 00:06 AngledLuffa

I think it would be simpler if I just add the disjoint new dataset here. Please find attached the new IAHLTKnesset data with the splits. If I were you, I would train on both UD_Hebrew-IAHLTwiki and the attached data, as the schema is exactly the same. It would be great if you could combine it with other datasets and schemas, but that would require more work. UD_Hebrew-IAHLTknesset.zip
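For what it's worth, merging the disjoint train splits can be as simple as concatenating the .conllu files; a minimal sketch, with hypothetical file names:

```python
import shutil

train_files = [
    "he_iahltwiki-ud-train.conllu",
    "he_iahltknesset-ud-train.conllu",
]
# CoNLL-U files can be concatenated directly, as long as sentence blocks
# stay separated by blank lines
with open("he_combined-ud-train.conllu", "w", encoding="utf-8") as fout:
    for path in train_files:
        with open(path, encoding="utf-8") as fin:
            shutil.copyfileobj(fin, fout)
        fout.write("\n")
```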

ivrit avatar Jun 27 '24 08:06 ivrit

Overall, I think the results are promising for using the new Hebrew dataset alongside the other two datasets. Tokenization, MWT, and lemma scores are all in the same general range. As an example of how the new data allows for broader coverage, here are some POS and depparse results.

For POS, the scores are about the same on the original dev & test sets (the IAHLTwiki dataset), but the coverage is clearly better on Knesset:

| model / eval set             |  UPOS |  XPOS | UFeats | AllTags |
|------------------------------|-------|-------|--------|---------|
| orig model, dev              | 97.36 | 97.36 |  93.35 |   92.39 |
| orig model, test             | 97.39 | 97.39 |  92.03 |   91.31 |
| new model, dev               | 97.39 | 97.32 |  93.32 |   92.28 |
| new model, test              | 97.41 | 97.42 |  92.04 |   91.20 |
| orig model, new dataset test | 96.63 | 95.85 |  82.93 |   80.53 |
| new model, new dataset test  | 97.47 | 96.90 |  92.78 |   90.33 |

For depparse, I would again say the scores are similar (maybe a bit of a dip from adding the new data), but the coverage on the new dataset's test set is clearly better.

| model / eval set             |   UAS |   LAS |  CLAS |  MLAS |  BLEX |
|------------------------------|-------|-------|-------|-------|-------|
| orig model, dev              | 94.25 | 92.20 | 88.99 | 88.39 | 88.99 |
| orig model, test             | 94.01 | 91.65 | 88.31 | 87.37 | 88.31 |
| new model, dev               | 94.18 | 92.22 | 89.14 | 88.50 | 89.14 |
| new model, test              | 94.02 | 91.56 | 87.88 | 87.06 | 87.88 |
| orig model, new dataset test | 89.68 | 86.46 | 82.00 | 81.03 | 82.00 |
| new model, new dataset test  | 91.99 | 89.57 | 85.79 | 85.16 | 85.79 |

I haven't used the dev set from the new dataset in any way. I'm not sure whether it would make more sense to put all three dev sets together, keep using just one dev set as I currently do, or perhaps even use the dev sets from the datasets which aren't the primary scoring metric as additional training data.
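(For context, a sketch of how per-set scores like the above can be produced, scoring predictions against each gold file with the CoNLL 2018 evaluator bundled in stanza; the module path and file names are assumptions:)

```python
import subprocess

eval_pairs = [
    ("he_iahltwiki-ud-dev.conllu", "pred_wiki_dev.conllu"),
    ("he_iahltwiki-ud-test.conllu", "pred_wiki_test.conllu"),
    ("he_iahltknesset-ud-test.conllu", "pred_knesset_test.conllu"),
]
# the evaluator follows the CoNLL 2018 shared task script: gold file first, then system file
for gold, pred in eval_pairs:
    subprocess.run(
        ["python", "-m", "stanza.utils.conll18_ud_eval", "-v", gold, pred],
        check=True,
    )
```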

At any rate, what do you think? Make the models with the third dataset the default HE models?

In terms of availability of this dataset, is it going to be part of UD 2.15? That would make it easier to maintain the models going forward. Even better would be if the IAHLT standard for the older HE dataset becomes part of UD somehow.

AngledLuffa avatar Jul 06 '24 07:07 AngledLuffa