
PROIEL parser exhibits odd behaviour with respect to punctuation

pseudomonas opened this issue 2 years ago · 45 comments

Describe the bug

If there is a comma in the parsed sentence, the PROIEL model:

a) it does not tokenize the comma; it just bundles it with the preceding word (the lemma is affected similarly);

b) if the comma is space-delimited, it does unpredictable (to me!) things, up to and including tagging it as a verb with a lemma of ὁράω.

The fullstop/period is correctly tokenized, but is still never identified as punctuation. The PROIEL model does not seem to emit any POS tag corresponding to punctuation; the full list of UPOS tags seen when parsing a corpus is ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PRON, PROPN, SCONJ, VERB.
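(For reference, the tag inventory was collected along these lines; a sketch, with a hypothetical file path:)

import stanza

proiel = stanza.Pipeline('grc', processors='tokenize,pos', package='proiel')
corpus_text = open('nestle1904.txt').read()  # hypothetical path to the corpus
doc = proiel(corpus_text)
# collect the distinct UPOS tags the model emits over the whole corpus
print(sorted({word.upos for sent in doc.sentences for word in sent.words}))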

To Reproduce

import stanza
perseus = stanza.Pipeline('grc', processors='tokenize,pos,lemma', package="perseus")
proiel = stanza.Pipeline('grc', processors='tokenize,pos,lemma', package="proiel")

sent = "Ἐν ἀρχῇ ἦν ὁ Λόγος, καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν, καὶ Θεὸς ἦν ὁ Λόγος." # John 1:1, Nestlé 1904 edition of the New Testament

print(perseus(sent))
# Correct output (only relevant tokens shown here)
# {
#     "id": 6,
#     "text": ",",
#     "lemma": ",",
#     "upos": "PUNCT",
#     "xpos": "u--------",
#     "start_char": 18,
#     "end_char": 19
# }

# [...]

# {
#     "id": 20,
#     "text": ".",
#     "lemma": ".",
#     "upos": "PUNCT",
#     "xpos": "u--------",
#     "start_char": 69,
#     "end_char": 70
# }


print(proiel(sent))
# Comma not separated from the preceding word
# {
#     "id": 5,
#     "text": "Λόγος,",
#     "lemma": "Λόγος,",
#     "upos": "PROPN",
#     "xpos": "Ne",
#     "feats": "Case=Nom|Gender=Masc|Number=Sing",
#     "start_char": 13,
#     "end_char": 19
# }

# Fullstop parsed as an adverb
# {
#     "id": 6,
#     "text": ".",
#     "lemma": ".",
#     "upos": "ADV",
#     "xpos": "Df",
#     "start_char": 69,
#     "end_char": 70
# }

sent_with_space_before_comma = "Ἐν ἀρχῇ ἦν ὁ Λόγος , καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν , καὶ Θεὸς ἦν ὁ Λόγος."
print(proiel(sent_with_space_before_comma))

# Comma is now a token by itself, but is not identified as punctuation.
# {
#     "id": 6,
#     "text": ",",
#     "lemma": "ἤ",
#     "upos": "CCONJ",
#     "xpos": "C-",
#     "start_char": 19,
#     "end_char": 20
# },

# The second comma is also wrong, but different.
# {
#     "id": 1,
#     "text": ",",
#     "lemma": "ὁ",
#     "upos": "NOUN",
#     "xpos": "Nb",
#     "feats": "Case=Voc|Gender=Masc|Number=Sing",
#     "start_char": 50,
#     "end_char": 51
# },

Below are results for those commas that were somehow parsed individually when parsing the text of the Nestlé 1904 edition of the New Testament. They have been passed through sort -u to deduplicate them.
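(The table was presumably generated with something along these lines; a reconstruction rather than the actual script, reusing the proiel pipeline and corpus_text from the sketches above:)

rows = set()
doc = proiel(corpus_text)
for sent in doc.sentences:
    for word in sent.words:
        if word.text in (',', '.'):
            # str() because lemma/feats can be None
            rows.add((word.text, str(word.lemma), str(word.upos), str(word.feats)))
for row in sorted(rows):  # sorted set stands in for `sort -u`
    print('\t'.join(row))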

## Commas

Text	Lemma	POS	Features
,	ἤ	INTJ	None
,	Ἤ	PROPN	None
,	Ἤ	PROPN	Number=Sing
,	ὁ	NOUN	Case=Acc|Gender=Masc|Number=Sing
,	ὁ	NOUN	Case=Dat|Gender=Fem|Number=Plur
,	ὁ	NOUN	Case=Dat|Gender=Fem|Number=Sing
,	ὁ	NOUN	Case=Dat|Gender=Masc|Number=Sing
,	ὁ	NOUN	Case=Gen|Gender=Fem|Number=Sing
,	ὁ	NOUN	Case=Gen|Gender=Masc|Number=Sing
,	ὁ	NOUN	Case=Nom|Gender=Fem|Number=Plur
,	ὁ	NOUN	Case=Nom|Gender=Fem|Number=Sing
,	ὁ	NOUN	Case=Nom|Gender=Masc|Number=Sing
,	ὁ	NOUN	Case=Voc
,	ὁ	NOUN	Case=Voc|Gender=Fem|Number=Sing
,	ὁ	NOUN	Case=Voc|Gender=Masc|Number=Sing
,	ὁ	NOUN	Case=Voc|Number=Sing
,	ὁ	NOUN	Gender=Fem|Number=Sing
,	ὁ	NOUN	Gender=Masc|Number=Sing
,	ὁ	NOUN	None
,	ὁ	NOUN	Number=Sing
,	ὁράω	VERB	Aspect=Perf|Mood=Imp|Number=Sing|Person=2|Tense=Past|VerbForm=Fin|Voice=Act
,	ὁράω	VERB	Aspect=Perf|Mood=Ind|Number=Sing|Person=2|Tense=Past|VerbForm=Fin|Voice=Act
,	ὁράω	VERB	Mood=Imp|Number=Sing|Person=2|Tense=Past|VerbForm=Fin|Voice=Act
,	ὁράω	VERB	Mood=Imp|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin|Voice=Act
,	ὁράω	VERB	Mood=Ind|Number=Sing|Person=2|Tense=Past|VerbForm=Fin|Voice=Act
,	ὁράω	VERB	Number=Plur|Tense=Pres|VerbForm=Fin|Voice=Act
,	ὅς	ADJ	Case=Dat|Degree=Pos|Gender=Masc|Number=Sing
,	ὅς	PRON	Case=Dat|Gender=Masc|Number=Sing|Person=1|PronType=Prs
,	ὅς	PRON	Case=Dat|Gender=Masc|Number=Sing|Person=2|PronType=Prs

## Fullstops

Text	Lemma	POS	Features
.	.	ADV	None
.	ἤ	SCONJ	None
.	ὁ	PRON	Case=Dat|Gender=Masc|Number=Plur|PronType=Prs
.	ὁ	PRON	Case=Dat|Gender=Masc|Number=Sing|Person=2|PronType=Prs
.	ὁ	PRON	Case=Dat|Gender=Masc|Number=Sing|Person=3|PronType=Prs
.	ὁ	PRON	Case=Dat|Gender=Masc|Number=Sing|PronType=Prs
.	ὁ	PRON	Case=Dat|Gender=Masc|Number=Sing|PronType=Rel

Expected behavior

As with Perseus: commas should be tokenized separately from the preceding word, and both commas and fullstops should be annotated as punctuation.

Environment (please complete the following information):

  • OS: Ubuntu 22.04
  • Python version: 3.10.9 installed via miniconda
  • Stanza version: 1.6.1

Additional context

Running under Jupyter within PyCharm

pseudomonas commented Nov 27 '23

Certainly this sucks, but the problem here is with the training data, and I'm not sure how we can fix it. The PROIEL dataset has zero (!) instances of either commas or periods.

One thing I just found is that the Perseus dataset has commas and a period analog (apparently the ano teleia, ·) which sits halfway up the line of text compared to a US period. For example, the first sentence looks like

# text = ἐρᾷ μὲν ἁγνὸς οὐρανὸς τρῶσαι χθόνα, ἔρως δὲ γαῖαν λαμβάνει γάμου τυχεῖν·

It would appear the XPOS tags are not remotely similar, but perhaps you could take a look to see if the general annotation quality is comparable. Are the tokenization, lemmatization, and dependency standards the same? We could probably mix the two if they are, or maybe you'd just get better results from switching to Perseus.

AngledLuffa commented Nov 27 '23

> The PROIEL dataset has zero (!) instances of either commas or periods.

I had indeed wondered whether that was the case. It's odd, given that PROIEL includes biblical edition text.

I haven't looked into how feasible it would be to interconvert the treebanks and train on a mixture of both sources, or to use one of the sources for a pre-training task but not a fine-tuning task, assuming that the stanza models behave like other language models in this regard. So far I've used them as black-box algorithms.

pseudomonas commented Nov 27 '23

I am indeed now using Perseus — but especially since PROIEL is the default package in stanza for Ancient Greek, I thought this was worth noting.

pseudomonas commented Nov 27 '23

@AngledLuffa Reading the docs at https://stanfordnlp.github.io/stanza/new_language.html it looks like unlabelled text is only good for improving NER/Sentiment/Constituency parsing and not for any of the tasks I'm using (tokenize, lemma, POS, depparse). Is that actually the case?

pseudomonas commented Nov 27 '23

I would say that if the other annotations follow similar formalisms, they would wind up benefitting the model by giving it more words it knows about and/or more examples of unusual phenomena.

The small things I need to do in a short amount of time are kinda adding up, but long term I do think switching the default to Perseus and then exploring using data from both to make a "combined" model is probably the best approach here.

AngledLuffa commented Nov 27 '23

I feel like in the long run it would be nice to be able to put a standard-architecture language model in there and have the stanza training script do the fine-tuning on that. I'm thinking especially of things like dbamman/latin-bert here (Latin is also a language that I need to support).

pseudomonas commented Nov 27 '23

We actually do exactly that for some languages with the default_accurate package, although the transformers didn't fit into the tokenizer or lemmatizer architectures easily. I even found a transformer for Ancient Greek:

https://huggingface.co/pranaydeeps/Ancient-Greek-BERT
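(It loads with the standard transformers AutoModel API, if anyone wants to poke at it; a minimal, untested sketch:)

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pranaydeeps/Ancient-Greek-BERT")
model = AutoModel.from_pretrained("pranaydeeps/Ancient-Greek-BERT")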

(feel free to suggest other options)

If you want, I can give that a try with Ancient Greek, but again, I'm up to my ears in small things that need doing and can't really commit to doing it for a few weeks.

AngledLuffa commented Nov 27 '23

other options:

https://huggingface.co/lgessler/microbert-ancient-greek-m https://huggingface.co/lgessler/microbert-ancient-greek-mx https://huggingface.co/lgessler/microbert-ancient-greek-mxp

https://huggingface.co/altsoph/bert-base-ancientgreek-uncased

These two have no description in the model card, which is kinda sus:

https://huggingface.co/niksss/Ancient-Greek-BERT-finetuned-wikitext2 https://huggingface.co/Sonnenblume/bert-base-uncased-ancient-greek-v4

AngledLuffa commented Nov 27 '23

Well, I could give it a whirl if you can point me at docs on how to do the fine-tuning and plumbing it into the system; this is stuff I need for work so I feel like I should at least try to contribute!

I'm aware of the microbert models; they're nice and fast to train (and they're what I'm working on using for Coptic), so if they work, this would be generally applicable to most of the stanza languages.

pseudomonas commented Nov 27 '23

Basically you just need to go through the retraining instructions with the flags --use_bert --bert_model ...
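For concreteness, an invocation might look like the following (a sketch based on the retraining docs, not a tested command; the treebank and transformer are illustrative choices):

python -m stanza.utils.training.run_pos UD_Ancient_Greek-Perseus \
    --use_bert --bert_model pranaydeeps/Ancient-Greek-BERT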

I actually found in some limited experiments that finetuning the transformer itself for POS didn't help, given the complexity of the inference head we use. We've had some recent success finetuning for constituency parsing or coref with LoRA, or with careful experimentation on the finetuning method. However, the calendar for expanding that to other models is "after I get out from under this crushing TODO list" or "after I can scam an undergrad @Jemoka into doing it".

AngledLuffa commented Nov 28 '23

Hello, I am that undergrad and I'd love to look into it this weekend. @pseudomonas, @AngledLuffa, do you think I can be more helpful starting with:

  1. trying to create a "combined" model with the old architecture, combining both datasets, which should give very good performance, but we won't get a BERT out of the system? or
  2. exploring the transformer BERT embedding situation, applying our LoRA + coref work to Greek, and trying to fine-tune a transformer on downstream Greek tasks?

As @AngledLuffa said, BERT support is pretty good, but I don't think it has been done for this area yet. Though, if one of the two packages works, perhaps it would be more interesting to look into training/LoRAing a transformer on the task instead of getting a better model simply by combining the two datasets.

Jemoka commented Nov 28 '23

@Jemoka I was thinking that refactoring the usage of PEFT and giving it a try on POS or depparse would be both interesting and useful, especially once we wrap up the coref usage of PEFT.

Certainly as a baseline, switching to Perseus and experimenting with a few of the above models to see which works best would give a better model for short-term usage.

AngledLuffa commented Nov 28 '23

> do you think I can be more helpful starting with

Combining the treebanks seems like it will provide benefits, if it can be done; and a BERT can presumably be added on top of that at a later point. But I don't know how compatible the annotation guidelines of the two projects are.

pseudomonas commented Nov 28 '23

@Jemoka I think in terms of improving performance longer-term across Stanza, being able to leverage BERT-integration would be good. I'm probably going to try @AngledLuffa's suggestion https://github.com/stanfordnlp/stanza/issues/1311#issuecomment-1828961531 in any case. I'm not sure how this corresponds (either in terms of performance or in terms of mechanism) to fine-tuning a BERT to perform the task directly.

pseudomonas commented Nov 28 '23

Sounds good. @pseudomonas, feel free to start with the BERT work there, and I can start on the PEFT-a-large-model end that @AngledLuffa mentioned, doing Greek POS first as a test case. Hopefully you can end up with a good model in the short term, and we can release an adapter that performs even better in the long term.

LMK if you run into anything with the BERT tuning.

Jemoka commented Nov 28 '23

@AngledLuffa if I'm training a model and the training is interrupted, what are the command-line flags for "resume training starting from this saved checkpoint"?

pseudomonas commented Dec 01 '23

If it's giving you the message that the model already exists, you can overwrite the existing model with --force. I haven't added resuming from a checkpoint for POS because it only takes a couple hours to retrain the whole model anyway. It's somewhere on the TODO list, though...

AngledLuffa commented Dec 01 '23

I'll have results later this morning for the Perseus POS trained on a few different Ancient Greek transformers. I can also do the same thing for depparse, and there's even time to include those models in the upcoming 1.7.0 release. I don't have time over this weekend to build a pretrained charlm (probably from something like https://figshare.com/articles/dataset/The_Diorisis_Ancient_Greek_Corpus/6187256), but that can be an action item for later.

AngledLuffa commented Dec 01 '23

@AngledLuffa I will start over the weekend on PEFT for POS and depparse, taking a hopefully good pretrained BERT as a starting point. Once you've explored some Ancient Greek transformers, don't hesitate to LMK what you would recommend; I will also dig into this a little later on my own.

Jemoka commented Dec 01 '23

It took my little computer over a day to reproduce the benchmark, so I might try running the BERT one on my work's cluster with GPUs…

pseudomonas commented Dec 01 '23

Yes, running on GPUs would make this process a lot faster; also, the upcoming PEFT work (in theory; results/benchmarks TBD) should make inference a smidge faster, because it's multiplying fewer parameters.

Jemoka commented Dec 01 '23

So far, I would say the pranaydeeps model improves scores the most, but I will give a full report after a few more model trainings

AngledLuffa commented Dec 01 '23

As listed above, there are a few Ancient Greek transformers available on HF. Here are the dev scores on the POS & depparse tasks:

Model	POS	Depparse LAS
None	0.8812	0.7684
Microbert M	0.8883	0.7706
Microbert MX	0.8910	0.7755
Microbert MXP	0.8916	0.7742
Pranaydeeps Bert	0.9139	0.7987

I could not use https://huggingface.co/altsoph/bert-base-ancientgreek-uncased because of this error:

https://huggingface.co/altsoph/bert-base-ancientgreek-uncased/discussions/2

So based on those scores, I made the pranaydeeps model the default_accurate package. That will be available as part of the 1.7.0 release... I suppose we can even make a sneak peek of that available now:

https://test.pypi.org/project/stanza/1.7.0/
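(Once that's installed, loading the transformer-backed pipeline should be just a package switch; a sketch, assuming the package name above:)

import stanza

stanza.download('grc', package='default_accurate')
nlp = stanza.Pipeline('grc', package='default_accurate')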

AngledLuffa commented Dec 02 '23

You will probably want to use a GPU for the default_accurate package, btw.

My takeaway from the rest of this thread is that there are a few separate directions for improvement still:

  • If possible, combine Perseus and Proiel training data to make a single model. I haven't heard anything definitive about whether or not that's a good idea in that issue I linked above
  • It sounds like people genuinely do have a use case for checkpointing of the models which take only a couple hours to train on a GPU, so I should make that available for the next version
  • PEFT integration with POS and depparse would hopefully get some decent improvements

At any rate, I don't think any of these are immediate TODOs, so hopefully we've improved the situation enough for now and we can leave the issue open in anticipation of future improvements.

AngledLuffa commented Dec 02 '23

Your baseline scores (Model == None) are rather higher than those on https://stanfordnlp.github.io/stanza/performance.html, assuming that POS is XPOS rather than UPOS; that page has UPOS = 92.41, XPOS = 85.13, LAS = 73.97.

pseudomonas commented Dec 02 '23

Those are test scores; these are dev scores. It didn't seem fair to pick a model based on how well it does on the test set.

The POS score is a weighted combination of UPOS, XPOS, and feats.

AngledLuffa commented Dec 02 '23

@AngledLuffa I wonder whether, if we trained the model with our newfangled EOS punctuation augmentation, it would also do better even on the Perseus dataset.

Jemoka commented Dec 03 '23

I believe that is the default now


AngledLuffa commented Dec 03 '23

I've found a different but related issue with both the Perseus and PROIEL parsers, which is that they perform incredibly badly with accents stripped out (they do things like processing definite articles and the most common adverbs as nouns).

Is there a way of using the data augmentation that makes them tolerant of line-final punctuation to make them tolerant of the absence of accents? My use-case for the parsers is processing manuscripts that lack accents.

The code I'm using is just

import unicodedata

def strip_accents(s):
    # NFD decomposition splits accents into combining marks (category Mn),
    # which are then dropped
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

though this might want some refinement, so that iota subscripts are randomly either removed or replaced by a normal iota.
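A sketch of what I mean (U+0345 is the combining iota subscript; the 50/50 split is an arbitrary choice):

import random
import unicodedata

YPOGEGRAMMENI = '\u0345'  # combining Greek ypogegrammeni (iota subscript)

def strip_accents_augmented(s):
    out = []
    for c in unicodedata.normalize('NFD', s):
        if c == YPOGEGRAMMENI:
            # randomly drop the subscript or promote it to a full iota
            if random.random() < 0.5:
                out.append('ι')
        elif unicodedata.category(c) != 'Mn':
            out.append(c)
    return ''.join(out)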

I'm also wondering whether unicode-decomposing the data before training would help it generalise.

pseudomonas commented Jan 11 '24

I can see how that would be a problem. However, how correct will we be able to make it if we use a pretrained embedding or even a transformer? The tokens / tokenizer will have the accents as well, I would think. And what about cases where multiple different accented texts map to the same text without accents?

Nevertheless, if you think it will help, I don't see any reason we can't provide a model like that using the augmentation mechanism.

AngledLuffa commented Jan 11 '24