
French GSD model considers word and punctuation as the same token

Open pvsimoes opened this issue 4 years ago • 8 comments

Describe the bug Since the 1.2.0 update, the French GSD model sometimes treats the last word of a sentence and the final punctuation as a single token.

To Reproduce

nlp_fr = stanza.Pipeline('fr', package='gsd', processors='tokenize')
phrase = "C'est mon ami."
doc = nlp_fr(phrase)

for s in doc.sentences:
    for t in s.tokens:
        print(t.text)

Observed Output:

C'
est
mon
ami.

Expected behavior I expected the output to treat the last word and the punctuation as two separate tokens, as happens with other phrases (see below):

nlp_fr = stanza.Pipeline('fr', package='gsd', processors='tokenize')
phrase = "Je n'apprends pas le français."
doc = nlp_fr(phrase)

for s in doc.sentences:
    for t in s.tokens:
        print(t.text)

Observed Output:

Je
n'
apprends
pas
le
français
.

Environment:

  • OS: MacOS
  • Python version: 3.7.3, also observed in Python 3.6.9 on Google Colab
  • Stanza version: 1.2.0

pvsimoes · Feb 03 '21

That's disappointing. This particular example worked in the previous version, at least. I'll look into it.

AngledLuffa · Feb 03 '21

Alright, I don't really have a good explanation looking at the data or the model. It isn't obviously mistrained, as it still scores high on the test set. The word "ami" shows up at the end of a sentence in the training data a couple times, once like "ami." and once like "ami !". The model should have learned to tokenize "ami." correctly from what I can see.

On the other hand, there are a lot of instances of "[a-zA-Z]+." as a single token, so maybe this is just it learning a bad rule from that training data. Do you have any other examples where it is doing this? I could put them all in a list of tokenization fixes and retrain the model using that file.
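
For anyone who wants to check that pattern themselves, here is a rough sketch that counts word-plus-period forms appearing as single tokens in the training data. The file name is only a placeholder; it assumes a local copy of the UD_French-GSD training file in CoNLL-U format.

import re
from collections import Counter

# Placeholder path: point this at your local copy of the UD_French-GSD train file.
TRAIN_FILE = "fr_gsd-ud-train.conllu"

pattern = re.compile(r"^[a-zA-Z]+\.$")
counts = Counter()

with open(TRAIN_FILE, encoding="utf-8") as f:
    for line in f:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        form = line.split("\t")[1]  # FORM is the second CoNLL-U column
        if pattern.match(form):
            counts[form] += 1

print(counts.most_common(20))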

Another option would be to combine a few treebanks into one dataset. One issue with that is the "features" detected in the POS tool are not consistent between datasets. It seems FTB, FQB, and Sequoia are consistent, but GSD (this dataset), ParTUT, and PUD are significantly different.

AngledLuffa · Feb 04 '21

Thank you for the answer. I have found two more examples, shown below.

Example 1:
Input: J’ai une douleur ici.
Observed Output:

J’ai
une
douleur
ici.

Example 2:
Input: Excusez-moi.
Observed Output:

Excusez-moi.

I have also found that adding a word before this last example makes it return the correct output:

Input: Pardon, excusez-moi.
Output:

Pardon
,
excusez-moi
.
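
For reference, a minimal sketch that runs all three of these inputs through the same tokenize-only GSD pipeline from the original report (the typographic apostrophe in the first phrase is kept as in the example):

import stanza

nlp_fr = stanza.Pipeline('fr', package='gsd', processors='tokenize')

# The first phrase uses the typographic apostrophe, as in the example above.
for phrase in ["J’ai une douleur ici.", "Excusez-moi.", "Pardon, excusez-moi."]:
    doc = nlp_fr(phrase)
    print(phrase, '->', [t.text for s in doc.sentences for t in s.tokens])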

pvsimoes · Feb 06 '21

These are both weird cases. "ici." as the last word of a sentence shows up in the existing training data, so I don't know why it's not processing that correctly. "J'ai" shows up many times in the training data and is tokenized each time, so that's also a weird error to make. I believe "excusez-moi" is supposed to be tokenized into two words by the GSD treebank standard, but I know epsilon French and have no idea if this is different from the other instances where "-moi" shows up and is tokenized separately.

I will add some fake data to the GSD treebank and see if retraining the tokenizer cleans up those examples. Please let us know if you find other incorrectly tokenized examples.

AngledLuffa · Feb 07 '21

So, retraining with a couple extra sentences didn't do anything for those exact cases, but I did notice the following:

"J'ai une douleur ici. J'ai une douleur ici."

DOES get correctly tokenized the first time. Apparently the continuing text is necessary for it to realize that there might be a tokenization there...
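
A quick sketch of that check, using the same pipeline settings as before and comparing the single sentence against the doubled text:

import stanza

nlp_fr = stanza.Pipeline('fr', package='gsd', processors='tokenize')

for text in ["J'ai une douleur ici.", "J'ai une douleur ici. J'ai une douleur ici."]:
    doc = nlp_fr(text)
    # Print one token list per detected sentence.
    print([[t.text for t in s.tokens] for s in doc.sentences])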

AngledLuffa · Feb 08 '21

We still need to figure out the best way to generalize this finding, but we realized that the problem is likely because the UD training sets are one giant monolithic paragraph instead of nicely split up. If I randomly split up the dataset into a few different paragraphs, it now works on the sentences you've sent us. As an added bonus, I added a little extra text to compensate for the unicode ' which was causing J'ai to be stuck together instead of tokenized.

http://nlp.stanford.edu/~horatio/fr_gsd_tokenizer.pt

You can drop that into stanza_resources/fr/tokenize/gsd.pt, although it should be pointed out that this will be overwritten if you run stanza.download("fr") again.
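
If it helps, here is a sketch of applying that workaround in Python. It assumes the default ~/stanza_resources model directory; adjust the path if you pass a custom model directory to stanza.

import urllib.request
from pathlib import Path

import stanza

url = "http://nlp.stanford.edu/~horatio/fr_gsd_tokenizer.pt"
# Assumes the default model directory; change this if you use a custom one.
target = Path.home() / "stanza_resources" / "fr" / "tokenize" / "gsd.pt"
target.parent.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(url, str(target))

# Re-create the pipeline so it loads the replaced tokenizer model.
# As noted above, running stanza.download("fr") again will overwrite the file.
nlp_fr = stanza.Pipeline('fr', package='gsd', processors='tokenize')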

AngledLuffa · Feb 09 '21

It is now working as expected, even if it is only a workaround. Thanks for the replies and the solution.

pvsimoes · Feb 18 '21

Glad to hear it. We made a more principled, less hacky version of the change and will apply it to future model releases.

AngledLuffa · Feb 18 '21