
Mismatched token output using custom stanza tokenizer

Open yilunzhu opened this issue 3 years ago • 3 comments

Describe the bug: I trained a custom Stanza tokenizer and MWT model on UD_English-GUM. When I use the tokenizer and MWT model for inference, the surface form of a word changes. For example, the word "subcontractor's" is tokenized as "subcontratrr 's" in the following passage:

"The college is a state-funded uh uh remodel, and on state-funded remodels, we're required to pay prevailing wages. Uh prevailing wages, that, um, that indicate different levels of agility, of the different men working. And so, uh a lot of the crews, uh, like Mitchell, who have people that work under him, around town, in regular situations, come to the people like me, and ask us to do payroll for them. When we do the payroll for them, we state to them up front, that uh, we will pay the payroll, we will make the deductions, and then the employer contribution, which is approximately twenty-six percent, over and above the hourly wage, is also deducted, from the um subcontractor's check."

To Reproduce (steps to reproduce the behavior):

  1. Train the tokenizer on UD_English-GUM
  2. Use the saved en_gum_tokenizer.pt model on other plain text

Expected behavior: subcontractor's -> subcontractor 's

Environment (please complete the following information):

  • OS: CentOS 7
  • Python version: Python 3.7.11 from Anaconda
  • Stanza version: 1.3.0

Additional context: I also tried the newest Stanza version, 1.4.2, and the issue is still there.

yilunzhu avatar Sep 21 '22 00:09 yilunzhu

Can confirm, this currently happens with the model we trained from GUM + GUMReddit

import stanza

# Tokenizer + MWT only, using the GUM package
pipe = stanza.Pipeline("en", package="gum", processors="tokenize,mwt")
doc = pipe("When we do the payroll for them, we state to them up front, that uh, we will pay the payroll, we will make the deductions, and then the employer contribution, which is approximately twenty-six percent, over and above the hourly wage, is also deducted, from the um subcontractor's check.")
print([word.text for sentence in doc.sentences for word in sentence.words])  # look for the altered "subcontractor's"
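
For anyone who wants to check a longer text for this kind of surface-form change, here is a minimal sketch. It assumes a treebank like GUM where every MWT is a plain split of the token text, so the words of a token should concatenate back to the token. The helper name `find_surface_mismatches` is made up for illustration and is not part of Stanza; it takes plain `(token_text, word_texts)` pairs so it can be run without a model.

```python
from typing import Iterable, List, Tuple

TokenWords = Tuple[str, List[str]]


def find_surface_mismatches(tokens: Iterable[TokenWords]) -> List[TokenWords]:
    """Return the (token_text, word_texts) pairs whose words do not
    concatenate back to the original token text.

    Hypothetical helper: only meaningful when MWT expansions are
    expected to be pure splits of the token.
    """
    return [(text, words) for text, words in tokens if "".join(words) != text]


# Example using the output reported in this issue
pairs = [
    ("um", ["um"]),
    ("subcontractor's", ["subcontratrr", "'s"]),
    ("check", ["check"]),
]
print(find_surface_mismatches(pairs))
```

With a Stanza document, the pairs could be collected as `[(t.text, [w.text for w in t.words]) for s in doc.sentences for t in s.tokens]`.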

AngledLuffa avatar Sep 22 '22 19:09 AngledLuffa

Unfortunately, I think the timeline for fixing this has to be a couple of weeks from now at least. There is a lot on my plate, and I don't think anyone else will be able to look at it.

AngledLuffa avatar Sep 30 '22 05:09 AngledLuffa

I think this is now fixed, actually. I implemented a change where, at training time, it reviews all of the possible MWT expansions. If all tokens are expansions of the words that comprise the MWT, then at inference time it tries to rebuild the words using the raw text rather than the seq2seq output.
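
To illustrate the idea described above, here is a rough sketch. It is not the actual Stanza code: the function names `is_pure_split` and `rebuild_from_raw` are hypothetical, and the suffix-matching strategy is an assumption made for the example. The first function is the kind of check that could be run over the training data; the second uses the seq2seq prediction only to decide where to cut, and takes the characters from the raw token.

```python
from typing import List, Optional


def is_pure_split(token: str, words: List[str]) -> bool:
    """Training-time check (hypothetical): the words of this MWT are
    exactly the token text cut into pieces."""
    return "".join(words) == token


def rebuild_from_raw(token: str, predicted: List[str]) -> Optional[List[str]]:
    """Inference-time sketch (hypothetical): cut the raw token at the
    boundaries implied by the predicted words, matching from the right.

    Returns None when the prediction cannot be aligned to the raw text,
    in which case a caller would fall back to the seq2seq output.
    """
    if not predicted:
        return None
    remaining = token
    pieces: List[str] = []
    # Every word except the first must appear as a suffix of what is left.
    for word in reversed(predicted[1:]):
        if not word or len(word) >= len(remaining) or not remaining.endswith(word):
            return None
        pieces.append(word)
        remaining = remaining[: -len(word)]
    # The first word is whatever raw text remains, so seq2seq typos are dropped.
    pieces.append(remaining)
    return pieces[::-1]


print(is_pure_split("subcontractor's", ["subcontractor", "'s"]))
print(rebuild_from_raw("subcontractor's", ["subcontratrr", "'s"]))
```

In this sketch the garbled first word from the issue is replaced by the raw text, because only the "'s" suffix is used to find the split point.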

AngledLuffa avatar May 09 '24 05:05 AngledLuffa