`clean_up_tokenization` too many false positives
System Info
The method `PreTrainedTokenizerBase.clean_up_tokenization` attempts to fix spacing around some quote marks, but breaks quite a lot of the time.
I'm testing various tokenization techniques, searching for the holy grail of `original == decode(encode(original))`.
Looping through docs in OpenWebText, here are some of the results:

The fix is pretty easy: instead of doing `text.replace(" 's", "'s")`, do `re.sub(r" 's\b", "'s", text)`.
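For instance, a minimal illustration of the difference (using the `'m` rule, since that's the one the reproduction below trips over):

```python
import re

text = "asking why 'my people' wanted"

# Current approach: the plain replace for the "'m" contraction also fires
# inside "'my", gluing the quote onto the previous word.
print(text.replace(" 'm", "'m"))
# -> asking why'my people' wanted

# With a word boundary, the rule only fires on a real contraction.
print(re.sub(r" 'm\b", "'m", text))
# -> asking why 'my people' wanted   (left alone, as it should be)
print(re.sub(r" 'm\b", "'m", "i 'm testing"))
# -> i'm testing
```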
I note that this has already been logged and then auto-closed here: https://github.com/huggingface/transformers/issues/6164
Please let me know if you would like to hear my thoughts about auto closing bugs :)
Who can help?
@ArthurZucker
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
For any tokenizer `tok`, note the output of:

```python
tok.decode(tok("asking why 'my people' wanted").input_ids)
```
Expected behavior
Output should be `"asking why 'my people' wanted"`, not `"asking why'my people' wanted"`.
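As a self-contained script (the checkpoint here is just an illustrative choice, and whether the cleanup runs by default depends on your `transformers` version):

```python
from transformers import AutoTokenizer

# gpt2 is only an example; its decoder reproduces the input text exactly,
# so any corruption below comes from the cleanup step, not the tokenizer.
tok = AutoTokenizer.from_pretrained("gpt2")

ids = tok("asking why 'my people' wanted").input_ids
print(tok.decode(ids))
# With the legacy cleanup enabled (the default when this issue was filed):
# -> asking why'my people' wanted
```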
Hey! Thanks for pointing this out! I do agree with you on this one, and I'm also guessing that if you have `he said that 'revenge' was over`, the same problem will occur.
Currently this is what is being used:

```python
out_string = (
    out_string.replace(" .", ".")
    .replace(" ?", "?")
    .replace(" !", "!")
    .replace(" ,", ",")
    .replace(" ' ", "'")
    .replace(" n't", "n't")
    .replace(" 'm", "'m")
    .replace(" 's", "'s")
    .replace(" 've", "'ve")
    .replace(" 're", "'re")
)
```
So a lot of patterns are going to get swallowed up by this.
cc @Narsil: is it not too breaking to switch to the `re` pattern? The same thing happens in `wordpiece.rs`.
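For concreteness, a boundary-aware variant could look something like this (a sketch of the proposed direction, not code from the library):

```python
import re

# Sketch: contraction rules gain a \b so they only fire when the
# contraction actually ends a word; punctuation rules are unchanged.
_CONTRACTIONS = ("n't", "'m", "'s", "'ve", "'re")

def clean_up_tokenization(out_string: str) -> str:
    out_string = (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
    )
    for suffix in _CONTRACTIONS:
        out_string = re.sub(rf" {re.escape(suffix)}\b", suffix, out_string)
    return out_string
```

Note the standalone `" ' "` rule stays inherently ambiguous even here (the decoder can't know which word the quote binds to), which is an argument for deprecating the cleanup rather than patching it.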
> holy grail of `original == decode(encode(original))`
The Bloom tokenizer achieves this, if that's what you're after, with the exception of a very old default: https://github.com/huggingface/transformers/pull/20846
@ArthurZucker I feel really bad about making changes to such old things. It's been in use for so long that I don't feel it's a bug anymore but a feature. Allowing users to disengage from the cleanup (and maybe making that the default for newly created tokenizers) is OK, but modifying existing behavior I don't feel good about. In theory I like it, but I'm fairly confident it will blow up as soon as it's released, and if it blows up a little later we'll be in an even worse position, since you'd then have two different behaviors and no good compromise.
My take is that the replace is bad, but the cleanup itself is also bad and should just not be used anymore (and for BC we should only modify future behavior, not the current one).
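For anyone who wants to disengage today, the cleanup can already be skipped per call via the existing `clean_up_tokenization_spaces` argument to `decode` (checkpoint choice again illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative checkpoint
ids = tok("asking why 'my people' wanted").input_ids

# Opt out of the legacy cleanup for this call only.
print(tok.decode(ids, clean_up_tokenization_spaces=False))
# -> asking why 'my people' wanted
```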
Yes, this method seems like a good candidate for the great deprecation, and we can see if we want to officially support something better.
I appreciate the reluctance to 'move fast and break things' - nice to see :)
As a user finding my way around the Hugging Face packages, it did strike me as odd that there was extra magic in the `transformers` tokenizer that wasn't in the underlying `tokenizers` tokenizer. It certainly makes troubleshooting difficult, so my humble vote would go toward deprecating.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.