`clean_up_tokenization` too many false positives
System Info
The method `PreTrainedTokenizerBase.clean_up_tokenization` attempts to fix spacing around some quote marks, but breaks quite a lot of the time.
I'm testing various tokenization techniques, searching for the holy grail of `original == decode(encode(original))`.
Looping through docs in OpenWebText, here are some of the results:

The fix is pretty easy: instead of doing `text.replace(" 's", "'s")`, do `re.sub(r" 's\b", "'s", text)`.
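For instance, a minimal illustration of the difference (using the `'m` rule, since that's the one the reproduction below trips over):

```python
import re

text = "asking why 'my people' wanted"

# Current approach: the plain replace for the "'m" contraction also fires
# inside "'my", gluing the quote onto the previous word.
print(text.replace(" 'm", "'m"))
# -> asking why'my people' wanted

# With a word boundary, the rule only fires on a real contraction.
print(re.sub(r" 'm\b", "'m", text))
# -> asking why 'my people' wanted   (left alone, as it should be)
print(re.sub(r" 'm\b", "'m", "i 'm testing"))
# -> i'm testing
```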
I note that this has already been logged and then auto-closed here: https://github.com/huggingface/transformers/issues/6164
Please let me know if you would like to hear my thoughts about auto closing bugs :)
Who can help?
@ArthurZucker
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
For any tokenizer `tok`, note the output of:

```python
tok.decode(tok("asking why 'my people' wanted").input_ids)
```
Expected behavior
Output should be `"asking why 'my people' wanted"`, not `"asking why'my people' wanted"`.
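As a self-contained script (the checkpoint here is just an illustrative choice, and whether the cleanup runs by default depends on your `transformers` version):

```python
from transformers import AutoTokenizer

# gpt2 is only an example; its decoder reproduces the input text exactly,
# so any corruption below comes from the cleanup step, not the tokenizer.
tok = AutoTokenizer.from_pretrained("gpt2")

ids = tok("asking why 'my people' wanted").input_ids
print(tok.decode(ids))
# With the legacy cleanup enabled (the default when this issue was filed):
# -> asking why'my people' wanted
```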
Hey! Thanks for pointing this out! I do agree with you on this one, and I'm also guessing that if you have `he said that 'revenge' was over`, the same problem will occur.
Currently this is what is being used:

```python
out_string = (
    out_string.replace(" .", ".")
    .replace(" ?", "?")
    .replace(" !", "!")
    .replace(" ,", ",")
    .replace(" ' ", "'")
    .replace(" n't", "n't")
    .replace(" 'm", "'m")
    .replace(" 's", "'s")
    .replace(" 've", "'ve")
    .replace(" 're", "'re")
)
```
So a lot of patterns are going to get swallowed up by this.
cc @Narsil: is it not too breaking to switch to the `re` pattern? The same thing happens in `wordpiece.rs`.
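For concreteness, a boundary-aware variant could look something like this (a sketch of the proposed direction, not code from the library):

```python
import re

# Sketch: contraction rules gain a \b so they only fire when the
# contraction actually ends a word; punctuation rules are unchanged.
_CONTRACTIONS = ("n't", "'m", "'s", "'ve", "'re")

def clean_up_tokenization(out_string: str) -> str:
    out_string = (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
    )
    for suffix in _CONTRACTIONS:
        out_string = re.sub(rf" {re.escape(suffix)}\b", suffix, out_string)
    return out_string
```

Note the standalone `" ' "` rule stays inherently ambiguous even here (the decoder can't know which word the quote binds to), which is an argument for deprecating the cleanup rather than patching it.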
> holy grail of `original == decode(encode(original))`
The Bloom tokenizer achieves this, if that's what you're after, with the exception of a very old default: https://github.com/huggingface/transformers/pull/20846
@ArthurZucker I feel really bad about making changes to such old things. It's been in use for so long that I don't feel it's a bug anymore but a feature. Allowing users to disengage from the cleanup (and maybe making that the default for newly created tokenizers) is OK, but modifying existing behavior I don't feel good about. In theory I like it, but I'm fairly confident it will blow up as soon as it's released, and if it blows up a little later we'll be in an even worse position, since you'd then have two different behaviors and no good compromise.
My take is that the replace is bad, but the cleanup itself is also bad and should just not be used anymore (and for BC we should only modify future behavior, not the current one).
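For anyone who wants to disengage today, the cleanup can already be skipped per call via the existing `clean_up_tokenization_spaces` argument to `decode` (checkpoint choice again illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative checkpoint
ids = tok("asking why 'my people' wanted").input_ids

# Opt out of the legacy cleanup for this call only.
print(tok.decode(ids, clean_up_tokenization_spaces=False))
# -> asking why 'my people' wanted
```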
Yes, this method seems like a good candidate for the great deprecation, and we can see if we want to officially support something better.
I appreciate the reluctance to 'move fast and break things' - nice to see :)
As a user finding my way around the Hugging Face packages, it did strike me as odd that there was extra magic in the `transformers` tokenizer that wasn't in the underlying `tokenizers` tokenizer. It certainly makes troubleshooting difficult, so my humble vote would go toward deprecating.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.