
fix: Text splitting in the BasicTokenizer

[Open] connor-henderson opened this pull request 1 year ago • 11 comments

What does this PR do?

Note: The tests are failing because of repo-consistency, but I intentionally haven't made the change across the repo until we confirm what change we want to make. Will remove [Don't merge] tag once done.

This fixes an issue related to the BasicTokenizer.

Initially, I looked to fix #22166 by updating the _run_split_on_punc method in the BasicTokenizer to split apostrophes without starting a new word. The issue noted that apostrophes weren't being split properly: should've was being converted to should, ', and ve instead of should and 've.

However, while adding tests it became apparent that the BasicTokenizer fails in other cases too, such as tokenizing '!!' as two separate tokens with id 256 instead of the single token 748, which is in the vocab.

To address these cases, I modified _run_split_on_punc in the BasicTokenizer to also split on passed patterns, renamed it _split_on_punc_or_pattern, and added tests.
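As a rough illustration of the pattern-splitting idea (this is a hypothetical sketch, not the actual transformers implementation; the function name and the patterns argument here are illustrative):

```python
import re
import unicodedata


def _is_punctuation(ch):
    """Treat ASCII symbol ranges and Unicode 'P*' categories as punctuation,
    mirroring the heuristic style used by BasicTokenizer."""
    cp = ord(ch)
    if (33 <= cp <= 47) or (58 <= cp <= 64) or (91 <= cp <= 96) or (123 <= cp <= 126):
        return True
    return unicodedata.category(ch).startswith("P")


def split_on_punc_or_pattern(text, patterns=None):
    """Split on punctuation, but first carve out substrings matching the
    given regex patterns (e.g. contractions like 've) so they survive
    as single tokens instead of being split at the apostrophe."""
    if patterns:
        combined = re.compile("|".join(patterns))
        pieces, last = [], 0
        for m in combined.finditer(text):
            if m.start() > last:
                pieces.append((text[last:m.start()], True))
            pieces.append((m.group(), False))  # protected: don't split further
            last = m.end()
        if last < len(text):
            pieces.append((text[last:], True))
    else:
        pieces = [(text, True)]

    out = []
    for piece, splittable in pieces:
        if not splittable:
            out.append(piece)
            continue
        current = ""
        for ch in piece:
            if _is_punctuation(ch):
                if current:
                    out.append(current)
                    current = ""
                out.append(ch)
            else:
                current += ch
        if current:
            out.append(current)
    return out
```

With a contraction pattern, "should've" stays as two tokens ("should", "'ve"); without it, the apostrophe becomes its own token, which is the behavior the issue describes.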

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [x] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@ArthurZucker

connor-henderson avatar Mar 20 '23 17:03 connor-henderson

The documentation is not available anymore as the PR was closed or merged.

I think this is still a fix we were interested in having for users who don't have ftfy installed.

sgugger avatar Mar 24 '23 14:03 sgugger

I think this is still a fix we were interested in having for users who don't have ftfy installed.

Oh ok reopening in that case

connor-henderson avatar Mar 24 '23 17:03 connor-henderson


Thanks for your comments Arthur! Looking back at it while making your suggested changes, I realized two things:

  • we actually don't need to add pattern splitting to the BasicTokenizer; we just need to not split on punctuation (i.e., stay out of the way of the byte-pair encoding that happens later)
  • the already-written test_check_encoding_slow_fast test is more comprehensive, so I used that instead; it now passes locally without ftfy. I'll call out the additional edits I had to make below.
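A toy sketch of the first point (the flag name mirrors the idea discussed in this PR, but this helper is hypothetical and far cruder than the real BasicTokenizer):

```python
def basic_pretokenize(text, do_split_on_punc=True):
    """When do_split_on_punc is False, only split on whitespace and leave
    punctuation attached, so the later byte-pair encoding step can merge
    sequences like 've or !! that exist as single entries in the BPE vocab."""
    words = text.strip().split()
    if not do_split_on_punc:
        return words
    out = []
    for word in words:
        current = ""
        for ch in word:
            if not ch.isalnum():  # crude punctuation test, for the sketch only
                if current:
                    out.append(current)
                    current = ""
                out.append(ch)
            else:
                current += ch
        if current:
            out.append(current)
    return out
```

The point is that pre-splitting at every punctuation character destroys information the BPE stage could otherwise use, whereas whitespace-only splitting preserves it.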

Will wait for your review again before replicating it in the other BasicTokenizers.

Also, would you like me to add more tests? That is, should every tokenizer that uses BasicTokenizer get a test checking that slow and fast outputs match, like the comprehensive check in CLIP mentioned above, or should they get a simpler, truncated version of it?

connor-henderson avatar Mar 28 '23 14:03 connor-henderson

Okay! I'll review again, can you make sure make quality and make repo-consistency both pass?

ArthurZucker avatar Apr 04 '23 09:04 ArthurZucker

Hey @ArthurZucker just checking in, anything else wanted here?

connor-henderson avatar Apr 14 '23 04:04 connor-henderson

Nope thanks for the ping, it is just that it is a lot of changes on a lot of models (a lot of old models too 😉 ). Getting to it!

ArthurZucker avatar Apr 14 '23 09:04 ArthurZucker

Update: just keeping this PR to the punc splitting param, reasoning below. Lmk if you have other thoughts!

Wrote a script I ran locally to get a directional sense of how much of a difference each of these three changes (punctuation splitting, removing control chars, normalizing) was having, to help choose how to address the above. The punctuation-splitting edit this PR primarily addresses does help a fair bit: it appears to increase compatibility with the CLIP fast tokenizer from roughly 20% to nearly 100% for the languages tested. The control-chars and normalizing edits don't appear to make much of a difference at all (0% and ~0.1% improvement, respectively). The analysis is imperfect, but from it I figure the cost-benefit for this PR suggests keeping it to just the punctuation splitting.

Also, I misspoke earlier saying control chars are in the CLIP vocab; they aren't. The discrepancy I was addressing between the basic tokenizer and the fast one is that the former strips them while the latter marks them as UNK. I don't believe this is likely to make much of a difference for inference or tokenization, as the script run suggests, since control chars are rare and UNK tokens don't give the model much information.

connor-henderson avatar Apr 21 '23 16:04 connor-henderson
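The comparison described above can be sketched generically. Assuming a list of sample texts and two tokenize callables (e.g. the slow and fast CLIP tokenizers), a hypothetical agreement metric of the kind the script computes might look like:

```python
def agreement_rate(texts, tokenize_a, tokenize_b):
    """Fraction of inputs on which two tokenizers produce identical output.
    A rough, hypothetical stand-in for the comparison script described above."""
    if not texts:
        return 1.0
    matches = sum(tokenize_a(t) == tokenize_b(t) for t in texts)
    return matches / len(texts)


# Toy usage with stand-in tokenizers (whitespace vs. lowercased whitespace);
# in the real comparison these would be the slow and fast tokenizers.
texts = ["Hello world", "hello world", "FOO bar"]
rate = agreement_rate(texts, str.split, lambda t: t.lower().split())
```

Running this per language over a shared corpus gives the kind of before/after percentages quoted above.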

Looking at the output for ar, it seems NEW + normalize is the best match, isn't it?

I think this proves that NFC is indeed a good addition which was previously missing!

Thanks a lot for this testing script, it adds a lot of value for future potential changes!

Narsil avatar Apr 21 '23 18:04 Narsil
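For reference on the NFC point, Python's standard unicodedata module can demonstrate the normalization in question (a standalone example, not taken from the transformers code):

```python
import unicodedata

# "café" spelled two ways: precomposed é vs. "e" + combining acute accent.
decomposed = "cafe\u0301"
precomposed = "caf\u00e9"

# Without normalization the strings (and hence their token streams) differ.
assert decomposed != precomposed

# NFC composes the combining sequence into the single precomposed character.
assert unicodedata.normalize("NFC", decomposed) == precomposed

# NFC is idempotent, so applying it in a tokenizer is safe to repeat.
assert unicodedata.normalize("NFC", precomposed) == precomposed
```

Applying NFC before tokenization makes both spellings map to the same tokens, which is presumably why it improves slow/fast agreement for languages with combining marks.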

I think this proves that NFC is indeed a good addition which was previously missing!

Thanks and sounds good, I'll put it back in. I had removed it only because I couldn't test all languages due to the time it would take so I wasn't certain if there could be other issues with it, and the improvements were somewhat modest.

connor-henderson avatar Apr 21 '23 18:04 connor-henderson

Hey @Narsil just checking in, anything else wanted here?

connor-henderson avatar May 15 '23 14:05 connor-henderson

No we just need a core maintainer's approval.

Sorry I forgot about this PR.

@sgugger for final review.

Narsil avatar May 15 '23 15:05 Narsil

Noticed the linked issue was marked stale, and this PR probably will be soon too. Is any other action wanted here?

I think, as the script shows, this will significantly improve how well the BasicTokenizer matches the fast one for CLIP. The lingering question is just whether the NFC normalization change is approved, or whether that part should be removed. @Narsil @sgugger

connor-henderson avatar Jun 16 '23 17:06 connor-henderson

Just waiting for clarification on whether @Narsil is fine with merging this or not. :-)

sgugger avatar Jun 16 '23 17:06 sgugger

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 11 '23 15:07 github-actions[bot]

Since @Narsil is not commenting, I'm guessing we can merge :man_shrugging:

sgugger avatar Jul 11 '23 15:07 sgugger

Oh sorry! Missed this one, it's OK!

Narsil avatar Jul 12 '23 08:07 Narsil