Fix for slow tokenizer bug: extra spaces added when decoding a single id
What does this PR do?
Quick fix for a bug in the slow tokenizers: they insert extra spaces when the input to decode is a single token id.
Fixes #29489
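The symptom can be sketched with a toy model of the slow decode path (the vocab, the word-start marker handling, and the function names below are illustrative assumptions, not the library's actual code):

```python
# Toy sentencepiece-style vocab: "\u2581" marks the start of a word.
VOCAB = {1: "\u2581hello", 2: "\u2581world", 3: "!"}


def buggy_convert_tokens_to_string(tokens):
    # Turns the word-start marker into a space for EVERY token,
    # so a single token decodes with a spurious leading space.
    return "".join(t.replace("\u2581", " ") for t in tokens)


def fixed_convert_tokens_to_string(tokens):
    # Strip the marker from the very first token instead of
    # converting it into a space; later markers still become spaces.
    if tokens and tokens[0].startswith("\u2581"):
        tokens = [tokens[0][1:]] + tokens[1:]
    return "".join(t.replace("\u2581", " ") for t in tokens)


def decode(ids, convert_tokens_to_string):
    # Minimal stand-in for tokenizer.decode on a list of ids.
    return convert_tokens_to_string([VOCAB[i] for i in ids])
```

Under this model, `decode([1], buggy_convert_tokens_to_string)` yields `" hello"` with a leading space, while the fixed path yields `"hello"`; multi-token inputs decode identically either way.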
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Who can review?
@ArthurZucker
cc @itazap as well!
Thanks for the quick update 🤗 and for merging the tests! I left a few comments about the single special token case; let me know what you think!
No worries, I'll do the changes :wink:
@ArthurZucker and @LysandreJik merge time please :wink:
Gentle ping @itazap, can we do the merge? Some commits from main were failing this branch, but it looks like everything is fixed now. Can we merge before any more breaking changes come in? :grin: :grin: :grimacing:
@DuyguA Sorry for the delay! Merged !! 🚀 Thanks for working on this 🤗
🎉 Congrats @DuyguA! This issue has really been a long journey.