
[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder

Open Narsil opened this issue 1 year ago • 4 comments

ByteLevel decoding causes issues: it corrupts some AddedTokens whose text contains characters from the UTF-8 range used in the byte-level mapping.

This commit tests the extent of the damage of ignoring the decoder for those tokens.
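A minimal pure-Python sketch of the corruption described above, under stated assumptions: `bytes_to_unicode` mirrors the well-known GPT-2 byte-level alphabet, and `byte_level_decode` is a hypothetical stand-in for what a ByteLevel decoder does (map each character back to its byte, then UTF-8 decode). An added token containing a character from that alphabet, e.g. "é", gets remapped to a raw byte and comes out mangled:

```python
# Sketch of how ByteLevel decoding can corrupt an AddedToken.
# bytes_to_unicode mirrors the GPT-2 byte-level alphabet;
# byte_level_decode is a hypothetical stand-in for the real decoder.

def bytes_to_unicode():
    """Map every byte 0-255 to a printable unicode character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("\xa1"), ord("\xac") + 1))
        + list(range(ord("\xae"), ord("\xff") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Bytes outside the printable ranges get shifted above 255.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

UNICODE_TO_BYTE = {c: b for b, c in bytes_to_unicode().items()}

def byte_level_decode(tokens):
    """Map each character back to its byte, then UTF-8 decode."""
    raw = bytes(UNICODE_TO_BYTE[ch] for tok in tokens for ch in tok)
    return raw.decode("utf-8", errors="replace")

# Ordinary byte-level tokens round-trip fine ("Ġ" encodes a space):
print(byte_level_decode(["Ġhello"]))  # " hello"

# But an added token containing "é" (U+00E9) is remapped to the
# single byte 0xE9, which is not valid UTF-8 on its own:
print(byte_level_decode(["café"]))  # "caf�"
```

Ignoring the decoder for added tokens (as this PR does) sidesteps the remapping entirely, at the cost of the behavior changes surfaced in the test runs below.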

Narsil avatar Apr 24 '24 15:04 Narsil

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Gents, when will this fix be ready, please?

Thanks, Steve

thusinh1969 avatar Apr 29 '24 00:04 thusinh1969

FWIW, I ran the tokenizer tests on transformers and didn't find any related errors in the test suite:

RUN_SLOW=1 pytest -sv tests/ -k tokenizers
FAILED tests/tokenization/test_tokenization_utils.py::TokenizerUtilsTest::test_pretrained_tokenizers - AttributeError: type object 'GPT2Tokenizer' has no attribute 'max_model_input_sizes'
========== 1 failed, 562 passed, 36 skipped, 121 warnings in 72.36s (0:01:12) ==========

I figure this failure is unrelated to the PR.

Narsil avatar Apr 29 '24 13:04 Narsil

More tests, this time with actual failures:

pytest -sv tests/ -k tokenizer
FAILED tests/models/bartpho/test_tokenization_bartpho.py::BartphoTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/blenderbot_small/test_tokenization_blenderbot_small.py::BlenderbotSmallTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/idefics/test_processor_idefics.py::IdeficsProcessorTest::test_tokenizer_left_padding - AssertionError: '<unk[20 chars]><unk><unk><unk><unk...
FAILED tests/models/idefics/test_processor_idefics.py::IdeficsProcessorTest::test_tokenizer_padding - AssertionError: '<s>Describe this image.\nAssistant...
FAILED tests/models/luke/test_tokenization_luke.py::LukeTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/mluke/test_tokenization_mluke.py::MLukeTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/mpnet/test_tokenization_mpnet.py::MPNetTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/speech_to_text/test_tokenization_speech_to_text.py::SpeechToTextTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/speech_to_text_2/test_tokenization_speech_to_text_2.py::SpeechToTextTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/speecht5/test_tokenization_speecht5.py::SpeechT5TokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/vits/test_tokenization_vits.py::VitsTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/wav2vec2/test_tokenization_wav2vec2.py::Wav2Vec2CTCTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/whisper/test_tokenization_whisper.py::WhisperTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
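Most of the failures above are the same `test_clean_up_tokenization_spaces` assertion. As a hedged approximation (not claiming this is the exact transformers source), the clean-up step that `clean_up_tokenization_spaces` toggles strips the spaces that naive token-by-token decoding leaves around punctuation and contractions, which is why the expected string has `'ll` where the actual output has `' ll`:

```python
# Approximation of the transformers clean-up step toggled by
# clean_up_tokenization_spaces: collapse spaces that per-token
# decoding leaves around punctuation and contractions.

def clean_up_tokenization(out_string: str) -> str:
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

decoded = "[CLS] this shouldn ' t go . [SEP]"
print(clean_up_tokenization(decoded))  # [CLS] this shouldn't go. [SEP]
```

The assertion diffs also show `[CLS]this` vs `[CLS] thi...`, suggesting the change additionally affects spacing around the special tokens themselves, not just the clean-up step.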

Narsil avatar Apr 29 '24 16:04 Narsil

The failing test is basically the same one run on BERT.

ArthurZucker avatar May 06 '24 09:05 ArthurZucker

I'll fix it.

ArthurZucker avatar May 17 '24 10:05 ArthurZucker