[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder
The ByteLevel decoder was mangling some AddedTokens whose content overlaps with the UTF-8 range used by the byte-level mapping.
This commit tests the extent of the damage of ignoring the decoder for those tokens.
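As a rough illustration of the failure mode (the tokenizer and the added-token string are hypothetical, not taken from this PR):

```python
# Minimal sketch of the failure mode, assuming a GPT-2-style tokenizer
# (byte-level BPE with a ByteLevel decoder); the added token is made up.
from tokenizers import Tokenizer, AddedToken

tok = Tokenizer.from_pretrained("gpt2")

# "Ġ" is one of the printable code points ByteLevel uses to represent raw
# bytes (here, a leading space). An added token containing it collides
# with that mapping.
tok.add_tokens([AddedToken("Ġspecial", normalized=False)])

ids = tok.encode("Ġspecial").ids
# Previously the ByteLevel decoder also ran over the added token's surface
# form, re-interpreting "Ġ" as the byte it stands for and corrupting the
# round-trip; with added tokens bypassing the decoder, the string comes
# back verbatim.
print(tok.decode(ids, skip_special_tokens=False))
```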
Gents, when will this fix be ready, please?
Thanks, Steve
FWIW I ran the tokenizers tests on transformers and didn't find any related errors in the test suite:
RUN_SLOW=1 pytest -sv tests/ -k tokenizers
FAILED tests/tokenization/test_tokenization_utils.py::TokenizerUtilsTest::test_pretrained_tokenizers - AttributeError: type object 'GPT2Tokenizer' has no attribute 'max_model_input_sizes'
============================================================================================================ 1 failed, 562 passed, 36 skipped, 121 warnings in 72.36s (0:01:12) ============================================================================================================
I checked; that failure is unrelated to this change.
More tests, this time with actual failures:
pytest -sv tests/ -k tokenizer
FAILED tests/models/bartpho/test_tokenization_bartpho.py::BartphoTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/blenderbot_small/test_tokenization_blenderbot_small.py::BlenderbotSmallTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/idefics/test_processor_idefics.py::IdeficsProcessorTest::test_tokenizer_left_padding - AssertionError: '<unk[20 chars]><unk><unk><unk><unk...
FAILED tests/models/idefics/test_processor_idefics.py::IdeficsProcessorTest::test_tokenizer_padding - AssertionError: '<s>Describe this image.\nAssistant...
FAILED tests/models/luke/test_tokenization_luke.py::LukeTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/mluke/test_tokenization_mluke.py::MLukeTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/mpnet/test_tokenization_mpnet.py::MPNetTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/speech_to_text/test_tokenization_speech_to_text.py::SpeechToTextTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/speech_to_text_2/test_tokenization_speech_to_text_2.py::SpeechToTextTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/speecht5/test_tokenization_speecht5.py::SpeechT5TokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/vits/test_tokenization_vits.py::VitsTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/wav2vec2/test_tokenization_wav2vec2.py::Wav2Vec2CTCTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/whisper/test_tokenization_whisper.py::WhisperTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
The failing test is basically the same one run on BERT; I'll fix it.
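For reference, the check is roughly the following (sketched against BertTokenizer as the reference case mentioned above; the input string is an approximation of the one in the test):

```python
# Rough sketch of what test_clean_up_tokenization_spaces asserts; the
# tokenizer and input string here are illustrative, not the exact fixture.
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
ids = tok.encode("this shouldn't be! he'll go.")

# With cleanup enabled, decode collapses the spaces that tokenization
# inserts around punctuation and contractions; with it disabled, the raw
# wordpiece joins are kept.
with_cleanup = tok.decode(ids, clean_up_tokenization_spaces=True)
without_cleanup = tok.decode(ids, clean_up_tokenization_spaces=False)
print(with_cleanup)     # e.g. "[CLS] this shouldn't be! he'll go. [SEP]"
print(without_cleanup)  # e.g. "[CLS] this shouldn ' t be ! he ' ll go . [SEP]"
```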