
[BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder

Open Narsil opened this issue 1 year ago • 4 comments

ByteLevel decoding causes issues: it corrupts some AddedTokens whose text contains characters from the UTF-8 range used in the byte-level mapping.

This commit tests the extent of the damage of ignoring the decoder for those tokens.
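A minimal pure-Python sketch of the corruption described above, under stated assumptions: `bytes_to_unicode` mirrors the well-known GPT-2 byte-level alphabet, and `byte_level_decode` is a hypothetical stand-in for what a ByteLevel decoder does (map each character back to its byte, then UTF-8 decode). An added token containing a character from that alphabet, e.g. "é", gets remapped to a raw byte and comes out mangled:

```python
# Sketch of how ByteLevel decoding can corrupt an AddedToken.
# bytes_to_unicode mirrors the GPT-2 byte-level alphabet;
# byte_level_decode is a hypothetical stand-in for the real decoder.

def bytes_to_unicode():
    """Map every byte 0-255 to a printable unicode character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("\xa1"), ord("\xac") + 1))
        + list(range(ord("\xae"), ord("\xff") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Bytes outside the printable ranges get shifted above 255.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

UNICODE_TO_BYTE = {c: b for b, c in bytes_to_unicode().items()}

def byte_level_decode(tokens):
    """Map each character back to its byte, then UTF-8 decode."""
    raw = bytes(UNICODE_TO_BYTE[ch] for tok in tokens for ch in tok)
    return raw.decode("utf-8", errors="replace")

# Ordinary byte-level tokens round-trip fine ("Ġ" encodes a space):
print(byte_level_decode(["Ġhello"]))  # " hello"

# But an added token containing "é" (U+00E9) is remapped to the
# single byte 0xE9, which is not valid UTF-8 on its own:
print(byte_level_decode(["café"]))  # "caf�"
```

Ignoring the decoder for added tokens (as this PR does) sidesteps the remapping entirely, at the cost of the behavior changes surfaced in the test runs below.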

Narsil avatar Apr 24 '24 15:04 Narsil

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Gents, when will this fix be ready, please?

Thanks, Steve

thusinh1969 avatar Apr 29 '24 00:04 thusinh1969

FWIW, I ran the tokenizer tests on transformers and didn't find any related errors in the test suite:

RUN_SLOW=1 pytest -sv tests/ -k tokenizers
FAILED tests/tokenization/test_tokenization_utils.py::TokenizerUtilsTest::test_pretrained_tokenizers - AttributeError: type object 'GPT2Tokenizer' has no attribute 'max_model_input_sizes'
========== 1 failed, 562 passed, 36 skipped, 121 warnings in 72.36s (0:01:12) ==========

I figure this failure is unrelated to the PR.

Narsil avatar Apr 29 '24 13:04 Narsil

More tests, this time with actual failures:

pytest -sv tests/ -k tokenizer
FAILED tests/models/bartpho/test_tokenization_bartpho.py::BartphoTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/blenderbot_small/test_tokenization_blenderbot_small.py::BlenderbotSmallTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/idefics/test_processor_idefics.py::IdeficsProcessorTest::test_tokenizer_left_padding - AssertionError: '<unk[20 chars]><unk><unk><unk><unk...
FAILED tests/models/idefics/test_processor_idefics.py::IdeficsProcessorTest::test_tokenizer_padding - AssertionError: '<s>Describe this image.\nAssistant...
FAILED tests/models/luke/test_tokenization_luke.py::LukeTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/mluke/test_tokenization_mluke.py::MLukeTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/mpnet/test_tokenization_mpnet.py::MPNetTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/speech_to_text/test_tokenization_speech_to_text.py::SpeechToTextTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/speech_to_text_2/test_tokenization_speech_to_text_2.py::SpeechToTextTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/speecht5/test_tokenization_speecht5.py::SpeechT5TokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/vits/test_tokenization_vits.py::VitsTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/wav2vec2/test_tokenization_wav2vec2.py::Wav2Vec2CTCTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
FAILED tests/models/whisper/test_tokenization_whisper.py::WhisperTokenizerTest::test_clean_up_tokenization_spaces - assert "[CLS]this sh...' ll go.[SEP]" == "[CLS] thi...
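Most of the failures above are the same `test_clean_up_tokenization_spaces` assertion. As a hedged approximation (not claiming this is the exact transformers source), the clean-up step that `clean_up_tokenization_spaces` toggles strips the spaces that naive token-by-token decoding leaves around punctuation and contractions, which is why the expected string has `'ll` where the actual output has `' ll`:

```python
# Approximation of the transformers clean-up step toggled by
# clean_up_tokenization_spaces: collapse spaces that per-token
# decoding leaves around punctuation and contractions.

def clean_up_tokenization(out_string: str) -> str:
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

decoded = "[CLS] this shouldn ' t go . [SEP]"
print(clean_up_tokenization(decoded))  # [CLS] this shouldn't go. [SEP]
```

The assertion diffs also show `[CLS]this` vs `[CLS] thi...`, suggesting the change additionally affects spacing around the special tokens themselves, not just the clean-up step.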

Narsil avatar Apr 29 '24 16:04 Narsil

The failing test is basically the same one run on BERT.

ArthurZucker avatar May 06 '24 09:05 ArthurZucker

I'll fix it.

ArthurZucker avatar May 17 '24 10:05 ArthurZucker