Ita Zaporozhets
Hi @ElleLeonne! I am unable to reproduce the error with the code snippet provided. I only observe the following warning: ``` Token indices sequence length is longer than the specified...
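For reference, a minimal sketch of how that warning typically surfaces (the checkpoint name and input here are illustrative, not the original snippet): it fires whenever the encoded sequence exceeds the model's maximum length and no truncation is requested, and it is harmless for tokenization itself.

```python
# Hypothetical reproduction of the warning only (not the reported error):
# encoding an over-long input without truncation triggers the
# "Token indices sequence length is longer than the specified..." message.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224")  # illustrative checkpoint
ids = tokenizer("hello " * 5000)["input_ids"]  # warns, but still returns the full id list
print(len(ids))
```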
@NielsRogge Thank you for your patience! I'm looking into the failing tests now
@NielsRogge Upon further inspection of the failing tests, the Rust tokenizer is not equivalent to the Python tokenizer. There are some key issues/differences, including: 1. The `SiglipConverter.normalizer` is dropping all...
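For anyone following along, the failing tests boil down to a slow-vs-fast parity check along these lines (a sketch, assuming a SigLIP checkpoint with both tokenizers available):

```python
# Sketch of the parity the tests expect; with the issues above, the
# Rust (fast) output currently diverges from the Python (slow)
# SentencePiece output on inputs like this.
from transformers import AutoTokenizer

ckpt = "google/siglip-base-patch16-224"  # illustrative checkpoint
slow = AutoTokenizer.from_pretrained(ckpt, use_fast=False)
fast = AutoTokenizer.from_pretrained(ckpt, use_fast=True)

text = "A photo of 2 cats, sitting!"
assert slow.tokenize(text) == fast.tokenize(text), (slow.tokenize(text), fast.tokenize(text))
```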
@NielsRogge In that function, the `text = text.strip()` is causing the discrepancy. In `PreTrainedTokenizer.tokenize()`, the input string gets split on special tokens. Then, your `canonicalize_text` may be called on...
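A toy illustration of that interaction (simplified, with a stand-in `canonicalize_text` that keeps only the problematic step): once `tokenize()` splits the input on special tokens, a per-chunk `strip()` drops the whitespace that sat next to the special token, so the two sides stop matching.

```python
import re

def canonicalize_text(text: str) -> str:
    # stand-in for the real function: only the strip() step that matters here
    return text.strip()

text = "hello <eos> world"
# PreTrainedTokenizer.tokenize() splits on special tokens first...
chunks = re.split(r"(<eos>)", text)  # ['hello ', '<eos>', ' world']
# ...and canonicalize_text is then applied per chunk, losing the spaces
processed = [c if c == "<eos>" else canonicalize_text(c) for c in chunks]
print(processed)  # ['hello', '<eos>', 'world']
```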
@NielsRogge Yes I can look into this! 😄 I may follow up to ask about expected behaviour of siglip 👀
@NielsRogge Could you please rebase when you have a chance? I don't have permission to push a rebase to this branch!
Summary of changes:
- `test_chat_template_return_assistant_tokens_mask` skipped because siglip strips the punctuation used in chat templates, and this test is too specific about matching punctuation characters like the pipe
- `self.assertNotEqual(sp_tokens, tokens)`...
@ArthurZucker @NielsRogge we cannot merge this until we merge a feature that supports loading a fast tokenizer without a fast tokenizer class specified in the tokenizer_config file! We need either https://github.com/huggingface/transformers/pull/34212 or https://github.com/huggingface/transformers/pull/33751 merged...
Hello @pppppkun! I'm not able to reproduce this issue. I cloned the repo and copied it to a local folder (`home/chatglm3-6b`), and it loads correctly without network access with...
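Roughly what I ran (a sketch; the path is the local folder from above, and `local_files_only=True` is added here just to prove no network call is made):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "home/chatglm3-6b",      # the local folder the repo was copied into
    trust_remote_code=True,  # chatglm3-6b ships custom tokenizer code
    local_files_only=True,   # errors instead of silently hitting the Hub
)
```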
I found that this is caused by setting `add_prefix_space=False` in `GGUFLlamaConverter`. In turn, `from_slow=True` is then forced by #28010. I checked loading from `"meta-llama/Meta-Llama-3-8B"` and I don't believe...
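For anyone who wants to reproduce the comparison, a sketch along these lines (the GGUF repo and file name are hypothetical placeholders):

```python
from transformers import AutoTokenizer

# Tokenizer built through GGUFLlamaConverter (hypothetical repo/file names)
gguf_tok = AutoTokenizer.from_pretrained(
    "some-org/Meta-Llama-3-8B-GGUF",
    gguf_file="meta-llama-3-8b.Q4_K_M.gguf",
)
# Reference tokenizer from the original repo
hf_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "Hello world"
print(gguf_tok.tokenize(text))  # check whether a prefix space is added
print(hf_tok.tokenize(text))
```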