Ita Zaporozhets
Hi @ElleLeonne! I am unable to reproduce the error with the code snippet provided. I only observe the following warning: ``` Token indices sequence length is longer than the specified...
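For reference, a minimal sketch of how that warning typically surfaces (the checkpoint name and input here are illustrative, not the original snippet): it fires whenever the encoded sequence exceeds the model's maximum length and no truncation is requested, and it is harmless for tokenization itself.

```python
# Hypothetical reproduction of the warning only (not the reported error):
# encoding an over-long input without truncation triggers the
# "Token indices sequence length is longer than the specified..." message.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224")  # illustrative checkpoint
ids = tokenizer("hello " * 5000)["input_ids"]  # warns, but still returns the full id list
print(len(ids))
```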
@NielsRogge Thank you for your patience! I'm looking into the failing tests now
@NielsRogge Upon further inspection of the failing tests, the Rust tokenizer is not equivalent to the Python tokenizer. There are some key issues/differences, including: 1. The `SiglipConverter.normalizer` is dropping all...
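For anyone following along, the failing tests boil down to a slow-vs-fast parity check along these lines (a sketch, assuming a SigLIP checkpoint with both tokenizers available):

```python
# Sketch of the parity the tests expect; with the issues above, the
# Rust (fast) output currently diverges from the Python (slow)
# SentencePiece output on inputs like this.
from transformers import AutoTokenizer

ckpt = "google/siglip-base-patch16-224"  # illustrative checkpoint
slow = AutoTokenizer.from_pretrained(ckpt, use_fast=False)
fast = AutoTokenizer.from_pretrained(ckpt, use_fast=True)

text = "A photo of 2 cats, sitting!"
assert slow.tokenize(text) == fast.tokenize(text), (slow.tokenize(text), fast.tokenize(text))
```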
@NielsRogge In that function, the `text = text.strip()` is causing the discrepancy. In `PreTrainedTokenizer.tokenize()`, the input string gets split on special tokens. Then, your `canonicalize_text` may be called on...
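A toy illustration of that interaction (simplified, with a stand-in `canonicalize_text` that keeps only the problematic step): once `tokenize()` splits the input on special tokens, a per-chunk `strip()` drops the whitespace that sat next to the special token, so the two sides stop matching.

```python
import re

def canonicalize_text(text: str) -> str:
    # stand-in for the real function: only the strip() step that matters here
    return text.strip()

text = "hello <eos> world"
# PreTrainedTokenizer.tokenize() splits on special tokens first...
chunks = re.split(r"(<eos>)", text)  # ['hello ', '<eos>', ' world']
# ...and canonicalize_text is then applied per chunk, losing the spaces
processed = [c if c == "<eos>" else canonicalize_text(c) for c in chunks]
print(processed)  # ['hello', '<eos>', 'world']
```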
@NielsRogge Yes I can look into this! 😄 I may follow up to ask about expected behaviour of siglip 👀
@NielsRogge Could you please rebase when you have a chance? I don't have permission to push a rebase to this branch!
Summary of changes:
- `test_chat_template_return_assistant_tokens_mask` skipped because siglip strips the punctuation used in chat templates, and this test is too specific about matching punctuation characters like the pipe
- `self.assertNotEqual(sp_tokens, tokens)`...
@ArthurZucker @NielsRogge we cannot merge this until we merge a feature that supports loading a fast tokenizer without a fast tokenizer class specified in the tokenizer_config file! We need either https://github.com/huggingface/transformers/pull/34212 or https://github.com/huggingface/transformers/pull/33751 merged...
Hello @pppppkun! I'm not able to reproduce this issue. I cloned the repo and copied it to a local folder (`home/chatglm3-6b`), and it loads correctly without network access with...
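Roughly what I ran (a sketch; the path is the local folder from above, and `local_files_only=True` is added here just to prove no network call is made):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "home/chatglm3-6b",      # the local folder the repo was copied into
    trust_remote_code=True,  # chatglm3-6b ships custom tokenizer code
    local_files_only=True,   # errors instead of silently hitting the Hub
)
```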
I found that this is caused by setting `add_prefix_space=False` in `GGUFLlamaConverter`. In turn, `from_slow=True` is then forced by #28010. I checked loading from `"meta-llama/Meta-Llama-3-8B"` and I don't believe...
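For anyone who wants to reproduce the comparison, a sketch along these lines (the GGUF repo and file name are hypothetical placeholders):

```python
from transformers import AutoTokenizer

# Tokenizer built through GGUFLlamaConverter (hypothetical repo/file names)
gguf_tok = AutoTokenizer.from_pretrained(
    "some-org/Meta-Llama-3-8B-GGUF",
    gguf_file="meta-llama-3-8b.Q4_K_M.gguf",
)
# Reference tokenizer from the original repo
hf_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "Hello world"
print(gguf_tok.tokenize(text))  # check whether a prefix space is added
print(hf_tok.tokenize(text))
```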