Arthur
Regarding efficiency, I'll check as well; the `ignore_merges` option should improve it anyway.
Regarding the newly added token, the "issue" is that you need to make sure you add the correct representation of the string:

```python3
>>> from tokenizers import AddedToken, pre_tokenizers
>>> ...
```
Since the strings are pre-tokenized into their byte-level representation (it's not a normalization), you need to add the token using `pre_tokenizers.ByteLevel(False, False).pre_tokenize_str`.
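To illustrate what that byte-level mapping does, here is a minimal sketch (the keyword names `add_prefix_space` and `use_regex` correspond to the two `False` positional arguments above):

```python
from tokenizers import pre_tokenizers

# Disable the prefix space and the GPT-2 split regex, i.e.
# pre_tokenizers.ByteLevel(False, False).
pre_tok = pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False)

# Byte-level pre-tokenization rewrites each UTF-8 byte as a printable
# character: "á" (bytes 0xC3 0xA1) becomes the two characters "Ã¡".
pieces = pre_tok.pre_tokenize_str("Bác")
print(pieces[0][0])  # "BÃ¡c"
```

This byte-level string is what the vocabulary actually stores, which is why adding the raw `"Bác"` without accounting for it can go wrong.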
Mmm no then it's not added properly, let me try again, sorry forgot to check the ids
Ok:

```python
>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("Bác", normalized=False, special=False))
>>> tokenizer.encode("Bác")
128256  # a new token
```

this is...
Thanks, I'll take that into account when refactoring.
Just started working on this! 😉
Sorry! Seems like I had to postpone this! If anyone wants to take over, feel free to do it; otherwise it will be my priority once #23909 is merged!
More delays given the recent sprints! But I think it should calm down during this summer! 😉
Hey! Pretty sure this was fixed on main!