Arthur


Regarding efficiency, I'll check as well; the `ignore_merges` option should improve it anyway.
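
For reference, a minimal sketch of what I mean by `ignore_merges` (the toy vocab/merges are placeholders, not the actual model files; the flag needs a recent `tokenizers` release):

```python
# Sketch: the `ignore_merges` flag on the BPE model lets words that are already
# in the vocab be matched directly instead of being rebuilt through the merge loop.
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Tiny placeholder vocab/merges, just to show where the flag goes.
vocab = {"a": 0, "b": 1, "ab": 2}
merges = [("a", "b")]
tokenizer = Tokenizer(BPE(vocab, merges, ignore_merges=True))

print(tokenizer.encode("ab").tokens)  # "ab" is taken straight from the vocab
```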

Regarding the newly added token, the "issue" is that you need to make sure you add the correct representation of the string:
```python3
>>> from tokenizers import AddedToken, pre_tokenizers
>>> ...
```

Since the strings are pre-tokenized to their bytelevel representation (it's not a normalization), you need to add the token using `pre_tokenizers.ByteLevel(False, False).pre_tokenize_str`.
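
Roughly what that looks like (the string "Bác" is the one from the example below; the output shown is what I'd expect from the GPT-2 byte-to-unicode mapping, not copied from a run):

```python
>>> from tokenizers import pre_tokenizers
>>> # ByteLevel(add_prefix_space=False, use_regex=False): map the string to its bytelevel form
>>> pre_tokenizers.ByteLevel(False, False).pre_tokenize_str("Bác")
[('BÃ¡c', (0, 3))]
```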

Mmm no, then it's not added properly; let me try again, sorry, I forgot to check the ids.

Ok:
```python
>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("Bác", normalized=False, special=False))
>>> tokenizer.encode("Bác")
128256  # a new token
```
this is...
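
And a quick way to sanity-check that the new id maps back to the right string (again just a sketch of how I'd verify it, not from the original comment):

```python
>>> tokenizer.convert_ids_to_tokens(128256)  # should give back 'Bác'
>>> tokenizer.decode([128256])               # decoding the new id should round-trip as well
```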

Thanks, I'll take that into account when refactoring.

Just started working on this! 😉

Sorry! Seems like I had to postpone this! If anyone wants to take over, feel free to do so; otherwise it will be my priority once #23909 is merged!

More delays given the recent sprints! But I think it should calm down during this summer! 😉