Arthur
Regarding efficiency, I'll check as well; the `ignore_merges` option should improve it anyway.
Regarding the newly added token, the "issue" is that you need to make sure you add the correct representation of the string:

```python3
>>> from tokenizers import AddedToken, pre_tokenizers
>>> ...
```
Since the strings are pre-tokenized into their byte-level representation (it's not a normalization), you need to add the token using `pre_tokenizers.ByteLevel(False, False).pre_tokenize_str`.
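To illustrate what that byte-level mapping does, here is a minimal sketch (the keyword names `add_prefix_space` and `use_regex` correspond to the two `False` positional arguments above):

```python
from tokenizers import pre_tokenizers

# Disable the prefix space and the GPT-2 split regex, i.e.
# pre_tokenizers.ByteLevel(False, False).
pre_tok = pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False)

# Byte-level pre-tokenization rewrites each UTF-8 byte as a printable
# character: "á" (bytes 0xC3 0xA1) becomes the two characters "Ã¡".
pieces = pre_tok.pre_tokenize_str("Bác")
print(pieces[0][0])  # "BÃ¡c"
```

This byte-level string is what the vocabulary actually stores, which is why adding the raw `"Bác"` without accounting for it can go wrong.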
Mmm no then it's not added properly, let me try again, sorry forgot to check the ids
Ok:

```python
>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("Bác", normalized=False, special=False))
>>> tokenizer.encode("Bác")
128256  # a new token
```

this is...
Thanks, I'll take that into account when refactoring.
Just started working on this! 😉
Sorry! Seems like I had to postpone this! If anyone wants to take over, feel free to do it; otherwise it will be my priority once #23909 is merged!
More delays given the recent sprints! But I think it should calm down during this summer! 😉
Hey! Pretty sure this was fixed on main!