Fix `all_special_tokens` to include special tokens from `added_tokens`
What does this PR do?
For Phi-3 models, some important special tokens (such as the chat-template token `<|assistant|>`) are not in `tokenizer.all_special_tokens`, even though they are present in the added vocabulary. This PR adds those special tokens from the added vocabulary to the special tokens list.
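A minimal sketch reproducing the issue (the checkpoint name is an assumption; any Phi-3 tokenizer that registers `<|assistant|>` as a special added token should behave the same way):

```python
# Minimal reproduction sketch (checkpoint name is an assumption).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# The template token is present in the added vocabulary...
print("<|assistant|>" in tok.get_added_vocab())   # True

# ...but before this fix it was missing from all_special_tokens.
print("<|assistant|>" in tok.all_special_tokens)  # False before the fix
```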
If I ran things correctly, the Rust tokenizer implementation now finds the special tokens as expected:
eprintln!("{:?}", self.special_tokens_set.iter().collect::<Vec<_>>());
> ["<|placeholder15|>", "<|endoftext|>", "<|placeholder13|>", "<|placeholder16|>", "<|placeholder11|>", "<|placeholder10|>", "<|system|>", "<|placeholder18|>", "<|placeholder23|>", "<|placeholder36|>", "<|image|>", "<|placeholder28|>", "<|placeholder29|>", "<|user|>", "<|placeholder3|>", "<|placeholder7|>", "<|placeholder20|>", "<unk>", "<|placeholder24|>", "<|placeholder25|>", "<|placeholder26|>", "<|placeholder1|>", "<|placeholder33|>", "<|placeholder37|>", "<|placeholder12|>", "<|placeholder32|>", "<|placeholder39|>", "<|placeholder27|>", "<|placeholder5|>", "<s>", "<|assistant|>", "<|placeholder38|>", "<|end|>", "<|placeholder14|>", "<|placeholder4|>", "<|placeholder9|>", "<|placeholder2|>", "<|placeholder22|>", "<|placeholder31|>", "<|placeholder34|>", "<|placeholder21|>", "<|placeholder8|>", "<|placeholder6|>", "<|placeholder17|>", "<|placeholder19|>", "<|placeholder30|>", "<|placeholder35|>"]
Before submitting
- [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
I'm not sure whether the original behavior is intended, though.
CI errors: I'm seeing unexpected failures in code paths I didn't touch.
cc @ArthurZucker @itazap
Hi, any updates? @itazap
Awesome, thanks @ita9naiwa! @ArthurZucker, good to merge? 🚀
Thanks @ArthurZucker! Something like `self._additional_special_tokens.extend(new_tokens)` in the `SpecialTokensMixin.add_tokens` function?
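A rough sketch of what that suggestion could look like (simplified and hypothetical; the real `add_tokens` also normalizes strings to `AddedToken` objects and updates the underlying vocabulary):

```python
# Hypothetical, simplified sketch of the suggested change; not the
# actual transformers implementation.
def add_tokens(self, new_tokens, special_tokens=False):
    if not isinstance(new_tokens, (list, tuple)):
        new_tokens = [new_tokens]
    if special_tokens:
        # Record new special tokens at insertion time so that
        # all_special_tokens / special_tokens_map_extended reflect
        # them without a recompute, skipping duplicates.
        self._additional_special_tokens.extend(
            t for t in new_tokens if t not in self._additional_special_tokens
        )
    # Delegate the actual vocabulary update to the existing machinery.
    return self._add_tokens(new_tokens, special_tokens=special_tokens)
```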
I think it should only be done for `tokenizer_utils` (not the fast tokenizer)!
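For context, the slow Python tokenizer (`tokenization_utils.py`) and the Rust-backed fast tokenizer are separate classes, so the change can be scoped to the slow path. A quick way to check which one you have (a sketch, using the same hypothetical checkpoint as above):

```python
from transformers import AutoTokenizer

# use_fast=False loads the slow Python implementation (requires
# sentencepiece for Phi-3); use_fast=True loads the Rust-backed one.
slow = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct", use_fast=False)
fast = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct", use_fast=True)
print(slow.is_fast, fast.is_fast)  # False True
```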
Hi, sorry, I didn't quite get what else should be done. Could you describe it in more detail so I can understand and update the PR?
Overall we should:
- make sure they are added to `special_tokens_map_extended` when we add tokens, no? (This can make a difference in terms of performance for models like Whisper.)

So instead of computing the set of special added tokens every time you call `all_special_tokens_extended`, we ought to add the tokens to the list when we call `add_tokens`. It's alright if we don't, but it's important to measure the performance impact, as this can affect other applications; see the sketch below.
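To illustrate the performance concern, here is a schematic comparison (hypothetical classes, not the actual transformers internals): recomputing the special-token set on every access costs O(n) per call, while maintaining it incrementally at insertion time keeps each lookup O(1).

```python
# Schematic illustration of the performance concern (hypothetical
# classes, not the actual transformers implementation).

class RecomputingTokenizer:
    def __init__(self):
        self.added_tokens = {}  # token -> metadata, e.g. {"special": True}

    @property
    def special_tokens_set(self):
        # Rebuilt on every access: O(n) per call, which adds up when a
        # model like Whisper checks membership for every decoded token.
        return {t for t, meta in self.added_tokens.items() if meta["special"]}


class IncrementalTokenizer:
    def __init__(self):
        self.added_tokens = {}
        self._special_tokens_set = set()  # maintained as tokens are added

    def add_token(self, token, special=False):
        self.added_tokens[token] = {"special": special}
        if special:
            # Updated once at insertion time; lookups stay O(1).
            self._special_tokens_set.add(token)

    @property
    def special_tokens_set(self):
        return self._special_tokens_set
```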