
Fix all_special_tokens to have special tokens from added_tokens

Open · ita9naiwa opened this pull request 1 year ago • 7 comments

What does this PR do?

For Phi-3 models, some important special tokens (such as the chat-template token <|assistant|>) are not in Tokenizer.all_special_tokens,

but they are present in the added vocabulary.

This PR adds those special tokens from the added vocabulary to the tokenizer's special tokens.

If I ran it correctly, the Rust tokenizer implementation now finds the special tokens as expected:

eprintln!("{:?}", self.special_tokens_set.iter().collect::<Vec<_>>());

> ["<|placeholder15|>", "<|endoftext|>", "<|placeholder13|>", "<|placeholder16|>", "<|placeholder11|>", "<|placeholder10|>", "<|system|>", "<|placeholder18|>", "<|placeholder23|>", "<|placeholder36|>", "<|image|>", "<|placeholder28|>", "<|placeholder29|>", "<|user|>", "<|placeholder3|>", "<|placeholder7|>", "<|placeholder20|>", "<unk>", "<|placeholder24|>", "<|placeholder25|>", "<|placeholder26|>", "<|placeholder1|>", "<|placeholder33|>", "<|placeholder37|>", "<|placeholder12|>", "<|placeholder32|>", "<|placeholder39|>", "<|placeholder27|>", "<|placeholder5|>", "<s>", "<|assistant|>", "<|placeholder38|>", "<|end|>", "<|placeholder14|>", "<|placeholder4|>", "<|placeholder9|>", "<|placeholder2|>", "<|placeholder22|>", "<|placeholder31|>", "<|placeholder34|>", "<|placeholder21|>", "<|placeholder8|>", "<|placeholder6|>", "<|placeholder17|>", "<|placeholder19|>", "<|placeholder30|>", "<|placeholder35|>"]
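The mismatch described above can be sketched with a minimal toy model (the class and attribute names here are simplified stand-ins, not the actual transformers implementation): tokens registered as special via the added vocabulary are invisible to `all_special_tokens` unless they are also recorded in the tokenizer's own special-tokens list.

```python
# Minimal sketch (hypothetical names) of the reported bug: a special token
# that lives only in the added vocabulary does not show up in
# all_special_tokens.
from dataclasses import dataclass, field


@dataclass
class AddedToken:
    content: str
    special: bool = False


@dataclass
class ToyTokenizer:
    # explicitly configured special tokens (bos/eos/unk, etc.)
    _additional_special_tokens: list = field(default_factory=list)
    # tokens added afterwards, e.g. chat-template tokens
    added_tokens: list = field(default_factory=list)

    @property
    def all_special_tokens(self):
        # only reflects the explicitly configured list
        return list(self._additional_special_tokens)

    def get_added_vocab(self):
        return {t.content for t in self.added_tokens}


tok = ToyTokenizer(
    _additional_special_tokens=["<unk>", "<s>", "</s>"],
    added_tokens=[
        AddedToken("<|assistant|>", special=True),
        AddedToken("<|user|>", special=True),
    ],
)

print("<|assistant|>" in tok.all_special_tokens)  # False: the bug
print("<|assistant|>" in tok.get_added_vocab())   # True
```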

Before submitting

  • [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

ita9naiwa avatar Aug 14 '24 16:08 ita9naiwa

Not sure whether this is intended behavior, though.

CI errors: I'm seeing unexpected failures in code I didn't touch.

ita9naiwa avatar Aug 14 '24 16:08 ita9naiwa

cc @ArthurZucker @itazap

amyeroberts avatar Aug 15 '24 11:08 amyeroberts

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Hi, any updates? @itazap

ita9naiwa avatar Aug 26 '24 06:08 ita9naiwa

Awesome, thanks @ita9naiwa ! @ArthurZucker good to merge? 🚀

itazap avatar Aug 26 '24 11:08 itazap

Thanks @ArthurZucker ! Something like

self._additional_special_tokens.extend(new_tokens)

in the SpecialTokensMixin.add_tokens function?
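The suggestion above could look roughly like the following toy sketch (simplified, hypothetical names; not the real SpecialTokensMixin): when tokens are added as special, also extend `_additional_special_tokens` so that `all_special_tokens` reflects them.

```python
# Hedged sketch of the proposed fix: extend _additional_special_tokens
# inside add_tokens when the new tokens are marked as special.
class ToySpecialTokensMixin:
    def __init__(self):
        self._additional_special_tokens = []
        self.added_tokens = []

    def add_tokens(self, new_tokens, special_tokens=False):
        self.added_tokens.extend(new_tokens)
        if special_tokens:
            # the suggested one-liner, with a duplicate guard
            self._additional_special_tokens.extend(
                t for t in new_tokens if t not in self._additional_special_tokens
            )
        return len(new_tokens)

    @property
    def all_special_tokens(self):
        return list(self._additional_special_tokens)


tok = ToySpecialTokensMixin()
tok.add_tokens(["<|assistant|>", "<|user|>"], special_tokens=True)
print(tok.all_special_tokens)  # ['<|assistant|>', '<|user|>']
```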

itazap avatar Aug 27 '24 15:08 itazap

I think it should only be done for tokenizer_utils (not fast)!

ArthurZucker avatar Aug 28 '24 08:08 ArthurZucker

Hi, sorry, I didn't quite get what else needs to be done. Could you describe it in more detail so I can understand and update the PR?

ita9naiwa avatar Aug 28 '24 13:08 ita9naiwa

Overall we should:

Make sure they are added to special_tokens_map_extended when we add tokens, no? (This can make a difference in terms of performance for models like Whisper.)

So instead of computing the set of special added tokens every time you call special tokens extended, we ought to add the tokens to the list when we do add_token. It's alright if we don't, but it's important to measure the performance, as this can affect other applications.
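The performance concern can be illustrated with a small sketch (hypothetical names, not the actual transformers code): maintain the special-tokens set incrementally at add_token time, instead of rebuilding it on every access.

```python
# Sketch of the suggested approach: update a cached set once per add_token
# rather than recomputing the set on every property access.
class CachingToyTokenizer:
    def __init__(self, special_tokens):
        self.added = list(special_tokens)
        # built once up front, then maintained incrementally
        self._special_tokens_set = set(special_tokens)

    def add_token(self, token):
        self.added.append(token)
        self._special_tokens_set.add(token)  # O(1) update at add time

    @property
    def special_tokens_set(self):
        # a recomputing variant would instead do: return set(self.added)
        # here, every call on a hot path (e.g. Whisper decoding)
        return self._special_tokens_set


tok = CachingToyTokenizer(["<s>", "</s>"])
tok.add_token("<|assistant|>")
print(sorted(tok.special_tokens_set))  # ['</s>', '<s>', '<|assistant|>']
```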

ArthurZucker avatar Aug 29 '24 09:08 ArthurZucker