Fix convert_tekken_tokenizer
What does this PR do?
Right now `convert_tekken_tokenizer` does not add `bos_token` and `eos_token` to the special tokens via the `add_special_tokens` method.
This prevents chat templates that expect `eos_token` and `bos_token` from working properly.
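For illustration, a minimal sketch of the failure mode (the local path is hypothetical, and the exact output depends on the chat template):

```python
from transformers import AutoTokenizer

# Hypothetical path to a tokenizer produced by convert_tekken_tokenizer
# before this fix.
tokenizer = AutoTokenizer.from_pretrained("./converted_tekken_tokenizer")

# bos_token/eos_token were never registered as special tokens, so they
# resolve to None:
print(tokenizer.bos_token, tokenizer.eos_token)  # None None

# A chat template that interpolates eos_token then renders "None" (or
# raises) instead of emitting the real end-of-sequence token:
messages = [{"role": "user", "content": "Hello!"}]
print(tokenizer.apply_chat_template(messages, tokenize=False))
```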
Previously this worked because saving the tokenizer created a `special_tokens_map.json`, which is no longer the case. I'm not sure why, but I'd assume this is due to the V5 refactoring?
This PR fixes that by explicitly adding these tokens to the tokenizer; when saving, they are now stored in `tokenizer_config.json`.
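In spirit, the change amounts to the following sketch (not the exact diff; the path and token strings shown are illustrative, the real ones come from the Tekken vocab):

```python
from transformers import AutoTokenizer

# Hypothetical path, same converted tokenizer as above.
tokenizer = AutoTokenizer.from_pretrained("./converted_tekken_tokenizer")

# Register the special tokens explicitly so the tokenizer knows about them.
tokenizer.add_special_tokens({"bos_token": "<s>", "eos_token": "</s>"})

# On save, they now land in tokenizer_config.json (rather than the old
# special_tokens_map.json), so chat templates can resolve them again.
tokenizer.save_pretrained("./converted_tekken_tokenizer")
```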
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
@ArthurZucker
cc @itazap
run-slow: ministral3, mistral3
This comment contains run-slow, running the specified jobs:
models: ["models/ministral3", "models/mistral3"]
quantizations: []
Hey! Thanks for the PR. Can you please share a short reproducer of the problem you mentioned with chat templates? Perhaps we'll need to add a test!
[For maintainers] Suggested jobs to run (before merge)
run-slow: ministral3, mistral3
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.