Ita Zaporozhets
# What does this PR do?

Fixes #30685 #28648

## Before submitting
- [ ] make sure this is saved and used not only as kwargs but also as the attribute...
# What does this PR do?

Fixes # (issue)

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks...
Use the existing TikTokenConverter to convert a tiktoken `tokenizer.model` file. Sample usage:

```
model_file_name = 'tokenizer.model'
tokenizer = AutoTokenizer.from_pretrained(
    'hf-internal-testing/Llama3-Instruct-Internal',
    tiktoken_file=model_file_name,
    from_slow=True,
)
```

- [x] add case to convert_tiktoken_tokenizer
- [x] add internal...
Fixes #30824 #30947

## Tasks
- [ ] fix converter to handle user_defined_symbols
- [ ] create necessary flags for user_defined_symbols
- [ ] update docs
- [ ] test...
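The converter fix above could look roughly like the following. This is a minimal sketch with a hypothetical helper name (`add_user_defined_symbols` is illustrative, not the actual converter code): sentencepiece `user_defined_symbols` that are missing from the converted vocab get appended as added tokens.

```python
def add_user_defined_symbols(vocab, user_defined_symbols):
    """Sketch: append sentencepiece user_defined_symbols to the vocab.

    vocab: dict mapping token string -> id.
    Returns the list of symbols that were actually added.
    """
    added = []
    for sym in user_defined_symbols:
        if sym not in vocab:
            # assign the next free id, matching how added tokens extend a vocab
            vocab[sym] = len(vocab)
            added.append(sym)
    return added
```

Symbols already present in the vocab are skipped so re-running the conversion stays idempotent.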
Fix for 2 issues:

1. `add_bos_token` & `add_eos_token` flags ignored for `PreTrainedTokenizerFast`: issue discussed [here](https://huggingface.co/meta-llama/Meta-Llama-3-8B/discussions/140) and [here](https://github.com/huggingface/transformers/issues/30947#issuecomment-2128057992)
2. `add_special_tokens` does not update `bos_token` or `eos_token`, e.g. `.add_special_tokens({'bos_token': ''})`

TASKS:...
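The intended semantics of the two flags in issue 1 can be sketched as follows (a standalone illustration of the expected behavior, not the actual `PreTrainedTokenizerFast` internals):

```python
def build_inputs_with_special_tokens(ids, bos_id, eos_id,
                                     add_bos_token=True, add_eos_token=False):
    """Sketch: honor add_bos_token / add_eos_token when building model inputs.

    The reported bug is that the fast tokenizer ignores these flags and
    always applies its post-processor's fixed template.
    """
    out = list(ids)
    if add_bos_token and bos_id is not None:
        out = [bos_id] + out
    if add_eos_token and eos_id is not None:
        out = out + [eos_id]
    return out
```

With `add_bos_token=False` the BOS id must not be prepended, which is what users of the fast Llama 3 tokenizer were not seeing.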
We already support loading a fast tokenizer from a `tokenizer.model` file alone. However, we still require `config.json` to exist in the model folder (on the Hub or locally), even...
Current status for AutoTokenizer with fast=True:

1. checks tokenizer_config.json whether the tokenizer_class name ends with Fast
2. if not, loads a slow tokenizer

(This PR): (unchanged)

1. checks tokenizer_config.json if tokenizer_class ...
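The current selection logic above can be sketched as a small function (illustrative only; the real `AutoTokenizer` resolution involves mappings and fallbacks beyond this):

```python
def resolve_tokenizer_class(tokenizer_config, use_fast=True):
    """Sketch: pick a tokenizer class name from tokenizer_config.json contents.

    If use_fast is set and the configured class name ends with "Fast",
    use it; otherwise fall back to loading the (slow) class as configured.
    """
    cls = tokenizer_config.get("tokenizer_class", "")
    if use_fast and cls.endswith("Fast"):
        return cls
    # step 2 above: not a Fast class, so the slow tokenizer is loaded
    return cls
```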
blt wip
WIP of blt integration. Current state:
- refactored to run on CPU! 🤗
- not restyled to transformers yet