Getting issues with the tokenizer: unable to load the tokenizer using AutoTokenizer.from_pretrained(). Errors:
tokenizer = AutoTokenizer.from_pretrained(model_id)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 862, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
    return cls._from_pretrained(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 120, in __init__
    raise ValueError(
ValueError: Couldn't instantiate the backend tokenizer from one of:
(1) a tokenizers library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.
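The last line of the error names the usual cause: the checkpoint ships only a slow (SentencePiece-based) tokenizer, and transformers needs the sentencepiece package to convert it to a fast one. A minimal sketch of the retry, assuming the missing package is the only problem (the model id is a placeholder):

```python
# Run in a shell first:
#   pip install sentencepiece
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder; use your checkpoint

# With sentencepiece available, from_pretrained can convert a slow
# tokenizer to a fast one instead of raising the ValueError above.
tokenizer = AutoTokenizer.from_pretrained(model_id)
```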
+++++++++++++++++++++++++++++++++++
config.json: 100%|████████| 654/654 [00:00<00:00, 6.03MB/s]
special_tokens_map.json: 100%|████████| 73.0/73.0 [00:00<00:00, 797kB/s]
tokenizer_config.json: 100%|████████| 51.0k/51.0k [00:00<00:00, 55.3MB/s]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'.
The class this function is called from is 'LlamaTokenizer'.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
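The class-mismatch warning above means the checkpoint's tokenizer_config.json records PreTrainedTokenizerFast, not LlamaTokenizer. One way to confirm which class a checkpoint expects, sketched with huggingface_hub (the model id is a placeholder):

```python
import json
from huggingface_hub import hf_hub_download

# Download just the tokenizer config and inspect the recorded class.
cfg_path = hf_hub_download("meta-llama/Meta-Llama-3-8B", "tokenizer_config.json")
with open(cfg_path) as f:
    print(json.load(f).get("tokenizer_class"))  # e.g. 'PreTrainedTokenizerFast'
```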
Traceback (most recent call last):
  File "/home/ubuntu/llama3-8b-base.py", line 28, in <module>
Is the problem solved? I also encountered this problem.
@liu904-61 @Anushagudipati can you please upgrade to the latest transformers, 4.40.1? It should include the latest fixes.
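If it helps, a quick way to confirm the upgrade actually took effect in the environment the script runs in (the version pin follows the suggestion above):

```python
# After running `pip install -U "transformers==4.40.1"` in the same venv:
import transformers
print(transformers.__version__)  # should print 4.40.1 (or newer)
```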
Hey @HamidShojanazeri, I am having the same issue after upgrading transformers to 4.40.1.
I also encountered this problem. Is the problem solved?
You need to change your function: the function I used in the .py script was LlamaTokenizer.from_pretrained(), and you just need to change it to AutoTokenizer.from_pretrained().
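A minimal before/after sketch of that change, assuming a Llama 3 checkpoint (the model id is a placeholder):

```python
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder; use the checkpoint you load

# Before (triggers the class-mismatch warning above):
#   from transformers import LlamaTokenizer
#   tokenizer = LlamaTokenizer.from_pretrained(model_id)

# After: AutoTokenizer resolves the class from tokenizer_config.json.
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(tokenizer("hello world").input_ids)
```

AutoTokenizer works here because it defers to the tokenizer_class recorded in the checkpoint, so it never tries to instantiate the SentencePiece-based LlamaTokenizer.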