
getting issues with tokenizer

Open Anushagudipati opened this issue 1 year ago • 5 comments

Unable to load the tokenizer using AutoTokenizer.from_pretrained().

Errors:

```
tokenizer = AutoTokenizer.from_pretrained(model_id)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 862, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
    return cls._from_pretrained(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 120, in __init__
    raise ValueError(
ValueError: Couldn't instantiate the backend tokenizer from one of:
(1) a `tokenizers` library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.
```

+++++++++++++++++++++++++++++++++++

```
config.json: 100%|██████████| 654/654 [00:00<00:00, 6.03MB/s]
special_tokens_map.json: 100%|██████████| 73.0/73.0 [00:00<00:00, 797kB/s]
tokenizer_config.json: 100%|██████████| 51.0k/51.0k [00:00<00:00, 55.3MB/s]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. The class this function is called from is 'LlamaTokenizer'.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Traceback (most recent call last):
  File "/home/ubuntu/llama3-8b-base.py", line 28, in <module>
    tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 843, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2048, in from_pretrained
    return cls._from_pretrained(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2082, in _from_pretrained
    slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2287, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 182, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 209, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
```
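For a Llama 3 checkpoint, installing sentencepiece alone may not help: the repo ships only the fast `tokenizer.json` and no sentencepiece `tokenizer.model`, so any code path that falls back to the slow `LlamaTokenizer` ends up handing `vocab_file=None` to sentencepiece. A minimal sketch of that failure mode (the `load_spm` helper here is hypothetical, standing in for transformers' `get_spm_processor`):

```python
def load_spm(vocab_file):
    # sentencepiece's SentencePieceProcessor.LoadFromFile requires a filesystem
    # path; LlamaTokenizer passes self.vocab_file straight through, and for a
    # Llama 3 checkpoint (which has no tokenizer.model) that value is None.
    if not isinstance(vocab_file, str):
        raise TypeError("not a string")
    return vocab_file

try:
    load_spm(None)  # what the slow-tokenizer path effectively does here
except TypeError as exc:
    print(exc)  # prints: not a string
```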

Anushagudipati avatar Apr 22 '24 12:04 Anushagudipati

Is the problem solved? I also encountered this problem.

liu904-61 avatar Apr 24 '24 09:04 liu904-61

@liu904-61 @Anushagudipati can you please upgrade to the latest transformers, 4.40.1? That release should include the fix.

HamidShojanazeri avatar Apr 24 '24 17:04 HamidShojanazeri

Hey @HamidShojanazeri, I am having the same issue after upgrading transformers to 4.40.1.

EmilyInTheUS avatar Apr 26 '24 19:04 EmilyInTheUS

I also encountered this problem. Has it been solved?

Xiaoyinggit avatar May 06 '24 04:05 Xiaoyinggit

You need to change the loading function. In my .py script I was using LlamaTokenizer.from_pretrained(); changing it to AutoTokenizer.from_pretrained() fixed it.
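That change can be sketched as follows (a minimal example, assuming a recent transformers release; the `load_llama3_tokenizer` name is my own, and `model_id` is a placeholder for your checkpoint path or Hub id):

```python
from transformers import AutoTokenizer

def load_llama3_tokenizer(model_id: str):
    # AutoTokenizer reads tokenizer_config.json and instantiates the class the
    # checkpoint actually declares (PreTrainedTokenizerFast for Llama 3), so it
    # never enters the sentencepiece-based slow LlamaTokenizer path that raises
    # "TypeError: not a string".
    return AutoTokenizer.from_pretrained(model_id)
```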

xieziyi881 avatar Jun 06 '24 03:06 xieziyi881