ChatGLM-6B icon indicating copy to clipboard operation
ChatGLM-6B copied to clipboard

google.protobuf.message.DecodeError: Error parsing message with type 'sentencepiece.ModelProto'

Open WangChenxu21 opened this issue 2 years ago • 9 comments

raise "google.protobuf.message.DecodeError: Error parsing message with type 'sentencepiece.ModelProto'"

protobuf==3.20.0

WangChenxu21 avatar Mar 14 '23 11:03 WangChenxu21

Could you please give the full error log. Specifically, I need to know which line this error occurs on.

duzx16 avatar Mar 14 '23 12:03 duzx16

Alright.

Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision. Traceback (most recent call last): File "web_demo.py", line 4, in tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True) File "/home/users/chenxu.wang/anaconda3/envs/hdflow/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 658, in from_pretrained return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) File "/home/users/chenxu.wang/anaconda3/envs/hdflow/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1804, in from_pretrained return cls._from_pretrained( File "/home/users/chenxu.wang/anaconda3/envs/hdflow/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1959, in _from_pretrained tokenizer = cls(*init_inputs, **init_kwargs) File "/home/users/chenxu.wang/.cache/huggingface/modules/transformers_modules/local/tokenization_chatglm.py", line 214, in init self.sp_tokenizer = SPTokenizer(vocab_file) File "/home/users/chenxu.wang/.cache/huggingface/modules/transformers_modules/local/tokenization_chatglm.py", line 35, in init self.text_tokenizer = self._build_text_tokenizer(encode_special_tokens=False) File "/home/users/chenxu.wang/.cache/huggingface/modules/transformers_modules/local/tokenization_chatglm.py", line 67, in _build_text_tokenizer tokenizer = TextTokenizer(self.vocab_file) File "/home/users/chenxu.wang/anaconda3/envs/hdflow/lib/python3.8/site-packages/icetk/text_tokenizer.py", line 26, in init self.proto.ParseFromString(proto_str) google.protobuf.message.DecodeError: Error parsing message with type 'sentencepiece.ModelProto'

WangChenxu21 avatar Mar 14 '23 12:03 WangChenxu21

Alright.

Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision. Traceback (most recent call last): File "web_demo.py", line 4, in tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True) File "/home/users/chenxu.wang/anaconda3/envs/hdflow/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 658, in from_pretrained return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) File "/home/users/chenxu.wang/anaconda3/envs/hdflow/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1804, in from_pretrained return cls._from_pretrained( File "/home/users/chenxu.wang/anaconda3/envs/hdflow/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1959, in _from_pretrained tokenizer = cls(*init_inputs, **init_kwargs) File "/home/users/chenxu.wang/.cache/huggingface/modules/transformers_modules/local/tokenization_chatglm.py", line 214, in init self.sp_tokenizer = SPTokenizer(vocab_file) File "/home/users/chenxu.wang/.cache/huggingface/modules/transformers_modules/local/tokenization_chatglm.py", line 35, in init self.text_tokenizer = self._build_text_tokenizer(encode_special_tokens=False) File "/home/users/chenxu.wang/.cache/huggingface/modules/transformers_modules/local/tokenization_chatglm.py", line 67, in _build_text_tokenizer tokenizer = TextTokenizer(self.vocab_file) File "/home/users/chenxu.wang/anaconda3/envs/hdflow/lib/python3.8/site-packages/icetk/text_tokenizer.py", line 26, in init self.proto.ParseFromString(proto_str) google.protobuf.message.DecodeError: Error parsing message with type 'sentencepiece.ModelProto'

Are you sure you correctly downloaded the ice_text.model file? The correct file size should be about 2.7MB.

duzx16 avatar Mar 14 '23 12:03 duzx16

I have the same error. I have downloaded model files from huggingface, model files as followed

image

jamestch avatar Mar 15 '23 00:03 jamestch

I think it's correct.

image

WangChenxu21 avatar Mar 15 '23 07:03 WangChenxu21

I have the same error. I have downloaded model files from huggingface, model files as followed

image

@jamestch You also need to download the ice_text.model file.

duzx16 avatar Mar 15 '23 07:03 duzx16

You need to install git lfs to download large models from hf.

OedoSoldier avatar Mar 15 '23 10:03 OedoSoldier

You need to install git lfs to download large models from hf. image

I have installed git lfs, model files showed above is downloaded from huggingface by command git clone https://huggingface.co/THUDM/chatglm-6b

jamestch avatar Mar 15 '23 13:03 jamestch

I have the same error. I have downloaded model files from huggingface, model files as followed image

@jamestch You also need to download the ice_text.model file.

I have downloaded the ice_text.model file correctly since the SHA256 is right. But it did not work

jamestch avatar Mar 15 '23 13:03 jamestch

I met the same error when loading the pre-trained LLaMA-2-7b model using the function AutoTokenizer.from_pretrained(). After setting use_fast=True, this error was fixed.

ShunLu91 avatar Apr 19 '24 03:04 ShunLu91

Finally, I re-downloaded the tokenizer-relevant files and solved this issue. Take the LLaMA-2-7b-hf as an example, update the files as below:

  • special_tokens_map.json
  • tokenizer_config.json
  • tokenizer.json
  • tokenizer.model

ShunLu91 avatar Apr 19 '24 10:04 ShunLu91