ChatGLM2-6B [BUG/Help] IndexError: piece id is out of range.

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

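On a Mac M1 Max machine with 64 GB of memory, I tried to deploy the model and web_demo.py starts successfully.

However, any input, even just "你好" (hello), causes the following error in the backend: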

  File "/Users/tommy/.cache/huggingface/modules/transformers_modules/chatglm2-6b/tokenization_chatglm.py", line 60, in convert_id_to_token
    return self.sp_model.IdToPiece(index)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1045, in _batched_func
    return _func(self, arg)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1038, in _func
    raise IndexError('piece id is out of range.')
IndexError: piece id is out of range.

Expected Behavior

No response

Steps To Reproduce

python3 web_demo.py

Environment

- OS: macOS
- Python: 3.10
- Transformers: 
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

No response

Jun 28 '23 03:06 lesstommy

I have the same issue.

macOS 14 Beta, Python 3.8

  File "/Users/*/ChatGLM2-6B/venv/lib/python3.8/site-packages/sentencepiece/__init__.py", line 1038, in _func
    raise IndexError('piece id is out of range.')
IndexError: piece id is out of range.

Jun 28 '23 08:06 yangzhou6666

I hit a similar error because my .bin files had not downloaded correctly. You can check their sha256sum against the values shown on Hugging Face.
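The checksum comparison suggested above can be sketched as follows. This is a minimal example that uses only the Python standard library; the helper names `sha256_of_file` and `verify_checkpoint` are hypothetical, and the expected digest has to be copied by hand from the model's file listing on Hugging Face.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Chunked reads keep memory use flat even for multi-GB checkpoint shards.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected_sha256):
    """Compare a local file's digest with a digest copied from the model page."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

Calling `verify_checkpoint` once per downloaded shard and getting `False` for any of them would point to a corrupted or incomplete download.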

Jun 28 '23 09:06 hzwer

@hzwer Hi, could you share the MD5 checksum and the download link?

Jun 28 '23 09:06 yangzhou6666

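I hit a similar error because my .bin files had not downloaded correctly. You can check their sha256sum against the values shown on Hugging Face.

I have already re-downloaded the files three times, both manually from the web page and through the hub's snapshot download, and I still get this error.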

Jun 28 '23 10:06 lesstommy

This is probably unrelated to the model weights and may be related to the tokenizer. I ran into it as well and have not solved it yet.

Jun 29 '23 10:06 elven2016

@hzwer Hi, could you share the MD5 checksum and the download link?

https://huggingface.co/THUDM/chatglm2-6b/tree/main You can see the sha256sum values from this page.

Jun 30 '23 03:06 hzwer

@hzwer Hi, could you share the MD5 checksum and the download link? https://huggingface.co/THUDM/chatglm2-6b/tree/main You can see the sha256sum values from this page.

Same problem here, even after switching to the CPU branch.

Jun 30 '23 14:06 everydoc

I fixed this problem by NOT changing the line below:

    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

I assume some of us misread the README, since it asks us to change the line below in order to use a local model:

    model = AutoModel.from_pretrained("THUDM/chatglm2-6b",trust_remote_code=True).cuda()

Changing the tokenizer line as well was me being too clever for my own good ("自作聪明").

Jul 01 '23 15:07 everydoc

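Same here, with the model downloaded from HF. The error is raised during decoding. The tokenizer's vocab size is 64789, but the token id I generated is 64881, which is out of range. I also checked and the classification head of ChatGLM2 has size 65024. It looks like the tokenizer does not match the model?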
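The arithmetic behind this observation can be sketched as below. The two sizes and the id 64881 are the values quoted in this comment and are not re-measured here; the helper `out_of_range_ids` and the other ids in `generated` are hypothetical, chosen only for illustration.

```python
def out_of_range_ids(token_ids, tokenizer_vocab_size):
    """Return the ids that the tokenizer cannot map back to a piece."""
    return [i for i in token_ids if i < 0 or i >= tokenizer_vocab_size]


# Sizes as reported in this thread.
TOKENIZER_VOCAB_SIZE = 64789  # pieces known to the tokenizer
OUTPUT_HEAD_SIZE = 65024      # size of the model's classification head

# Ids in [TOKENIZER_VOCAB_SIZE, OUTPUT_HEAD_SIZE) can be produced by the model
# but have no corresponding piece in the tokenizer.
unmapped_slots = OUTPUT_HEAD_SIZE - TOKENIZER_VOCAB_SIZE

# Hypothetical generated sequence; 64881 is the id reported above.
generated = [30910, 64881, 2]
bad = out_of_range_ids(generated, TOKENIZER_VOCAB_SIZE)
print(unmapped_slots, bad)  # prints: 235 [64881]
```

Under these numbers there are 235 output slots with no matching piece, so any generated id that falls in that gap would trigger the IndexError during decoding.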

Jul 02 '23 15:07 moon-fall

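Then I am not sure. I originally got this error because I had changed the line that loads the tokenizer. After reverting it and changing only the line that loads the model from a local path, the error went away. When running the demo I noticed that it downloads the tokenizer remotely on its own.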

Jul 02 '23 17:07 everydoc

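Inference works fine with the original model, but a model produced by full-parameter fine-tuning raises the same error at inference time. I checked that every json and py file is identical to the original model's. What could be the cause?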

Jul 27 '23 07:07 michael0905

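Same here, with the model downloaded from HF. The error is raised during decoding. The tokenizer's vocab size is 64789, but the token id I generated is 64881, which is out of range. I also checked and the classification head of ChatGLM2 has size 65024. It looks like the tokenizer does not match the model?

I see exactly the same thing. How did you solve this in the end?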

Aug 10 '23 03:08 ysanimals

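Same here, with the model downloaded from HF. The error is raised during decoding. The tokenizer's vocab size is 64789, but the token id I generated is 64881, which is out of range. I also checked and the classification head of ChatGLM2 has size 65024. It looks like the tokenizer does not match the model?

You are right, the numbers really do not match. After fine-tuning, the model predicts some indices beyond vocab_size. I do not know why the official vocab_size and the classification head size differ. For now, the quick and dirty workaround is to tweak the SPTokenizer slightly so that any out-of-range index simply returns an empty string: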

    def convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        # The original condition had no upper-bound check; the added
        # `index >= self.n_words` clause maps out-of-range ids to "" instead
        # of letting IdToPiece raise IndexError.
        if index in self.index_special_tokens or index in [self.eos_id, self.bos_id, self.pad_id] or index < 0 or index >= self.n_words:
            return ""
        return self.sp_model.IdToPiece(index)
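
The effect of this guard can be tried without loading the model. The sketch below is a toy reproduction, not the real tokenizer: `FakeSentencePiece` and `GuardedTokenizer` are hypothetical stand-ins, and the three-piece vocabulary is made up for illustration.

```python
class FakeSentencePiece:
    """Stand-in for a SentencePiece model that rejects out-of-range ids."""

    def __init__(self, pieces):
        self.pieces = pieces

    def IdToPiece(self, index):
        if index < 0 or index >= len(self.pieces):
            raise IndexError("piece id is out of range.")
        return self.pieces[index]


class GuardedTokenizer:
    """Toy tokenizer applying the same upper-bound guard as the patch above."""

    def __init__(self, sp_model, n_words, special_ids=()):
        self.sp_model = sp_model
        self.n_words = n_words
        self.special_ids = set(special_ids)

    def convert_id_to_token(self, index):
        # Special, negative, and out-of-range ids all decode to an empty string.
        if index in self.special_ids or index < 0 or index >= self.n_words:
            return ""
        return self.sp_model.IdToPiece(index)


sp = FakeSentencePiece(["▁hello", "▁world", "!"])
tok = GuardedTokenizer(sp, n_words=3)
# Id 7 is out of range and is silently dropped rather than raising.
decoded = "".join(tok.convert_id_to_token(i) for i in [0, 1, 7, 2])
```

Note that this workaround hides the mismatch rather than fixing it: out-of-range ids are dropped from the output, so some generated tokens may be lost without any warning.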

Sep 15 '23 06:09 zouweidong91

After LoRA fine-tuning, inference hits the same "IndexError: piece id is out of range." problem.

Oct 23 '23 08:10 enddlesswm
