ChatGLM2-6B

[BUG/Help] IndexError: piece id is out of range.

Open lesstommy opened this issue 1 year ago • 11 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

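On a Mac with an M1 Max and 64 GB of RAM, web_demo.py starts successfully when I try to deploy.

But any input at all, even just "你好" ("hello"), makes the backend raise this error: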

  File "/Users/tommy/.cache/huggingface/modules/transformers_modules/chatglm2-6b/tokenization_chatglm.py", line 60, in convert_id_to_token
    return self.sp_model.IdToPiece(index)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1045, in _batched_func
    return _func(self, arg)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1038, in _func
    raise IndexError('piece id is out of range.')
IndexError: piece id is out of range.

Expected Behavior

No response

Steps To Reproduce

python3 web_demo.py

Environment

- OS: macOS
- Python: 3.10
- Transformers: 
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

No response

lesstommy avatar Jun 28 '23 03:06 lesstommy

I have the same issue.

macOS 14 Beta, Python 3.8

  File "/Users/*/ChatGLM2-6B/venv/lib/python3.8/site-packages/sentencepiece/__init__.py", line 1038, in _func
    raise IndexError('piece id is out of range.')
IndexError: piece id is out of range.

yangzhou6666 avatar Jun 28 '23 08:06 yangzhou6666

I hit a similar error because my .bin files had not downloaded correctly. You can verify the sha256sum against the values on Hugging Face.
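A minimal sketch of the checksum comparison suggested above, using only the Python standard library. The helper names `sha256_of_file` and `find_mismatched_files` are illustrative, and the expected digests are an assumption: they have to be copied by hand from the file pages of the Hugging Face repository.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatched_files(model_dir, expected_digests):
    """Return the names of files that are missing or whose digest differs.

    expected_digests maps a file name to the hex digest shown on the
    Hugging Face file page (hypothetical input, filled in by hand).
    """
    mismatched = []
    for name, expected in expected_digests.items():
        file_path = Path(model_dir) / name
        if not file_path.is_file() or sha256_of_file(file_path) != expected.lower():
            mismatched.append(name)
    return mismatched
```

An empty list from `find_mismatched_files` means every listed file matches its published digest. As later replies show, matching checksums rule out a corrupted download but do not rule out a tokenizer mismatch.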

hzwer avatar Jun 28 '23 09:06 hzwer

@hzwer Hi, could you share the MD5 checksum and the download link?

yangzhou6666 avatar Jun 28 '23 09:06 yangzhou6666

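> I hit a similar error because my .bin files had not downloaded correctly. You can verify the sha256sum against the values on Hugging Face.

I have already re-downloaded the files 3 times, both manually from the web page and with the hub's snapshot download, and I still get this error.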

lesstommy avatar Jun 28 '23 10:06 lesstommy

This is probably unrelated to the model weights and may be related to the tokenizer. I ran into it too and have not solved it yet.

elven2016 avatar Jun 29 '23 10:06 elven2016

> @hzwer Hi, could you share the MD5 checksum and the download link?

https://huggingface.co/THUDM/chatglm2-6b/tree/main The sha256sum for each file is shown there.

hzwer avatar Jun 30 '23 03:06 hzwer

> @hzwer Hi, could you share the MD5 checksum and the download link? https://huggingface.co/THUDM/chatglm2-6b/tree/main The sha256sum for each file is shown there.

Same problem here, even if I switch to the CPU branch.

everydoc avatar Jun 30 '23 14:06 everydoc

I fixed this problem by NOT changing the line below:

    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

I assume some people misread the README here, since we were only asked to change the line below to use a local model:

    model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).cuda()

Changing the tokenizer line as well is quite "自作聪明" (being too clever for one's own good).

everydoc avatar Jul 01 '23 15:07 everydoc

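Same here. With the model downloaded from HF, the error is raised while decoding. I see that the tokenizer's vocab size is 64789, but a token generated on my side has id 64881, which is out of range. I also checked that the size of ChatGLM2's classification head (output layer) is 65024. It looks like the tokenizer does not match?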
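A small sketch of the mismatch described above, in plain Python. The three sizes are the ones reported in this comment. The helper `find_out_of_range_ids` and the ids other than 64881 are hypothetical, added only for illustration.

```python
def find_out_of_range_ids(token_ids, vocab_size):
    """Return (position, id) pairs for ids that fall outside [0, vocab_size)."""
    return [
        (position, token_id)
        for position, token_id in enumerate(token_ids)
        if token_id < 0 or token_id >= vocab_size
    ]


TOKENIZER_VOCAB_SIZE = 64789  # tokenizer vocab size reported in the thread
OUTPUT_HEAD_SIZE = 65024      # classification head size reported in the thread

# Hypothetical generated sequence; only 64881 comes from the thread.
generated_ids = [30910, 64881, 13]

print(find_out_of_range_ids(generated_ids, TOKENIZER_VOCAB_SIZE))  # [(1, 64881)]
print(OUTPUT_HEAD_SIZE - TOKENIZER_VOCAB_SIZE)                     # 235
```

With these numbers, the output layer can emit 235 ids that the tokenizer has no piece for, so any id in that gap reproduces the IndexError at decode time.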

moon-fall avatar Jul 02 '23 15:07 moon-fall

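Then I am not sure. I first got this error because I had changed the line that loads the tokenizer. After I changed it back and modified only the line that loads the model from a local path, everything worked. When I ran the demo, I noticed that it downloads the tokenizer remotely by itself.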

everydoc avatar Jul 02 '23 17:07 everydoc

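Inference works fine with the original model, but inference with a full-parameter fine-tuned model raises the same error. I checked that every json and py file is identical to the original model's. What could be the cause?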

michael0905 avatar Jul 27 '23 07:07 michael0905

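> Same here. With the model downloaded from HF, the error is raised while decoding. I see that the tokenizer's vocab size is 64789, but a token generated on my side has id 64881, which is out of range. I also checked that the size of ChatGLM2's classification head (output layer) is 65024. It looks like the tokenizer does not match?

I found that this is indeed the case for me too. How did you solve it in the end?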

ysanimals avatar Aug 10 '23 03:08 ysanimals

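> Same here. With the model downloaded from HF, the error is raised while decoding. I see that the tokenizer's vocab size is 64789, but a token generated on my side has id 64881, which is out of range. I also checked that the size of ChatGLM2's classification head (output layer) is 65024. It looks like the tokenizer does not match?

You are right, the numbers really do not match. After fine-tuning, the model predicts some indices that fall outside vocab_size. I do not know why the official vocab_size and the classification head size are inconsistent. For now, the simple brute-force workaround is to tweak their SPTokenizer slightly so that any out-of-range index just returns an empty string.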

    def convert_id_to_token(self, index):
        """Converts an index (integer) into a token (str) using the vocab."""
        # Original condition, before the workaround:
        # if index in self.index_special_tokens or index in [self.eos_id, self.bos_id, self.pad_id] or index < 0:
        # Workaround: also map ids at or beyond the SentencePiece vocab size (self.n_words) to "".
        if index in self.index_special_tokens or index in [self.eos_id, self.bos_id, self.pad_id] or index < 0 or index >= self.n_words:
            return ""
        return self.sp_model.IdToPiece(index)
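To see the effect of the guard in isolation, here is a toy sketch that runs without the model or SentencePiece. `ToySPTokenizer` is a hypothetical stand-in, not the real ChatGLM2 class: it keeps its pieces in a plain list and only imitates the IndexError that SentencePiece raises.

```python
class ToySPTokenizer:
    """Hypothetical stand-in for SPTokenizer, backed by a plain list of pieces."""

    def __init__(self, pieces):
        self.pieces = list(pieces)
        self.n_words = len(self.pieces)

    def _id_to_piece(self, index):
        # Imitates SentencePiece raising IndexError for ids outside the vocab.
        if index < 0 or index >= self.n_words:
            raise IndexError("piece id is out of range.")
        return self.pieces[index]

    def convert_id_to_token(self, index):
        # Same guard as the workaround above: out-of-range ids become "".
        if index < 0 or index >= self.n_words:
            return ""
        return self._id_to_piece(index)

    def decode(self, ids):
        """Join the pieces for all ids, silently dropping out-of-range ones."""
        return "".join(self.convert_id_to_token(i) for i in ids)


toy = ToySPTokenizer(["he", "llo"])
print(toy.decode([0, 1, 5]))  # hello
```

Id 5 is outside the two-piece vocabulary and is dropped instead of raising. The guard hides the mismatch rather than fixing it: whatever the model meant by that id is lost from the output.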

zouweidong91 avatar Sep 15 '23 06:09 zouweidong91

After LoRA fine-tuning, inference hits the same "IndexError: piece id is out of range." problem.

enddlesswm avatar Oct 23 '23 08:10 enddlesswm