ChatGLM-6B
[BUG/Help] CUDA error when running the inference example
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
Running the provided inference example raises the following error:
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("chatglm-6b", trust_remote_code=True, revision="v1.1.0")
model = AutoModel.from_pretrained("chatglm-6b", trust_remote_code=True, revision="v1.1.0").quantize(8).half().cuda()
model = model.eval()
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
print(response)
```
```
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 8/8 [00:12<00:00, 1.62s/it]
The dtype of attention mask (torch.int64) is not bool
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    response, history = model.chat(tokenizer, "你好", history=[])
  File "/data/xdr/anaconda3/envs/chatglm/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/data/xdr/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1285, in chat
    outputs = self.generate(**inputs, **gen_kwargs)
  File "/data/xdr/anaconda3/envs/chatglm/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/data/xdr/anaconda3/envs/chatglm/lib/python3.7/site-packages/transformers/generation/utils.py", line 1462, in generate
    **model_kwargs,
  File "/data/xdr/anaconda3/envs/chatglm/lib/python3.7/site-packages/transformers/generation/utils.py", line 2472, in sample
    output_hidden_states=output_hidden_states,
  File "/data/xdr/anaconda3/envs/chatglm/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/xdr/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1199, in forward
    return_dict=return_dict,
  File "/data/xdr/anaconda3/envs/chatglm/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/xdr/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1003, in forward
    output_attentions=output_attentions
  File "/data/xdr/anaconda3/envs/chatglm/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/xdr/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 634, in forward
    output_attentions=output_attentions
  File "/data/xdr/anaconda3/envs/chatglm/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/xdr/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 445, in forward
    mixed_raw_layer = self.query_key_value(hidden_states)
  File "/data/xdr/anaconda3/envs/chatglm/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/xdr/.cache/huggingface/modules/transformers_modules/chatglm-6b/quantization.py", line 147, in forward
    output = W8A16Linear.apply(input, self.weight, self.weight_scale, self.weight_bit_width)
  File "/data/xdr/.cache/huggingface/modules/transformers_modules/chatglm-6b/quantization.py", line 53, in forward
    output = inp.mm(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
```
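For context, the failure happens inside `W8A16Linear` when it multiplies the FP16 activations with the weights via `inp.mm(weight.t())`, i.e. a plain half-precision GEMM going through cuBLAS's `cublasGemmEx`. The following sketch is not from the repo and the shapes are arbitrary; it only assumes a CUDA-capable GPU, and checks whether FP16 matrix multiplication works on this card outside of ChatGLM at all:

```python
import torch

# Standalone FP16 GEMM check, mirroring the call that fails
# inside quantization.py (output = inp.mm(weight.t())).
a = torch.randn(16, 4096, dtype=torch.half, device="cuda")
w = torch.randn(12288, 4096, dtype=torch.half, device="cuda")
out = a.mm(w.t())  # exercises the same cuBLAS FP16 GEMM path
print(out.shape, out.dtype)
```

If this snippet raises the same CUBLAS_STATUS_INVALID_VALUE error, the problem likely lies in the GPU/driver/PyTorch-CUDA combination rather than in the ChatGLM quantization code.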
Expected Behavior
No response
Steps To Reproduce
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("chatglm-6b", trust_remote_code=True, revision="v1.1.0")
model = AutoModel.from_pretrained("chatglm-6b", trust_remote_code=True, revision="v1.1.0").quantize(8).half().cuda()
model = model.eval()
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
print(response)
```
Environment
- OS:
- Python: 3.7
- Transformers: 4.27.1
- PyTorch: 1.13.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :True
- CUDA Version: 11.6
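The OS field above is left blank; a small helper like the one below (a generic sketch using only standard `platform` and `torch` calls, not part of the issue template) can collect the missing environment details in one go:

```python
import platform
import torch

# Print the details requested by the Environment section of the template
print("OS:", platform.platform())
print("Python:", platform.python_version())
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version (PyTorch build):", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```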
Anything else?
No response