ChatGLM-6B
[BUG/Help] Under Linux, the chatglm-6b-int4 model cannot be loaded on GPU
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
Loading the chatglm-6b-int4 model on GPU, the kernel compilation fails:
>>> from transformers import AutoTokenizer, AutoModel
>>> model = AutoModel.from_pretrained("./chatglm-6b-int4", trust_remote_code=True).half().cuda()
Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
No compiled kernel found.
Compiling kernels : /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels_parallel.c
Compiling gcc -O3 -pthread -fopenmp -std=c99 /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels_parallel.c -shared -o /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels_parallel.so
/usr/bin/ld: /tmp/ccjNdJf6.o: relocation R_X86_64_32 against `.text' can not be used when making a shared object; recompile with -fPIC
/tmp/ccjNdJf6.o: error adding symbols: Bad value
collect2: error: ld returned 1 exit status
Compile failed, using default cpu kernel code.
Compiling gcc -O3 -fPIC -std=c99 /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels.c -shared -o /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels.so
Kernels compiled : /home/pollymars/.cache/huggingface/modules/transformers_modules/local/quantization_kernels.so
Cannot load cpu kernel, don't use quantized model on cpu.
Using quantization cache
Applying quantization to glm layers
Calling the model then returns: RuntimeError: CUDA Error: no kernel image is available for execution on the device
I tried compiling the kernel manually and passing it in explicitly, but that did not solve it; the compilation flow above still runs again and ends with the same CUDA Error:
gcc -fPIC -pthread -fopenmp -std=c99 quantization_kernels_parallel.c -shared -o quantization_kernels_parallel.so
model = AutoModel.from_pretrained("./chatglm-6b-int4", trust_remote_code=True).half().cuda()
model = model.quantize(bits=4, kernel_file="/home/pollymars/ChatGLM-6B-main/chatglm-6b-int4/quantization_kernels_parallel.so")
Loading the chatglm-6b-int4 model on CPU with a manually compiled, explicitly specified kernel does run the model successfully, but inference is slow.
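For reference, a minimal sketch of that CPU fallback, assuming the manually compiled quantization_kernels.so sits in the model directory (the quantize(bits=..., kernel_file=...) pattern is the same one used above; the paths are illustrative):
>>> from transformers import AutoModel
>>> # load in float32 on CPU, then point quantize() at the hand-compiled kernel
>>> model = AutoModel.from_pretrained("./chatglm-6b-int4", trust_remote_code=True).float()
>>> model = model.quantize(bits=4, kernel_file="./chatglm-6b-int4/quantization_kernels.so")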
Expected Behavior
No response
Steps To Reproduce
Loading the chatglm-6b-int4 model under Linux fails to compile the GPU kernel; manually compiling and specifying the kernel does not solve it either.
Environment
- OS: Ubuntu 16.04 (5.4.0-6ubuntu1~16.04.9)
- Python: 3.8.5
- Transformers: 4.26.1
- PyTorch: 1.13.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : True
- gcc: 5.4.0 20160609
- openmp: 201307
Anything else?
No response
Is the code in "./chatglm-6b-int4" the latest version?
The CUDA Error is unrelated to the CPU kernel; it might be a problem with cpm_kernels. In the latest code, if loading cpm_kernels fails, you would see the following output:
Failed to load cpm_kernels:
It could also be a CUDA problem.
As a quick test, try:
>>> test = torch.Tensor([1, 2, 3, 4])
>>> test = test.cuda()
>>> test
If that fails, which GPU are you using?
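(For reference, the device name and compute capability can also be read from the same session; torch.cuda.get_device_name and torch.cuda.get_device_capability are standard PyTorch calls:)
>>> import torch
>>> torch.cuda.get_device_name(0)        # which GPU
>>> torch.cuda.get_device_capability(0)  # (major, minor) compute capability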
"./chatglm-6b-int4"中就是huggingface上面对应的代码和模型文件。 没有看到关于cpm_kernels的报错。 显卡是A100,我测试下。 有没有可能是gcc和openmp的版本问题?谢谢!
I don't think it is a gcc or OpenMP problem.
RuntimeError: CUDA Error: no kernel image is available for execution on the device
This error suggests CUDA cannot actually be used (even though torch.cuda.is_available() == True, a mismatched torch/CUDA pairing can still prevent kernels from running correctly).
>>> test = torch.Tensor([1, 2, 3, 4])
>>> test = test.cuda()
>>> test
Does this snippet run successfully? It tests whether CUDA is usable.
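Note that a plain .cuda() copy can succeed even when compute kernels fail, so as an additional (hedged) check it is worth forcing an actual kernel launch:
>>> test = test * 2 + 1   # launches a real CUDA kernel rather than just a host-to-device copy
>>> test.sum()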
Yes, that runs successfully. I updated the code in "./chatglm-6b-int4", and the kernel now compiles, but model inference still raises "RuntimeError: CUDA Error: no kernel image is available for execution on the device". The traceback points to line 235 of quantization.py, i.e. the call to kernels.int4WeightExtractionHalf, which enters cpm_kernels/library/cuda.py and fails at the line checkCUStatus(cuda.cuModuleLoadData(ctypes.byref(module), data)).
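(To help rule out the torch/CUDA mismatch suggested above, the CUDA version this PyTorch build targets can be printed; torch.version.cuda is a standard attribute:)
>>> import torch
>>> torch.version.cuda   # CUDA version PyTorch was compiled against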
This might be similar to the problem in issue #119.
That thread says compute capability 6.1 or above is required, which is quite a trap: https://github.com/OpenBMB/BMInf/issues/29#issuecomment-1000951808
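If that threshold is the concern, it can be checked directly (the 6.1 figure comes from the linked BMInf issue; get_device_capability is standard PyTorch and returns a (major, minor) tuple):
>>> import torch
>>> torch.cuda.get_device_capability(0) >= (6, 1)   # True means the card meets the reported minimum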
Duplicate of #119