ChatGLM-6B
[BUG/Help] Compile default cpu kernel failed when installing the CPU (int4) version on Linux
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
On CentOS, running cli_demo.py fails with a "Compile default cpu kernel failed" error. I have verified that gcc and OpenMP are installed, so how can I fix this?
Checking gcc:
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)
Checking OpenMP:
rpm -qa | grep libgomp
libgomp-4.8.5-44.el7.x86_64
The error log is as follows:
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
No compiled kernel found.
Compiling kernels : /root/.cache/huggingface/modules/transformers_modules/model/quantization_kernels_parallel.c
Compiling gcc -O3 -fPIC -pthread -fopenmp -std=c99 /root/.cache/huggingface/modules/transformers_modules/model/quantization_kernels_parallel.c -shared -o /root/.cache/huggingface/modules/transformers_modules/model/quantization_kernels_parallel.so
Compile default cpu kernel failed, using default cpu kernel code.
Compiling gcc -O3 -fPIC -std=c99 /root/.cache/huggingface/modules/transformers_modules/model/quantization_kernels.c -shared -o /root/.cache/huggingface/modules/transformers_modules/model/quantization_kernels.so
Compile default cpu kernel failed.
Failed to load kernel.
Cannot load cpu kernel, don't use quantized model on cpu.
Using quantization cache
Applying quantization to glm layers
Welcome to the ChatGLM-6B model. Enter text to chat, type 'clear' to clear the conversation history, 'stop' to exit the program.
User: nihao
Traceback (most recent call last):
File "/home/manager/ChatGLM-6B/cli_demo.py", line 58, in <module>
main()
File "/home/manager/ChatGLM-6B/cli_demo.py", line 43, in main
for response, history in model.stream_chat(tokenizer, query, history=history):
File "/root/miniconda3/envs/chatglm/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "/root/.cache/huggingface/modules/transformers_modules/model/modeling_chatglm.py", line 1311, in stream_chat
for outputs in self.stream_generate(**inputs, **gen_kwargs):
File "/root/miniconda3/envs/chatglm/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "/root/.cache/huggingface/modules/transformers_modules/model/modeling_chatglm.py", line 1388, in stream_generate
outputs = self(
File "/root/miniconda3/envs/chatglm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/model/modeling_chatglm.py", line 1190, in forward
transformer_outputs = self.transformer(
File "/root/miniconda3/envs/chatglm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/model/modeling_chatglm.py", line 996, in forward
layer_ret = layer(
File "/root/miniconda3/envs/chatglm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/model/modeling_chatglm.py", line 627, in forward
attention_outputs = self.attention(
File "/root/miniconda3/envs/chatglm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/model/modeling_chatglm.py", line 445, in forward
mixed_raw_layer = self.query_key_value(hidden_states)
File "/root/miniconda3/envs/chatglm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/model/quantization.py", line 388, in forward
output = W8A16LinearCPU.apply(input, self.weight, self.weight_scale, self.weight_bit_width,
File "/root/miniconda3/envs/chatglm/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/root/.cache/huggingface/modules/transformers_modules/model/quantization.py", line 80, in forward
weight = extract_weight_to_float(quant_w, scale_w, weight_bit_width, quantization_cache=quantization_cache)
File "/root/.cache/huggingface/modules/transformers_modules/model/quantization.py", line 316, in extract_weight_to_float
func(
TypeError: 'NoneType' object is not callable
I saw a post suggesting to check whether the quantization_kernels_parallel.so file exists. I checked, and the file indeed does not exist, so is the problem that this file fails to be compiled automatically? How can I fix that?
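For reference, this is the minimal check I used, with the cache path copied from the compile log above (the path will differ on other machines):

import os

# Path copied from the "Compiling gcc ..." line in the log above; adjust to your own HF cache location.
so_path = "/root/.cache/huggingface/modules/transformers_modules/model/quantization_kernels_parallel.so"
print(so_path, "exists:", os.path.exists(so_path))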
Expected Behavior
The demo runs normally.
Steps To Reproduce
- First, create a model folder in the project root, download the model files from the Hugging Face page (https://huggingface.co/THUDM/chatglm-6b-int4/tree/main), and save all of them into the model folder.
- Then create a conda environment and install the dependencies:
conda create -n chatglm python=3.9
conda activate chatglm
pip install -r requirements.txt
- Modify cli_demo.py to use the local model path ("model") and CPU deployment (.float()); see the sketch after these steps:
tokenizer = AutoTokenizer.from_pretrained("model", trust_remote_code=True)
model = AutoModel.from_pretrained("model", trust_remote_code=True).float()
- Run cli_demo.py, which then fails with the error above:
python cli_demo.py
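For reference, after these modifications the model-loading part of cli_demo.py looks roughly like this (only the two load lines are changed; the rest of the demo is left as shipped):

from transformers import AutoTokenizer, AutoModel

# "model" is the local folder containing the chatglm-6b-int4 files downloaded above
tokenizer = AutoTokenizer.from_pretrained("model", trust_remote_code=True)
model = AutoModel.from_pretrained("model", trust_remote_code=True).float()  # .float() for CPU deployment
model = model.eval()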
Environment
- OS: Linux version 3.10.0-957.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Thu Nov 8 23:39:32 UTC 2018
- Python: 3.9.16
- Transformers: 4.27.1
- PyTorch: 2.0.1
- CUDA Support: False
Anything else?
Could it be that the gcc and OpenMP versions are wrong?
I ran into the same problem. How can it be solved? Can anyone give some advice?
I have solved this problem. The cause is the gcc compilation failing; compiling the two kernel files manually fixes it.
- Manually compile the two files quantization_kernels.c and quantization_kernels_parallel.c:
gcc -fPIC -pthread -fopenmp -std=c99 quantization_kernels.c -shared -o quantization_kernels.so
gcc -fPIC -pthread -fopenmp -std=c99 quantization_kernels_parallel.c -shared -o quantization_kernels_parallel.so
- Then add one more line to the code:
model = model.quantize(bits=4, kernel_file="model/quantization_kernels.so")
With that, it runs successfully.
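Putting it together, the load sequence in cli_demo.py then looks roughly like this (this assumes the manually compiled quantization_kernels.so was placed in the local model folder; adjust kernel_file if you compiled it elsewhere):

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("model", trust_remote_code=True)
model = AutoModel.from_pretrained("model", trust_remote_code=True).float()
# Point the quantizer at the manually compiled kernel instead of relying on the failed auto-compile
model = model.quantize(bits=4, kernel_file="model/quantization_kernels.so")
model = model.eval()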