
CPU installation won't work (NameError: name 'quant_cuda' is not defined)

Open · Steelman14aUA opened this issue Apr 15 '23 · 10 comments

Describe the bug

When I ask anything of any model, I get NameError: name 'quant_cuda' is not defined.

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

Ask the model anything on a CPU installation.

Screenshot

No response

Logs

Starting the web UI...
Loading the extension "gallery"... Ok.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
No model is loaded! Select one in the Model tab.
Loading gpt4-x-alpaca-13b-native-4bit-128g...
CUDA extension not installed.
Found the following quantized model: models\gpt4-x-alpaca-13b-native-4bit-128g\gpt-x-alpaca-13b-native-4bit-128g-cuda.pt
Loading model ...
Done.
Loaded the model in 40.36 seconds.
Traceback (most recent call last):
  File "D:\Vicuna\oobabooga-windows\text-generation-webui\modules\callbacks.py", line 66, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "D:\Vicuna\oobabooga-windows\text-generation-webui\modules\text_generation.py", line 251, in generate_with_callback
    shared.model.generate(**kwargs)
  File "D:\Vicuna\oobabooga-windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\Vicuna\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
    return self.sample(
  File "D:\Vicuna\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
    outputs = self(
  File "D:\Vicuna\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Vicuna\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "D:\Vicuna\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Vicuna\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "D:\Vicuna\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Vicuna\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "D:\Vicuna\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Vicuna\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 196, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "D:\Vicuna\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Vicuna\oobabooga-windows\text-generation-webui\repositories\GPTQ-for-LLaMa\quant.py", line 426, in forward
    quant_cuda.vecquant4matmul(x, self.qweight, y, self.scales, self.qzeros, self.groupsize)
NameError: name 'quant_cuda' is not defined
Output generated in 0.64 seconds (0.00 tokens/s, 0 tokens, context 33, seed 213405660)

System Info

i7

Steelman14aUA avatar Apr 15 '23 14:04 Steelman14aUA

I think you can't use a 4-bit model on CPU.

Ph0rk0z avatar Apr 15 '23 16:04 Ph0rk0z

It's because GPTQ_load uses the quant module from the GPTQ cuda branch.


quant requires the compiled quant_cuda extension (built from quant_cuda.cpp).

The GPTQ triton branch doesn't use quant_cuda, but I don't think the oobabooga text UI is set up for that.
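
For context, the cuda branch of GPTQ-for-LLaMa guards the extension import roughly like this (a simplified sketch, not the exact file; the printed string matches the "CUDA extension not installed." line in the log above):

    # quant.py (cuda branch), simplified sketch
    try:
        import quant_cuda  # the compiled C++/CUDA extension
    except ImportError:
        print('CUDA extension not installed.')

    # QuantLinear.forward later calls quant_cuda unconditionally, so a failed
    # import only surfaces at generation time, as the traceback above shows:
    #   quant_cuda.vecquant4matmul(x, self.qweight, y, self.scales, self.qzeros, self.groupsize)
    # ...which raises NameError: name 'quant_cuda' is not defined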

Erika-wby avatar Apr 16 '23 07:04 Erika-wby

I have the exact same problem and I'm on a good GPU, so I don't think it's because you're on CPU.

Zach9113 avatar Apr 16 '23 17:04 Zach9113


You've got to set up quant_cuda, Zach.
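
For reference, the extension is normally built by running the setup script inside the GPTQ-for-LLaMa checkout (this assumes the cuda branch and a working CUDA toolchain on the machine):

    cd text-generation-webui/repositories/GPTQ-for-LLaMa
    python setup_cuda.py install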

Erika-wby avatar Apr 16 '23 18:04 Erika-wby

It says it cannot find triton 2.0 when I try to install the requirements.

Zach9113 avatar Apr 17 '23 18:04 Zach9113

I started from scratch and ended up in the same spot. I did see an error during the install, and the CUDA version became 0.0.0. I know it needs to be changed to 11.8, but I have no idea how.

Zach9113 avatar Apr 17 '23 19:04 Zach9113

@Steelman14aUA, the model you are using indicates it's for CUDA (the filename ends in -cuda.pt) and does not support CPU; use the non-CUDA model.

The newest version of text-generation-webui supports GPTQ triton.

Erika-wby avatar Apr 17 '23 19:04 Erika-wby

I too have been trying to get CPU generation working, without success. I tried cloning the triton repo from oobabooga, but it seems to have been refactored and is now missing dependencies (specifically the modelutils.py file). I tried using the last commit from the repo that still has this file, and after passing in the --no-warmup_autotune flag I can get it running without error messages. Now, though, when I try generating output, nothing is returned.

Is there a known "good" commit from the repo that we should be using, or am I missing something else?
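
(For anyone retracing this, git's path-filtered log is one way to locate the last commit that still contained modelutils.py; the <commit> below is a placeholder, and if the newest entry is the deletion itself you want its parent:)

    git log --oneline -- modelutils.py   # commits that touched the file, newest first
    git checkout <commit>                # placeholder hash; use <commit>~1 if <commit> removed the file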

CubeTheThird avatar Apr 19 '23 01:04 CubeTheThird

I think it's better to convert that model to llama.cpp GGML and use it on CPU that way, through the wrapper.
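
Once a GGML build of the model is in the models folder, the web UI's llama.cpp wrapper can load it on CPU; a sketch, where the model folder name is hypothetical:

    python server.py --cpu --model gpt4-x-alpaca-13b-ggml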

Ph0rk0z avatar Apr 19 '23 02:04 Ph0rk0z

Yes, you're right. Using a ggml model seems to have worked. Thanks!

CubeTheThird avatar Apr 19 '23 03:04 CubeTheThird

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.

github-actions[bot] avatar May 24 '23 23:05 github-actions[bot]

I am still facing this issue and getting the same error. Here is the command I am running:

$ python server.py --listen --wbits 4 --model MetaIX_GPT4-X-Alpaca-30B-4bit --gptq-for-llama --pre_layer 30 60

Please let me know if I missed something.

kdubey22 avatar Jun 23 '23 11:06 kdubey22