
ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU.You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object

dinchu opened this issue 2 years ago · 10 comments

When trying to load quantized models, I always get:

ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU.You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object

dinchu · Sep 21 '23 18:09

Hi, may I ask how you load the model? In my case, with a single GPU, I also had that problem and had to use `disable_exllama=True` while loading the model (if you are loading it directly from a file, edit the config.json in your model folder and add `"disable_exllama": true` to its `quantization_config` section). When I worked with 2 GPUs, I did not have this problem. Sorry if this does not answer your question, but I hope it helps. Sadly, I do not know why this happens.
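The config.json edit described above can also be scripted. A minimal sketch, assuming the model lives in a local directory; the helper name `disable_exllama_in_config` is made up for illustration:

```python
import json
from pathlib import Path

def disable_exllama_in_config(model_dir: str) -> dict:
    """Add "disable_exllama": true to the quantization_config
    block of a model's config.json, creating the block if absent."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text())
    config.setdefault("quantization_config", {})["disable_exllama"] = True
    # Write the patched config back so transformers picks it up on load.
    config_path.write_text(json.dumps(config, indent=2))
    return config
```

After patching, loading the model from that directory should fall back to the non-exllama CUDA kernels instead of raising the ValueError.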

aliozts · Sep 21 '23 21:09

That works, thanks. @aliozts

ilovesouthpark · Oct 05 '23 13:10

> Hi, may I ask how you load the model? In my case, with a single GPU, I also had that problem and had to use `disable_exllama=True` while loading the model (if you are loading it directly from a file, edit the config.json in your model folder and add `"disable_exllama": true` to its `quantization_config` section). When I worked with 2 GPUs, I did not have this problem.

Disabling Exllama makes inference much slower.

Check out https://github.com/PanQiWei/AutoGPTQ/issues/406 for how to enable Exllama.

tigerinus · Nov 07 '23 04:11

Why can't I run it on the GPU? I have an NVIDIA GeForce MX450. Could anyone please help?

Saravan004 · Nov 30 '23 11:11

I have an NVIDIA GTX 1650 and still get the same error.

apoorvpandey0 · Dec 24 '23 21:12

Adding `"disable_exllama": true` under `quantization_config` in config.json solves the problem. This error only appears with a single GPU; it never occurred for me with multiple GPUs. The GPU I used was a Tesla T4.

chenyujiang11 · Jan 08 '24 12:01

I was running into a similar problem running GPTQ in a Docker container: I was getting the disable_exllama error. In short, the issue showed up when I ran the container without the `--gpus all` flag. Below is my system config:

- GPU: 1660 Ti
- transformers==4.36.2
- optimum==1.16.1
- auto-gptq==0.6.0+cu118
- CUDA=12.3

SOLUTION: for me, the disable_exllama error went away once I ran the container with `--gpus all`.
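Without that flag the container cannot see the GPU, so the quantized modules end up on CPU and the exllama check fails. A sketch of the invocation (the image name `my-gptq-image` is hypothetical; the host needs the NVIDIA Container Toolkit installed):

```shell
# Pass all host GPUs through to the container, then verify CUDA is visible.
docker run --rm --gpus all my-gptq-image \
  python -c "import torch; print(torch.cuda.is_available())"
```

If that prints `False`, the model will load on CPU and the same ValueError will appear inside the container.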

UmiVilbig · Jan 10 '24 02:01

I am also facing the same issue. Disabling exllama slows inference down a lot, so I am not sure that is the ideal fix.

Here are more details - https://github.com/lm-sys/FastChat/issues/3530

NamburiSrinath · Sep 19 '24 21:09

I ran into the same problem: launching the model Qwen2.5-32B-Instruct-GPTQ-Int4 with the dbgpt project raised the error described in the title. I tracked it down to the config.json file; changing `use_exllama` from `true` to `false` solved it. Hope this helps.
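Note that newer transformers releases renamed the flag: `disable_exllama` was deprecated in favor of `use_exllama`, with the meaning inverted. A hedged sketch that sets both keys in config.json so the edit takes effect regardless of which transformers version reads the file (the helper name is made up for illustration):

```python
import json
from pathlib import Path

def turn_off_exllama(model_dir: str) -> dict:
    """Disable the exllama kernels in config.json, covering both
    the older and the newer key names used by transformers."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text())
    quant = config.setdefault("quantization_config", {})
    quant["disable_exllama"] = True  # key read by older transformers releases
    quant["use_exllama"] = False     # key read by newer transformers releases
    config_path.write_text(json.dumps(config, indent=2))
    return config
```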

kj-1024 · Oct 11 '24 07:10

I am still getting the same ValueError even though I have added `"disable_exllama": true` to my model's config.json.

Monesi-dev · Apr 19 '25 09:04