ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU. You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object
When trying to load quantized models I always get:
ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU. You can deactivate exllama backend by setting disable_exllama=True in the quantization config object
Hi, may I ask how you load the model? In my case, with a single GPU, I also had that problem and had to use `disable_exllama=True` while loading the model (or change the `config.json` in your model folder and add `disable_exllama: true` to `quantization_config` there if you're loading it directly from files). When I worked with 2 GPUs, I did not have this problem. Sorry if that does not answer your question, but I hope it helps. Sadly I do not know why this happens.
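For reference, a minimal sketch of that approach through transformers (the model id is just a placeholder; on newer transformers versions the flag is `use_exllama=False` instead of `disable_exllama=True`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"  # placeholder, use your own GPTQ model

# Override the quantization config shipped with the model so the
# Exllama kernels are skipped (they require every module to be on GPU).
quantization_config = GPTQConfig(bits=4, disable_exllama=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```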
That works, thanks @aliozts.
Disabling Exllama makes inference much slower.
Check out https://github.com/PanQiWei/AutoGPTQ/issues/406 for how to enable Exllama.
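If your GPU actually has enough memory, the other direction is to keep Exllama enabled and make sure no module gets offloaded to cpu/disk. A rough sketch, assuming the whole model fits on a single GPU (the model id is a placeholder):

```python
from transformers import AutoModelForCausalLM

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"  # placeholder

# device_map={"": 0} pins every module to GPU 0, so nothing lands on
# cpu/disk and the Exllama backend can stay active.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map={"": 0})
```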
Why can't I run it on the GPU? I have an NVIDIA GeForce MX450. Could anyone please help?
I have an NVIDIA GTX 1650 and am still getting the same error.
Adding "disable_exllama": true under quantization_config in config.json solves the problem. This error only shows up with a single GPU; it never appeared with multiple GPUs. The GPU used was a Tesla T4.
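If you prefer to patch the file programmatically rather than by hand, a small sketch (the path is a placeholder):

```python
import json

config_path = "/path/to/your/model/config.json"  # placeholder

with open(config_path) as f:
    config = json.load(f)

# Tell the loader not to use the Exllama kernels for this GPTQ checkpoint.
config["quantization_config"]["disable_exllama"] = True

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```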
I was running into a similar problem running GPTQ in a Docker container and kept getting the disable_exllama error. In short, the issue showed up when I ran the container without the --gpus all flag. Below is my system config:
GPU: 1660 Ti
transformers==4.36.2
optimum==1.16.1
auto-gptq==0.6.0+cu118
CUDA: 12.3
SOLUTION: I fixed the disable_exllama error by running the container with --gpus all.
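In other words, something like the line below (the image name is a placeholder); without --gpus all the container sees no GPU at all, so every module falls back to cpu/disk:

```
docker run --gpus all -it your-image:latest
```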
I am also facing the same issue. Disabling exllama slows down inference a lot, so I am not sure that's the ideal way.
Here are more details - https://github.com/lm-sys/FastChat/issues/3530
I ran into the same problem: launching the model Qwen2.5-32B-Instruct-GPTQ-Int4 with the dbgpt project raised the error in the title. I located the config.json file and changed use_exllama from true to false, which solved it. Hope this helps.
I am still getting the same ValueError even though I have added "disable_exllama": true to the config.json of my model.