lollms-webui
Exllama does not work with CPU only
Expected Behavior
Use the CPU with the Hugging Face binding.
Current Behavior
Using device map: cpu
Couldn't load model.
Couldn't load model. Please verify your configuration file at /mnt/games_fast/lollms_data/configs or use the next menu to select a valid model
Binding returned this exception : Found modules on cpu/disk. Using Exllama or Exllamav2 backend requires all the modules to be on GPU.You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object
Traceback (most recent call last):
File "/mnt/games_fast/lollms-webui/lollms-webui/lollms_core/lollms/app.py", line 257, in load_model
model = ModelBuilder(self.binding).get_model()
File "/mnt/games_fast/lollms-webui/lollms-webui/lollms_core/lollms/binding.py", line 597, in __init__
self.build_model()
File "/mnt/games_fast/lollms-webui/lollms-webui/lollms_core/lollms/binding.py", line 600, in build_model
self.model = self.binding.build_model()
File "/mnt/games_fast/lollms-webui/lollms-webui/zoos/bindings_zoo/hugging_face/__init__.py", line 209, in build_model
self.model = AutoModelForCausalLM.from_pretrained(str(model_path),
File "/mnt/games_fast/lollms-webui/installer_files/lollms_env/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
return model_class.from_pretrained(
File "/mnt/games_fast/lollms-webui/installer_files/lollms_env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3784, in from_pretrained
model = quantizer.post_init_model(model)
File "/mnt/games_fast/lollms-webui/installer_files/lollms_env/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 583, in post_init_model
raise ValueError(
ValueError: Found modules on cpu/disk. Using Exllama or Exllamav2 backend requires all the modules to be on GPU.You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object
personal_models_path: /mnt/games_fast/lollms_data/models
Binding name:hugging_face
Model name:WizardCoder-Python-7B-V1.0-GPTQ
Steps to Reproduce
1. Select the hugging_face binding.
2. Select WizardCoder-Python-7B-V1.0-GPTQ.
3. Select cpu in the Hugging Face settings.
Possible Solution
No idea; a possible workaround is sketched below.
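For what it's worth, the error message itself points at a workaround: deactivating the exllama backend by setting `disable_exllama=True` in the quantization config. Below is a minimal sketch of what that could look like if the model were loaded directly with transformers (in lollms the equivalent `from_pretrained` call lives in `zoos/bindings_zoo/hugging_face/__init__.py`, per the traceback). This assumes a transformers version that still accepts `disable_exllama` (newer releases renamed it to `use_exllama=False`) and the `TheBloke/WizardCoder-Python-7B-V1.0-GPTQ` repo id; even with exllama disabled, GPTQ inference on CPU may still be unsupported or impractically slow.

```python
from transformers import AutoModelForCausalLM, GPTQConfig

# Override the quantization config shipped with the model so the
# exllama kernels are turned off, as the error message suggests.
quantization_config = GPTQConfig(bits=4, disable_exllama=True)

# Hypothetical direct load; repo id and kwargs are illustrative.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/WizardCoder-Python-7B-V1.0-GPTQ",  # assumed HF repo id
    device_map="cpu",
    quantization_config=quantization_config,
)
```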
Context
I can't use the GPU, since 8 GB of VRAM isn't enough for most good models.
I am pretty sure exllama only works with GPU models: "A standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights, designed to be fast and memory-efficient on modern GPUs."
Hi. Yes, I'm sorry about that, but exllama is a GPU-only binding.