text-generation-webui
ModuleNotFoundError: No module named 'llama_inference_offload'
Describe the bug
Every time I try to select a model, this happens.
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
I downloaded the zip.
Screenshot
ModuleNotFoundError: No module named 'llama_inference_offload'
Logs
Starting the web UI...
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: Loading binary D:\AI\Project Hyacint\Text AI\oobabooga-windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.dll...
D:\AI\Project Hyacint\Text AI\oobabooga-windows\installer_files\env\lib\site-packages\bitsandbytes\cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
The following models are available:
1. facebook_opt-6.7b
2. gpt4-x-alpaca-13b-native-4bit-128g
Which one do you want to load? 1-2
1
Loading facebook_opt-6.7b...
Traceback (most recent call last):
File "D:\AI\Project Hyacint\Text AI\oobabooga-windows\text-generation-webui\server.py", line 302, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "D:\AI\Project Hyacint\Text AI\oobabooga-windows\text-generation-webui\modules\models.py", line 100, in load_model
from modules.GPTQ_loader import load_quantized
File "D:\AI\Project Hyacint\Text AI\oobabooga-windows\text-generation-webui\modules\GPTQ_loader.py", line 14, in <module>
import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'
Press any key to continue . . .
System Info
I use a laptop with an Intel Core i7 CPU, running Windows 10.
Same on an AMD CPU.
Same here on an M1 Pro Mac.
EDIT: Just so I know whether I'm attempting something possible: is there currently any way to use gpt4-x-alpaca-13b-native-4bit-128g with the WebUI on M-series Macs? Has anyone managed it?
Same here with an AMD CPU + NVIDIA GPU.
That error message indicates you don't have GPTQ installed. See https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#4-bit-mode for info.
It likely won't work for anyone not using an NVIDIA GPU right now. CPU models might be a better option for non-NVIDIA users for the time being.
Same error on a Windows WSL/Ubuntu setup.
The error persists after installing the module with:
python -m pip install llama_cpp_python-0.1.26-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
This wheel was extracted from the artifacts.
python -m pip list | grep llama
llama-cpp-python 0.1.26
So llama_inference_offload is still not available...
The dependency was added by PR 460.
@UrielCh llama-cpp-python is for CPU inference; this error message comes from GPTQ, which is for GPU inference. You're likely trying to load a GPU model by mistake instead of a CPU model (you can recognize a CPU model by the ggml- prefix its files usually have).
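For what it's worth, llama_inference_offload isn't a pip package at all: it's a file the webui expects to find inside a clone of GPTQ-for-LLaMa under repositories/. A quick hedged check (the exact path is my assumption based on how modules/GPTQ_loader.py imports it, so verify against your copy):
ls repositories/GPTQ-for-LLaMa/llama_inference_offload.py
If that file isn't there, no pip install will help; the repo needs to be cloned as described below.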
In the project root:
$ mkdir -p repositories
$ cd repositories
$ git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
Try starting server.py again.
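For anyone on an NVIDIA GPU, a hedged sketch of the full sequence (the setup_cuda.py step is how the GPTQ-for-LLaMa repo built its 4-bit CUDA kernel at the time; the wiki page linked above and that repo's README are the authoritative instructions):
cd text-generation-webui
mkdir -p repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
python setup_cuda.py install   # compiles the CUDA kernel; needs the CUDA toolkit and an NVIDIA GPU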
I think I tried to start the process on my GPU:
python server.py --auto-devices --chat --wbits 4 --groupsize 128
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/uriel/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/uriel/anaconda3/envs/textgen did not contain libcudart.so as expected! Searching further paths...
warn(msg)
/home/uriel/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /usr/lib/wsl/lib: did not contain libcudart.so as expected! Searching further paths...
warn(msg)
/home/uriel/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('unix')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/uriel/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
The following models are available:
Strange setup messages:
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
(so the CUDA lib is missing...)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
(but now it's present...)
1. facebook_opt-6.7b
2. gpt4-x-alpaca-13b-native-4bit-128g
3. vicuna-13b-GPTQ-4bit-128g
Which one do you want to load? 1-3
2
Loading gpt4-x-alpaca-13b-native-4bit-128g...
Could not find the quantized model in .pt or .safetensors format, exiting...
Still one error, but I need to go for now...
I double-checked all my model files, replacing all the LFS pointer references with the real files.
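If the model folder was cloned from Hugging Face with git, a hedged way to fetch the real weights behind the LFS pointers (assumes git-lfs is installed and that the folder really is a git clone; the folder name is just this thread's example model):
cd models/gpt4-x-alpaca-13b-native-4bit-128g
git lfs install
git lfs pull   # downloads the real .pt/.safetensors files the pointers refer to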
Even Colab is giving me this error:
Which one do you want to load? 1-2
1
Loading facebook_opt-1.3b...
Traceback (most recent call last):
File "/content/text-generation-webui/server.py", line 302, in <module>
It likely won't work for anyone not using an NVIDIA GPU right now. CPU models might be a better option for non-NVIDIA users for the time being.
Is the gpt4-x-alpaca-13b-native-4bit-128g model available for CPU?
In the project root:
$ mkdir -p repositories
$ cd repositories
$ git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
Try starting server.py again.
This worked, but now I'm seeing an issue that's very common here, and I still don't know how to fix it: the out-of-memory problem where 8 GB VRAM GPUs can't run some models. I'm trying to run gpt4-x-alpaca-13b-4bit-128g on an RTX 3050 8 GB with an AMD Ryzen 5 5600G.
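One thing that may help (an assumption on my part, so check python server.py --help to confirm your build has the flag): the webui has a --pre_layer option for GPTQ models that keeps only the first N layers on the GPU and runs the rest on the CPU, trading speed for VRAM:
python server.py --chat --wbits 4 --groupsize 128 --pre_layer 20   # lower the number if you still run out of VRAM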
When it asks which model you want, which of that list is CPU-compatible? It seems we still need a specially built PC to run this.
There are a lot of CPU-compatible models out there; there's a download list of popular CPU models at https://rentry.org/nur779 (disclaimer: I have no idea who maintains that).
You can recognize that a model is CPU-compatible if its files have a ggml- prefix.
There are a lot of CPU-compatible models out there; there's a download list of popular CPU models at https://rentry.org/nur779 (disclaimer: I have no idea who maintains that). You can recognize that a model is CPU-compatible if its files have a ggml- prefix.
So I found this model for gpt4-x-alpaca: https://huggingface.co/anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g/tree/main/gpt4-x-alpaca-13b-ggml-q4_1-from-gptq-4bit-128g . It has the ggml- prefix; does that mean I can use it on CPU?
Yes, it does; that's a CPU model.
I can't fix the problem with oobabooga, but for those of you who are trying to use it on a CPU, I have good news: there's an alternative, and it's very simple. It's called Koboldcpp. It's like llama.cpp but with the Kobold WebUI, so you can have all the features oobabooga has to offer, if you don't mind learning how to use the Kobold WebUI.
+ Installation:
- Go to https://github.com/LostRuins/koboldcpp (you can read the description if you want).
- Scroll down to Usage and you will see the blue Download link; click on it.
- Read the description of how to use it, then download koboldcpp.exe.
- Drag and drop the model onto it, or browse for the ggml model manually; this works for every CPU model. Wait until it finishes loading the model, then copy http://localhost:5001/ and paste it into your browser.
- You can find out more about koboldcpp and how to use it here: https://www.reddit.com/r/LocalLLaMA/comments/12cfnqk/koboldcpp_combining_all_the_various_ggmlcpp_cpu/
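As an alternative to drag-and-drop, koboldcpp can also take the model path on the command line; a hedged example (the filename here is hypothetical, and flag names may differ between versions, so check koboldcpp.exe --help):
koboldcpp.exe ggml-gpt4-x-alpaca-13b-q4_1.bin --port 5001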
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.