text-generation-webui
No module named 'llama_inference_offload' on Arch Linux
Describe the bug
Running server.py as follows: python server.py --wbits 4 --groupsize 128 fails with the error No module named 'llama_inference_offload'. I tried this fix: https://github.com/oobabooga/text-generation-webui/issues/400#issuecomment-1474876859 but it did not help.
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
Run the following command: python server.py --wbits 4 --groupsize 128
Screenshot
No response
Logs
$ python server.py --wbits 4 --groupsize 128
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /home/x/miniconda3/envs/textgen/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /home/x/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
Loading vicuna-13b-GPTQ-4bit-128g...
Traceback (most recent call last):
File "/home/x/text-generation-webui/server.py", line 308, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/home/x/text-generation-webui/modules/models.py", line 100, in load_model
from modules.GPTQ_loader import load_quantized
File "/home/x/text-generation-webui/modules/GPTQ_loader.py", line 14, in <module>
import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'
System Info
x@archlinux
OS: Arch Linux x86_64
Host: X670 GAMING X AX -CF
Kernel: 6.2.9-zen1-1-zen
Uptime: 1 hour, 51 mins
Packages: 1623 (pacman), 20 (flatpak), 7 (snap)
Shell: zsh 5.9
Resolution: 2560x1440
DE: Plasma 5.27.4
WM: kwin
WM Theme: Endless
Theme: [Plasma], Breeze [GTK3]
Icons: [Plasma], Relax-Dark-Icons [GTK2/3]
Terminal: terminator
CPU: AMD Ryzen 9 7900X (24) @ 4.700GHz
GPU: AMD ATI 16:00.0 Raphael
GPU: NVIDIA GeForce RTX 2080 Ti Rev. A
Memory: 12939MiB / 31231MiB
Follow the steps here: https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#installation
Specifically, the llama_inference_offload module is only available in the triton branch of GPTQ-for-LLaMa.
Also, I had better luck with vicuna-13b-4bit-128g on the cuda branch. You will probably need to specify --model_type llama as well. There's a lot of trial and error in the comments here.
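For example, a launch along these lines (assuming the model folder under models/ is named as in the log above; adjust to your own directory name):
python server.py --model vicuna-13b-GPTQ-4bit-128g --model_type llama --wbits 4 --groupsize 128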
Follow the steps here: https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#installation
This failed with the following error :( RuntimeError: The current installed version of g++ (12.2.1) is greater than the maximum required version by CUDA 11.7. Please make sure to use an adequate version of g++ (>=6.0.0, <12.0).
Ah, I'm guessing Arch has a newer g++ available than Ubuntu does, for example. I would install g++ manually, staying on 11.x, then try again.
https://github.com/oobabooga/text-generation-webui/issues/850 looks relevant to that
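One way to do that on Arch without downgrading the system compiler is to install a versioned gcc package and point the build at it. This is only a sketch; the gcc11 package and g++-11 binary names, and NVCC_PREPEND_FLAGS support in your CUDA version, are assumptions to verify on your system:
sudo pacman -S gcc11
export CC=/usr/bin/gcc-11 CXX=/usr/bin/g++-11          # compilers the torch build check will look at
export NVCC_PREPEND_FLAGS='-ccbin /usr/bin/g++-11'     # host compiler for nvcc itself
python setup_cuda.py install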
Follow the steps here: https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#installation
This failed with the following error :( RuntimeError: The current installed version of g++ (12.2.1) is greater than the maximum required version by CUDA 11.7. Please make sure to use an adequate version of g++ (>=6.0.0, <12.0).
I had the same issue on Fedora 37. To fix it I did the following:
conda install -c conda-forge gxx
If that doesn't work, try:
conda install gcc_linux-64==11.2.0
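After installing, reactivate the environment and check which compiler the build will actually pick up. A rough check, assuming the conda compiler packages set $CXX in their activation scripts (which they normally do) and that the env is named textgen as in the logs above:
conda deactivate && conda activate textgen   # rerun activation so the compiler package exports take effect
echo $CXX && $CXX --version                  # should now point at the environment's g++ 11.x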
Same error on Windows 11.
Got this working on Arch. Here are the steps:
- git clone https://github.com/oobabooga/text-generation-webui.git
- sudo pacman -S rocm-hip-sdk python-tqdm
- cd text-generation-webui
- python -m venv --system-site-packages venv
- export PATH=/opt/rocm/bin:$PATH
- export HSA_OVERRIDE_GFX_VERSION=10.3.0 HCC_AMDGPU_TARGET=gfx1030
- source venv/bin/activate
- pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2
- mkdir repositories && cd repositories
- git clone https://github.com/agrocylo/bitsandbytes-rocm
- cd bitsandbytes-rocm
- make hip
- python setup.py install
- cd ..
- git clone https://github.com/WapaMario63/GPTQ-for-LLaMa-ROCm GPTQ-for-LLaMa
- cd GPTQ-for-LLaMa
- python setup_rocm.py install
- cd ../..
- python download-model.py anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g
- rm models/anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g/gpt-x-alpaca-13b-native-4bit-128g.pt (needed or it will just spam random numbers)
- pip install -r requirements.txt
- python server.py --wbits 4 --groupsize 128
That should do it. I just did this with a fresh install, so it should not be missing any steps. If you are using an Nvidia card the steps are mostly the same, but you will install the CUDA toolkit instead of rocm-hip-sdk and use the standard GPTQ-for-LLaMa and bitsandbytes repos; instructions for those are in this repo's wiki.
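For reference, the Nvidia-equivalent GPTQ step would look roughly like this, following the wiki linked earlier in the thread (it uses the cuda branch of oobabooga's GPTQ-for-LLaMa fork; verify against the wiki before running):
mkdir -p repositories && cd repositories
git clone https://github.com/oobabooga/GPTQ-for-LLaMa -b cuda
cd GPTQ-for-LLaMa
python setup_cuda.py install
cd ../..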
Running python3 setup_cuda.py install failed with the error: command '/usr/bin/nvcc' failed with exit code 1
EDIT: This seems to be a problem with mismatched CUDA and nvcc versions (a quick check for that is sketched after the log below). Fixed by reinstalling Linux and installing the CUDA toolkit with nvcc using this script: https://gist.github.com/X-TRON404/e9cab789041ef03bcba13da1d5176e28
(You probably don't need to reinstall Linux; I just did it out of frustration and found the script afterwards. Running that script should work, as it will delete all previously installed drivers for you.)
Full output:
running install
/home/ass/miniconda3/envs/textgen/lib/python3.10/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
/home/ass/miniconda3/envs/textgen/lib/python3.10/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running bdist_egg
running egg_info
writing quant_cuda.egg-info/PKG-INFO
writing dependency_links to quant_cuda.egg-info/dependency_links.txt
writing top-level names to quant_cuda.egg-info/top_level.txt
/home/ass/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'quant_cuda.egg-info/SOURCES.txt'
writing manifest file 'quant_cuda.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_ext
/home/ass/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py:388: UserWarning: The detected CUDA version (11.5) has a minor version mismatch with the version that was used to compile PyTorch (11.7). Most likely this shouldn't be a problem.
warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
building 'quant_cuda' extension
gcc -pthread -B /home/ass/miniconda3/envs/textgen/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/ass/miniconda3/envs/textgen/include -fPIC -O2 -isystem /home/ass/miniconda3/envs/textgen/include -fPIC -I/home/ass/.local/lib/python3.10/site-packages/torch/include -I/home/ass/.local/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/ass/.local/lib/python3.10/site-packages/torch/include/TH -I/home/ass/.local/lib/python3.10/site-packages/torch/include/THC -I/home/ass/miniconda3/envs/textgen/include/python3.10 -c quant_cuda.cpp -o build/temp.linux-x86_64-cpython-310/quant_cuda.o -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/usr/bin/nvcc -I/home/ass/.local/lib/python3.10/site-packages/torch/include -I/home/ass/.local/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/ass/.local/lib/python3.10/site-packages/torch/include/TH -I/home/ass/.local/lib/python3.10/site-packages/torch/include/THC -I/home/ass/miniconda3/envs/textgen/include/python3.10 -c quant_cuda_kernel.cu -o build/temp.linux-x86_64-cpython-310/quant_cuda_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++17
/home/ass/.local/lib/python3.10/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=size_t, one_sided=false, <unnamed>=0]"
/home/ass/.local/lib/python3.10/site-packages/torch/include/c10/core/TensorImpl.h(77): here
/home/ass/.local/lib/python3.10/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator==(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=std::size_t, one_sided=true, <unnamed>=0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, <unnamed>>::operator!=(const c10::detail::integer_iterator<I, one_sided, <unnamed>> &) const [with I=std::size_t, one_sided=true, <unnamed>=0]"
/home/ass/.local/lib/python3.10/site-packages/torch/include/ATen/core/qualified_name.h(73): here
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
435 | function(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
530 | operator=(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
error: command '/usr/bin/nvcc' failed with exit code 1
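A quick way to check for the CUDA/nvcc mismatch mentioned in the edit above (a sketch; compare the two reported versions, as the warning in the log does):
nvcc --version                                        # CUDA version of the system toolkit
python -c "import torch; print(torch.version.cuda)"   # CUDA version PyTorch was built against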
When facing the original problem, I somehow missed that the GPTQ-for-LLaMa directory needs to be inside the repositories dir, and had GPTQ-for-LLaMa placed in the root of text-generation-webui, which caused the problem. Make sure the hierarchy of directories goes like this: text-generation-webui/repositories/GPTQ-for-LLaMa, and not like this: text-generation-webui/GPTQ-for-LLaMa.
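A quick sanity check for the layout, run from the text-generation-webui root (a sketch; llama_inference_offload.py is the file the failing import is looking for):
ls repositories/GPTQ-for-LLaMa/llama_inference_offload.py
If that path doesn't exist, GPTQ_loader.py will fail with exactly this ModuleNotFoundError.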
Relevant source line is here.
Hope this helps!
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.