Message "CUDA extension not installed" but CUDA 12 is installed on Windows

Open masbicudo opened this issue 11 months ago • 6 comments

Describe the bug

I am trying to use the multimodal model wojtab_llava-13b-v0-4bit-128g on Windows using CUDA. (Further developments on this issue are in my comments below.)

Is there an existing issue for this?

  • [X] I have searched the existing issues
  • https://github.com/oobabooga/text-generation-webui/issues/1289: this one is similar but the solutions are not working

Reproduction

I used the installer start_windows.bat. Then I restarted it with the option --multimodal-pipeline llava-13b. I downloaded the model wojtab/llava-13b-v0-4bit-128g and loaded it with GPTQ-for-LLaMa using:

wbits=4
groupsize=128
model_type=llama

After clicking the Load button, the model appears to load without problems, but the AI does not give any responses.

I noticed that the message CUDA extension not installed appears twice in the console output. Also, sending requests to the AI results in an error: NameError: name 'quant_cuda' is not defined.

Screenshot

No response

Logs

23:12:42-780091 INFO     Starting Text generation web UI
23:12:42-783600 INFO     Loading the extension "multimodal"
23:12:42-787386 INFO     Loading the extension "gallery"
23:12:42-999049 INFO     LLaVA - Loading CLIP from openai/clip-vit-large-patch14 as torch.float32 on cuda:0...
23:12:44-963047 INFO     LLaVA - Loading projector from liuhaotian/LLaVA-13b-delta-v0 as torch.float32 on cuda:0...
23:12:45-158792 INFO     LLaVA supporting models loaded, took 2.16 seconds
23:12:45-160295 INFO     Multimodal: loaded pipeline llava-13b from pipelines/llava (LLaVA_v0_13B_Pipeline)

Running on local URL:  http://127.0.0.1:7860

23:13:10-235869 INFO     Loading "wojtab_llava-13b-v0-4bit-128g"
CUDA extension not installed.
CUDA extension not installed.
23:13:10-264514 INFO     Found the following quantized model: models\wojtab_llava-13b-v0-4bit-128g\llava-13b-v0-4bit-128g.safetensors
C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\transformers\modeling_utils.py:4193: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
  warnings.warn(
23:13:21-655309 INFO     LOADER: "GPTQ-for-LLaMa"
23:13:21-658307 INFO     TRUNCATION LENGTH: 2048
23:13:21-659305 INFO     INSTRUCTION TEMPLATE: "LLaVA"
23:13:21-660305 INFO     Loaded the model in 11.42 seconds.
Traceback (most recent call last):
  File "C:\Tools\text-generation-webui-oobabooga\modules\callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Tools\text-generation-webui-oobabooga\modules\text_generation.py", line 397, in generate_with_callback
    shared.model.generate(**kwargs)
  File "C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 1592, in generate
    return self.sample(
           ^^^^^^^^^^^^
  File "C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 2696, in sample
    outputs = self(
              ^^^^^
  File "C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\transformers\models\llama\modeling_llama.py", line 1176, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\transformers\models\llama\modeling_llama.py", line 1019, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\transformers\models\llama\modeling_llama.py", line 740, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\transformers\models\llama\modeling_llama.py", line 639, in forward
    query_states = self.q_proj(hidden_states)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\gptq_for_llama\gptq_old\quant.py", line 426, in forward
    quant_cuda.vecquant4matmul(x, self.qweight, y, self.scales, self.qzeros, self.groupsize)
    ^^^^^^^^^^
NameError: name 'quant_cuda' is not defined
Output generated in 0.59 seconds (0.00 tokens/s, 0 tokens, context 71, seed 226787340)

System Info

GPU: RTX 3060, 6GB VRAM
CPU: i7 11th gen, 64GB RAM
OS: Windows 11
CUDA: 12

masbicudo avatar Mar 18 '24 03:03 masbicudo

What I have found so far is an import that is failing in the quant.py file:

import quant_cuda_faster as quant_cuda
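
That import explains both symptoms: GPTQ-for-LLaMa presumably guards it with a try/except that only prints a warning, so when the compiled extension cannot be loaded the name quant_cuda is never bound and the later kernel call raises the NameError from the log. A minimal sketch of that pattern (assumed, not copied from the repository; the helper name is illustrative):

# Sketch of the guarded import presumably used in gptq_for_llama's quant.py.
# If the compiled extension fails to import, only a warning is printed, so
# `quant_cuda` stays undefined and any later use raises NameError.
try:
    import quant_cuda_faster as quant_cuda  # compiled CUDA kernels
except ImportError:
    print("CUDA extension not installed.")

def matmul4bit(x, qweight, y, scales, qzeros, groupsize):
    # NameError: name 'quant_cuda' is not defined, when the import above failed
    return quant_cuda.vecquant4matmul(x, qweight, y, scales, qzeros, groupsize)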

I then created a test file containing the import to debug. First, I got the error:

Exception has occurred: ImportError
DLL load failed while importing quant_cuda_faster: The specified module could not be found.

Then I found out that, on Windows, the directories containing an extension module's dependent DLLs have to be added explicitly to the DLL search path. I added env\Lib\site-packages\torch\lib to that list, since the code of quant_cuda_faster depends on torch:

import os
# Make torch's DLLs findable by the extension module before importing it
os.add_dll_directory(r"C:\text-generation-webui\installer_files\env\Lib\site-packages\torch\lib")
import quant_cuda_faster as quant_cuda

The error now changes to:

Exception has occurred: ImportError
DLL load failed while importing quant_cuda_faster: The specified procedure could not be found.

I have no idea how to fix this, since the error gives no indication of which procedure is missing. I can only suppose that it is a torch version conflict, where quant_cuda_faster was built against a version of torch different from the one that is installed.
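
One way to test that hypothesis is to compare the torch build actually installed with the version string of the prebuilt GPTQ wheel; this is only a sketch, and the distribution name gptq_for_llama is taken from the traceback path, not verified:

import importlib.metadata as md
import torch

# The torch build that is really in the environment (version + CUDA it targets)
print("torch:", torch.__version__, "built for CUDA", torch.version.cuda)

# The wheel's own version string is the best available hint about which torch
# it was compiled against (the distribution name is an assumption)
try:
    print("gptq_for_llama:", md.version("gptq_for_llama"))
except md.PackageNotFoundError:
    print("gptq_for_llama not found under that distribution name")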

masbicudo avatar Mar 19 '24 03:03 masbicudo

I have the same issue here

HamedEmine avatar Mar 19 '24 03:03 HamedEmine

I executed python -m torch.utils.collect_env from within cmd_windows.bat, with the result below. It shows that CUDA is available.

PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Home Single Language
GCC version: (x86_64-posix-seh-rev0, Built by MinGW-Builds project) 13.2.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.11.8 | packaged by Anaconda, Inc. | (main, Feb 26 2024, 21:34:05) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: True
CUDA runtime version: 12.4.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU
Nvidia driver version: 551.76
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=2304
DeviceID=CPU0
Family=198
L2CacheSize=10240
L2CacheSpeed=
Manufacturer=GenuineIntel
MaxClockSpeed=2304
Name=11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
ProcessorType=3
Revision=

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.1+cu121
[pip3] torchaudio==2.2.1+cu121
[pip3] torchvision==0.17.1+cu121
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] torch                     2.2.1+cu121              pypi_0    pypi
[conda] torchaudio                2.2.1+cu121              pypi_0    pypi
[conda] torchvision               0.17.1+cu121             pypi_0    pypi
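
Since the output above already reports Is CUDA available: True, PyTorch's own CUDA runtime appears to be fine; a minimal direct check (not part of the original report) would isolate the problem to the compiled quant_cuda extension rather than to CUDA itself:

import torch

print(torch.__version__, torch.version.cuda)         # 2.2.1+cu121, 12.1 per collect_env
print("CUDA available:", torch.cuda.is_available())  # True per collect_env
if torch.cuda.is_available():
    x = torch.ones(3, device="cuda")
    print((x + x).sum().item())                      # expected 6.0 if the GPU path works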

masbicudo avatar Mar 20 '24 20:03 masbicudo

Hello, I was able to resolve this by using "ExLlamav2_HF" as the loader instead of "GPTQ-for-LLaMa". Make sure to click Save Settings so it uses that loader the next time it launches.

HamedEmine avatar Mar 27 '24 21:03 HamedEmine

Hello, I was able to resolve this by using "ExLlamav2_HF" as the loader instead of "GPTQ-for-LLaMa". Make sure to click Save Settings so it uses that loader the next time it launches.

@HamedEmine Yeah, I could also load this model using ExLlamav2_HF. Thanks for pointing me to this solution. It can answer text questions, but unfortunately it raised an error when I tried to send it a picture. My intention is to use the multimodal extension; I was following the instructions on the multimodal extension page, which is why I was trying to use the wojtab_llava-13b-v0-4bit-128g model.

This is the error when I try to input images (AttributeError: 'Exllamav2HF' object has no attribute 'model'):

Traceback (most recent call last):
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\gradio\queueing.py", line 501, in call_prediction
    output = await route_utils.call_process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\gradio\route_utils.py", line 258, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\gradio\blocks.py", line 1684, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\gradio\blocks.py", line 1262, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\gradio\utils.py", line 574, in async_iteration
    return await iterator.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\gradio\utils.py", line 567, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\anyio\to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\anyio\_backends\_asyncio.py", line 2144, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\anyio\_backends\_asyncio.py", line 851, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\gradio\utils.py", line 550, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\gradio\utils.py", line 733, in gen_wrapper
    response = next(iterator)
               ^^^^^^^^^^^^^^
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\modules\chat.py", line 414, in generate_chat_reply_wrapper
    for i, history in enumerate(generate_chat_reply(text, state, regenerate, _continue, loading_message=True, for_ui=True)):
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\modules\chat.py", line 382, in generate_chat_reply
    for history in chatbot_wrapper(text, state, regenerate=regenerate, _continue=_continue, loading_message=loading_message, for_ui=for_ui):
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\modules\chat.py", line 325, in chatbot_wrapper
    for j, reply in enumerate(generate_reply(prompt, state, stopping_strings=stopping_strings, is_chat=True, for_ui=for_ui)):
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\modules\text_generation.py", line 33, in generate_reply
    for result in _generate_reply(*args, **kwargs):
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\modules\text_generation.py", line 85, in _generate_reply
    for reply in generate_func(question, original_question, seed, state, stopping_strings, is_chat=is_chat):
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\modules\text_generation.py", line 324, in generate_reply_HF
    question, input_ids, inputs_embeds = apply_extensions('tokenizer', state, question, input_ids, None)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\modules\extensions.py", line 231, in apply_extensions
    return EXTENSION_MAP[typ](*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\modules\extensions.py", line 134, in _apply_tokenizer_extensions
    prompt, input_ids, input_embeds = getattr(extension, function_name)(state, prompt, input_ids, input_embeds)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\extensions\multimodal\script.py", line 90, in tokenizer_modifier
    prompt, input_ids, input_embeds, total_embedded = multimodal_embedder.forward(prompt, state, params)
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\extensions\multimodal\multimodal_embedder.py", line 172, in forward
    prompt_parts = self._embed(prompt_parts)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\extensions\multimodal\multimodal_embedder.py", line 154, in _embed
    parts[i].embedding = self.pipeline.embed_tokens(part.input_ids)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\extensions\multimodal\pipelines\instructblip-pipeline\instructblip_pipeline.py", line 42, in embed_tokens
    return shared.model.model.embed_tokens(input_ids).to(shared.model.device, dtype=shared.model.dtype)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Sync\nb-avell\Tools\text-generation-webui-oobabooga\installer_files\env\Lib\site-packages\torch\nn\modules\module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Exllamav2HF' object has no attribute 'model'
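
The last frame shows the cause: the multimodal pipeline's embed_tokens helper calls shared.model.model.embed_tokens(...), assuming a Transformers-style model that exposes its inner decoder as .model, and the Exllamav2HF wrapper used by this loader evidently does not provide that attribute. A minimal sketch of the mismatch (class names and sizes are illustrative, not the real implementations):

import torch.nn as nn

class TransformersStyleModel(nn.Module):
    # What the pipeline expects: an outer model exposing its decoder as .model
    def __init__(self):
        super().__init__()
        self.model = nn.Module()
        self.model.embed_tokens = nn.Embedding(32000, 4096)

class ThinWrapper(nn.Module):
    # Stand-in for a wrapper that keeps its backend private and has no .model
    pass

TransformersStyleModel().model.embed_tokens  # works
try:
    ThinWrapper().model                      # same failure mode as in the log
except AttributeError as e:
    print(e)  # 'ThinWrapper' object has no attribute 'model'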

masbicudo avatar Mar 28 '24 01:03 masbicudo

I was able to load the model and use the multimodal extension with the ExLlamav2 loader.

masbicudo avatar Mar 28 '24 01:03 masbicudo

This issue has been closed due to inactivity for 2 months. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

github-actions[bot] avatar May 27 '24 23:05 github-actions[bot]