text-generation-webui
Unexpected behavior: Model loads on single GPU but fails on dual GPU setup
Describe the bug
I'm experiencing an unexpected behavior when trying to load the following model:
- Model name: Mistral-Large-Instruct-2407-IMat-GGUF
- Quantization: Q6_K
- Size: 100.59GB
When using a single GPU (the RTX 3080 with 12GB VRAM), the software successfully loads 40 layers of the model, relying on oversubscription and virtual memory management. However, when both GPUs are present (36GB VRAM combined), loading the same 40 layers fails with an out-of-memory error.
Issue Details:
- Dual GPU attempt:
When trying to load 40 layers of the model across both GPUs, I receive the following error:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
[...]
llm_load_tensors: ggml ctx size = 1.12 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 30321.38 MiB on device 1: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model
16:55:00-419587 ERROR Failed to load the model.
I also tried setting the tensor split to (100, 0), but no luck (a sketch of this attempt as a direct llama-cpp-python call appears after the logs below):
llm_load_tensors: ggml ctx size = 0.74 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 43316.25 MiB on device 0: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model
16:42:23-312737 ERROR Failed to load the model.
- Single GPU success: Surprisingly, loading the model on a single GPU (the RTX 3080 enabled, the RTX 3090 disabled) succeeds. Here's the relevant output:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.74 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloaded 40/89 layers to GPU
llm_load_tensors: CPU buffer size = 22780.36 MiB
llm_load_tensors: CPU buffer size = 22741.03 MiB
llm_load_tensors: CPU buffer size = 6773.11 MiB
llm_load_tensors: CPU buffer size = 1142.02 MiB
llm_load_tensors: CUDA0 buffer size = **43316.25 MiB**
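For reference, the tensor-split attempt above corresponds roughly to the following direct llama-cpp-python call. This is a minimal sketch: the model path, n-gpu-layers, n_ctx, and the (100, 0) split are taken from this report, while main_gpu and the exact ratio values are assumptions about how the webui passes the setting through.

```python
from llama_cpp import Llama

# Minimal sketch of the failing dual-GPU load, assuming llama-cpp-python is the
# backend (the traceback in the Logs section goes through llama_cpp_cuda.Llama).
model = Llama(
    model_path=r"models\Mistral-Large-Instruct-2407-IMat-GGUF\Mistral-Large-Instruct-2407.Q6_K-00001-of-00005.gguf",
    n_gpu_layers=40,          # same as the webui's n-gpu-layers setting
    n_ctx=3072,               # same as the webui's n_ctx setting
    tensor_split=[1.0, 0.0],  # the (100, 0) split: everything on device 0
    main_gpu=0,               # device 0 is the RTX 3090 in the logs above (assumption)
)
```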
Expected Behavior:
The model should successfully load 40 layers on the dual-GPU setup, which has more combined VRAM (36GB) than the single RTX 3080 (12GB).
Questions:
- Why does the model fail to load 40 layers on the dual GPU setup but succeed on a single, less powerful GPU?
- Is there a configuration or setting that needs to be adjusted for optimal multi-GPU usage?
- Could this be related to memory management, CUDA optimization, or framework-specific issues?
Any insights or suggestions for troubleshooting this issue would be greatly appreciated.
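To help narrow down the memory-management question, it may be worth printing free vs. total VRAM per device immediately before the load is attempted. A minimal sketch, assuming PyTorch is importable in the webui's environment:

```python
import torch

# Print free/total VRAM for each visible CUDA device before attempting the load.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    name = torch.cuda.get_device_name(i)
    print(f"Device {i} ({name}): {free / 2**20:.0f} MiB free / {total / 2**20:.0f} MiB total")
```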
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
1. Load the following model in a dual-GPU setup:
   https://huggingface.co/legraphista/Mistral-Large-Instruct-2407-IMat-GGUF/tree/main/Mistral-Large-Instruct-2407.Q6_K
   with the following settings:
   - Model loader: llama.cpp
   - n-gpu-layers: 40
   - n_ctx: 3072
2. The error should happen.
3. Now go to Device Manager and disable one of the GPUs. Repeat step 1, and the issue does not happen.
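As an alternative to disabling a GPU in Device Manager (step 3), hiding one device from CUDA before the webui starts should reproduce the single-GPU case. A minimal sketch; the index assumes the device ordering shown in the logs above, where device 1 is the RTX 3080:

```python
import os

# Hide the RTX 3090 so that only the RTX 3080 (device 1 in the logs above)
# is visible to CUDA. This must run before llama.cpp / torch initialize CUDA;
# the console equivalent is `set CUDA_VISIBLE_DEVICES=1` before launching.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
```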
Screenshot
No response
Logs
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size = 0.74 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 43316.25 MiB on device 0: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model
16:58:19-702735 ERROR Failed to load the model.
Traceback (most recent call last):
File "C:\AI\text-generation-webui-1.10.1\modules\ui_model_menu.py", line 231, in load_model_wrapper
shared.model, shared.tokenizer = load_model(selected_model, loader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\text-generation-webui-1.10.1\modules\models.py", line 93, in load_model
output = load_func_map[loader](model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\text-generation-webui-1.10.1\modules\models.py", line 274, in llamacpp_loader
model, tokenizer = LlamaCppModel.from_pretrained(model_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\text-generation-webui-1.10.1\modules\llamacpp_model.py", line 85, in from_pretrained
result.model = Llama(**params)
^^^^^^^^^^^^^^^
File "C:\AI\text-generation-webui-1.10.1\installer_files\env\Lib\site-packages\llama_cpp_cuda\llama.py", line 372, in __init__
_LlamaModel(
File "C:\AI\text-generation-webui-1.10.1\installer_files\env\Lib\site-packages\llama_cpp_cuda\_internals.py", line 55, in __init__
raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: models\Mistral-Large-Instruct-2407-IMat-GGUF\Mistral-Large-Instruct-2407.Q6_K-00001-of-00005.gguf
Exception ignored in: <function Llama.__del__ at 0x0000011D87D0A7A0>
Traceback (most recent call last):
File "C:\AI\text-generation-webui-1.10.1\installer_files\env\Lib\site-packages\llama_cpp_cuda\llama.py", line 2089, in __del__
if self._lora_adapter is not None:
^^^^^^^^^^^^^^^^^^
AttributeError: 'Llama' object has no attribute '_lora_adapter'
Exception ignored in: <function LlamaCppModel.__del__ at 0x0000011D89202840>
Traceback (most recent call last):
File "C:\AI\text-generation-webui-1.10.1\modules\llamacpp_model.py", line 33, in __del__
del self.model
^^^^^^^^^^
AttributeError: 'LlamaCppModel' object has no attribute 'model'
System Info
Windows 11
- RTX 3090 (24GB VRAM)
- RTX 3080 (12GB VRAM)
- 96GB System RAM