text-generation-webui
Unexpected behavior: Model loads on single GPU but fails on dual GPU setup
Describe the bug
I'm experiencing an unexpected behavior when trying to load the following model:
- Model name: Mistral-Large-Instruct-2407-IMat-GGUF
- Quantization: Q6_K
- Size: 100.59GB
When using a single GPU (the RTX 3080 with 12GB VRAM), the software successfully loads 40 layers of the model, relying on oversubscription and virtual memory management. However, when both GPUs are present (36GB VRAM combined), loading the same 40 layers fails with an out-of-memory error.
Issue Details:
- Dual GPU attempt:
When trying to load 40 layers of the model across both GPUs, I receive the following error:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
[...]
llm_load_tensors: ggml ctx size = 1.12 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 30321.38 MiB on device 1: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model
16:55:00-419587 ERROR Failed to load the model.
I also tried setting the tensor split to (100, 0), but no luck (a sketch of this attempt as a direct llama-cpp-python call appears after the logs below):
llm_load_tensors: ggml ctx size = 0.74 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 43316.25 MiB on device 0: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model
16:42:23-312737 ERROR Failed to load the model.
- Single GPU success: Surprisingly, loading the model on a single GPU (the RTX 3080 enabled, the RTX 3090 disabled) succeeds. Here's the relevant output:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.74 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloaded 40/89 layers to GPU
llm_load_tensors: CPU buffer size = 22780.36 MiB
llm_load_tensors: CPU buffer size = 22741.03 MiB
llm_load_tensors: CPU buffer size = 6773.11 MiB
llm_load_tensors: CPU buffer size = 1142.02 MiB
llm_load_tensors: CUDA0 buffer size = **43316.25 MiB**
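For reference, the tensor-split attempt above corresponds roughly to the following direct llama-cpp-python call. This is a minimal sketch: the model path, n-gpu-layers, n_ctx, and the (100, 0) split are taken from this report, while main_gpu and the exact ratio values are assumptions about how the webui passes the setting through.

```python
from llama_cpp import Llama

# Minimal sketch of the failing dual-GPU load, assuming llama-cpp-python is the
# backend (the traceback in the Logs section goes through llama_cpp_cuda.Llama).
model = Llama(
    model_path=r"models\Mistral-Large-Instruct-2407-IMat-GGUF\Mistral-Large-Instruct-2407.Q6_K-00001-of-00005.gguf",
    n_gpu_layers=40,          # same as the webui's n-gpu-layers setting
    n_ctx=3072,               # same as the webui's n_ctx setting
    tensor_split=[1.0, 0.0],  # the (100, 0) split: everything on device 0
    main_gpu=0,               # device 0 is the RTX 3090 in the logs above (assumption)
)
```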
Expected Behavior:
The model should successfully load 40 layers on the dual-GPU setup, which has more combined VRAM (36GB) than the single RTX 3080 (12GB).
Questions:
- Why does the model fail to load 40 layers on the dual GPU setup but succeed on a single, less powerful GPU?
- Is there a configuration or setting that needs to be adjusted for optimal multi-GPU usage?
- Could this be related to memory management, CUDA optimization, or framework-specific issues?
Any insights or suggestions for troubleshooting this issue would be greatly appreciated.
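To help narrow down the memory-management question, it may be worth printing free vs. total VRAM per device immediately before the load is attempted. A minimal sketch, assuming PyTorch is importable in the webui's environment:

```python
import torch

# Print free/total VRAM for each visible CUDA device before attempting the load.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    name = torch.cuda.get_device_name(i)
    print(f"Device {i} ({name}): {free / 2**20:.0f} MiB free / {total / 2**20:.0f} MiB total")
```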
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
1. Load the following model in a dual-GPU setup:
   https://huggingface.co/legraphista/Mistral-Large-Instruct-2407-IMat-GGUF/tree/main/Mistral-Large-Instruct-2407.Q6_K
   with the following settings:
   - Model loader: llama.cpp
   - n-gpu-layers: 40
   - n_ctx: 3072
2. The error should happen.
3. Now go to Device Manager and disable one of the GPUs. Repeat step 1, and the issue does not happen.
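As an alternative to disabling a GPU in Device Manager (step 3), hiding one device from CUDA before the webui starts should reproduce the single-GPU case. A minimal sketch; the index assumes the device ordering shown in the logs above, where device 1 is the RTX 3080:

```python
import os

# Hide the RTX 3090 so that only the RTX 3080 (device 1 in the logs above)
# is visible to CUDA. This must run before llama.cpp / torch initialize CUDA;
# the console equivalent is `set CUDA_VISIBLE_DEVICES=1` before launching.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
```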
Screenshot
No response
Logs
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size = 0.74 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 43316.25 MiB on device 0: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model
16:58:19-702735 ERROR Failed to load the model.
Traceback (most recent call last):
File "C:\AI\text-generation-webui-1.10.1\modules\ui_model_menu.py", line 231, in load_model_wrapper
shared.model, shared.tokenizer = load_model(selected_model, loader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\text-generation-webui-1.10.1\modules\models.py", line 93, in load_model
output = load_func_map[loader](model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\text-generation-webui-1.10.1\modules\models.py", line 274, in llamacpp_loader
model, tokenizer = LlamaCppModel.from_pretrained(model_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\AI\text-generation-webui-1.10.1\modules\llamacpp_model.py", line 85, in from_pretrained
result.model = Llama(**params)
^^^^^^^^^^^^^^^
File "C:\AI\text-generation-webui-1.10.1\installer_files\env\Lib\site-packages\llama_cpp_cuda\llama.py", line 372, in __init__
_LlamaModel(
File "C:\AI\text-generation-webui-1.10.1\installer_files\env\Lib\site-packages\llama_cpp_cuda\_internals.py", line 55, in __init__
raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: models\Mistral-Large-Instruct-2407-IMat-GGUF\Mistral-Large-Instruct-2407.Q6_K-00001-of-00005.gguf
Exception ignored in: <function Llama.__del__ at 0x0000011D87D0A7A0>
Traceback (most recent call last):
File "C:\AI\text-generation-webui-1.10.1\installer_files\env\Lib\site-packages\llama_cpp_cuda\llama.py", line 2089, in __del__
if self._lora_adapter is not None:
^^^^^^^^^^^^^^^^^^
AttributeError: 'Llama' object has no attribute '_lora_adapter'
Exception ignored in: <function LlamaCppModel.__del__ at 0x0000011D89202840>
Traceback (most recent call last):
File "C:\AI\text-generation-webui-1.10.1\modules\llamacpp_model.py", line 33, in __del__
del self.model
^^^^^^^^^^
AttributeError: 'LlamaCppModel' object has no attribute 'model'
System Info
Windows 11
- RTX 3090 (24GB VRAM)
- RTX 3080 (12GB VRAM)
- 96GB System RAM