CUDA Memory Allocation Failure and mlock Memory Lock Issue in llama-cpp-python
I am experiencing issues while trying to launch the DeepSeek-V3 671B model (Q2_K_L quantized) on 4 x A100 (80 GB) GPUs. The model fails to load with the following errors:
- CUDA memory allocation failure:
  `ggml_backend_cuda_buffer_type_alloc_buffer: allocating 204800.00 MiB on device 0: cudaMalloc failed: out of memory`
  `llama_kv_cache_init: failed to allocate buffer for kv cache`
  `llama_init_from_model: llama_kv_cache_init() failed for self-attention cache`
  Despite having sufficient GPU memory (4 x A100 with 80 GB each), the model fails to allocate the required buffers. Note that the allocator tries to place the entire 204800 MiB (200 GiB) KV-cache buffer on device 0 alone.
- Failed to create llama context:
  `ValueError: Failed to create llama_context`
  The model fails to initialize, so the `llama_context` cannot be created.
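For context on the first error: 204800 MiB is about 200 GiB, more than two of the four 80 GB cards combined, so the `cudaMalloc` on device 0 alone could never succeed. Below is a rough sketch of how KV-cache size scales with context length for a plain multi-head-attention transformer; DeepSeek-V3 uses Multi-head Latent Attention, so its real cache layout differs, and the dimensions in the example are made up for illustration.

```python
# Back-of-envelope KV-cache size for a plain multi-head-attention transformer.
# DeepSeek-V3 uses Multi-head Latent Attention (MLA), so its actual cache
# layout differs; the figures here are illustrative only.
def kv_cache_mib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, bytes_per_elem: int = 2) -> float:
    """Size in MiB of the K and V tensors for every layer at full context."""
    # 2 tensors (K and V) per layer, each n_kv_heads * head_dim * n_ctx elements
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 2**20

# Example with made-up dimensions: a 32-layer model with 8 KV heads of
# dimension 128 at 4096 context in fp16 needs 512 MiB of KV cache.
print(kv_cache_mib(32, 8, 128, 4096))  # 512.0
```

The takeaway is that KV-cache size grows linearly with context length, which is why lowering `n_ctx` is usually the first lever to pull when this allocation fails.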
Hardware and Environment:
- Model: deepseek-v3 671B Q2_K_L quantized version
- GPUs: 4 x A100 80GB
- CUDA Version: 12.2
- System Memory: 503 GiB
- Python Version: 3.11
What I Have Tried:
- Ensured that all GPUs have enough memory available using nvidia-smi.
- Reduced the batch size and used the quantized model to minimize memory usage.
- Checked for any running processes that could occupy GPU memory and killed unnecessary processes.
- Verified that the system has sufficient available memory and swap space.
Request:
Please advise on any additional configuration or memory optimizations that could resolve this issue, or whether there are known compatibility problems with large models like DeepSeek-V3 on multiple A100 GPUs.
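In case it helps, this is roughly the launch configuration I would expect to need. The parameter names come from the llama-cpp-python `Llama` constructor; the model path, context size, and split ratios are placeholders I made up, not values from my current setup:

```python
# Hypothetical launch settings for spreading a large GGUF model across 4 GPUs
# with llama-cpp-python. All values below are placeholders for illustration.
llama_kwargs = dict(
    model_path="DeepSeek-V3-Q2_K_L.gguf",   # placeholder path
    n_gpu_layers=-1,                        # offload every layer to GPU
    tensor_split=[0.25, 0.25, 0.25, 0.25],  # spread weights evenly over 4 GPUs
    n_ctx=8192,           # start small; a large context inflates the KV cache
    offload_kqv=False,    # keep the KV cache in host RAM instead of VRAM
    use_mlock=False,      # avoid hitting mlock (locked-memory) limits
)

# from llama_cpp import Llama   # requires a CUDA-enabled build
# llm = Llama(**llama_kwargs)
```

With `offload_kqv=False` the KV cache would come out of the 503 GiB of system memory rather than any single GPU, at the cost of slower attention; `tensor_split` only controls where the weights go.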
```
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 204800.00 MiB on device 0: cudaMalloc failed: out of memory
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_init_from_model: llama_kv_cache_init() failed for self-attention cache
2025-02-24 15:41:42,552 xinference.core.worker 711329 ERROR Failed to load model deepseek-v3-0
Traceback (most recent call last):
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/worker.py", line 926, in launch_builtin_model
    await model_ref.load()
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
             ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
           ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/model.py", line 464, in load
    self._model.load()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/model/llm/llama_cpp/core.py", line 144, in load
    self._llm = Llama(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/llama.py", line 393, in __init__
    internals.LlamaContext(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/_internals.py", line 255, in __init__
    raise ValueError("Failed to create llama_context")
    ^^^^^^^^^^^^^^^^^
ValueError: [address=0.0.0.0:33277, pid=711347] Failed to create llama_context
2025-02-24 15:41:43,116 xinference.core.worker 711329 ERROR [request afd74ed2-f282-11ef-8afd-6cb3117bb150] Leave launch_builtin_model, error: [address=0.0.0.0:33277, pid=711347] Failed to create llama_context, elapsed time: 44 s
Traceback (most recent call last):
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/utils.py", line 93, in wrapped
    ret = await func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/worker.py", line 926, in launch_builtin_model
    await model_ref.load()
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
             ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
           ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/model.py", line 464, in load
    self._model.load()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/model/llm/llama_cpp/core.py", line 144, in load
    self._llm = Llama(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/llama.py", line 393, in __init__
    internals.LlamaContext(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/_internals.py", line 255, in __init__
    raise ValueError("Failed to create llama_context")
    ^^^^^^^^^^^^^^^^^
ValueError: [address=0.0.0.0:33277, pid=711347] Failed to create llama_context
2025-02-24 15:41:43,133 xinference.api.restful_api 711193 ERROR [address=0.0.0.0:33277, pid=711347] Failed to create llama_context
Traceback (most recent call last):
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/api/restful_api.py", line 1002, in launch_model
    model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
             ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
           ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/supervisor.py", line 1190, in launch_builtin_model
    await _launch_model()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/supervisor.py", line 1125, in _launch_model
    subpool_address = await _launch_one_model(
                      ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/supervisor.py", line 1083, in _launch_one_model
    subpool_address = await worker_ref.launch_builtin_model(
                      ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
           ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
             ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
           ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/utils.py", line 93, in wrapped
    ret = await func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/worker.py", line 926, in launch_builtin_model
    await model_ref.load()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
           ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
             ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
           ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/model.py", line 464, in load
    self._model.load()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/model/llm/llama_cpp/core.py", line 144, in load
    self._llm = Llama(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/llama.py", line 393, in __init__
    internals.LlamaContext(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/_internals.py", line 255, in __init__
    raise ValueError("Failed to create llama_context")
    ^^^^^^^^^^^^^^^^^
ValueError: [address=0.0.0.0:33277, pid=711347] Failed to create llama_context
```