
CUDA Memory Allocation Failure and mlock Memory Lock Issue in llama-cpp-python

Open caiyuanhangDicp opened this issue 9 months ago • 0 comments

I am unable to launch the deepseek-v3 model (671B parameters, Q2_K_L quantization) on 4 x A100 (80GB) GPUs. The model fails to load, and I receive the following errors:

  1. CUDA Memory Allocation Failure:

    ggml_backend_cuda_buffer_type_alloc_buffer: allocating 204800.00 MiB on device 0: cudaMalloc failed: out of memory
    llama_kv_cache_init: failed to allocate buffer for kv cache
    llama_init_from_model: llama_kv_cache_init() failed for self-attention cache
    

    Each A100 provides 80GB (320GB total across the four GPUs), yet the log shows a single allocation of 204800 MiB (~200 GiB) being attempted on device 0 alone, which far exceeds any one GPU's capacity. The KV cache does not appear to be split across devices.

  2. Failed to Create Llama Context:

    ValueError: Failed to create llama_context
    

    The model fails to initialize properly, resulting in a failure to create the llama_context.
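For context on why the KV-cache allocation is so large: for standard multi-head attention it grows linearly with context length, layer count, and KV head dimensions. A rough estimator is sketched below. Note that DeepSeek-V3 uses MLA, which compresses the KV cache, so the layer/head values here are illustrative assumptions only, not the model's actual cache layout:

```python
def kv_cache_mib(n_layers: int, n_ctx: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in MiB for standard multi-head attention.

    Two tensors (K and V) of shape [n_ctx, n_kv_heads, head_dim] are kept
    per layer; bytes_per_elem=2 assumes an f16 cache.
    """
    total_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
    return total_bytes / (1024 ** 2)

# Hypothetical numbers (61 layers, 128 heads of dim 128) at the model's
# full 163840-token context:
print(kv_cache_mib(61, 163840, 128, 128))  # → 624640.0 MiB (~610 GiB)

# Halving n_ctx halves the cache:
print(kv_cache_mib(61, 81920, 128, 128))   # → 312320.0 MiB
```

The point of the sketch is that capping `n_ctx` at load time is usually the single biggest lever on this allocation.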

Hardware and Environment:

  • Model: deepseek-v3 671B Q2_K_L quantized version
  • GPUs: 4 x A100 80GB
  • CUDA Version: 12.2
  • System Memory: 503 GiB
  • Python Version: 3.11

What I Have Tried:

  • Ensured that all GPUs have enough memory available using nvidia-smi.
  • Reduced the batch size and used the quantized model to minimize memory usage.
  • Checked for any running processes that could occupy GPU memory and killed unnecessary processes.
  • Verified that the system has sufficient available memory and swap space.
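The steps above free up memory but do not change how llama-cpp-python sizes or places the allocation. The `Llama` constructor knobs that most directly shrink or spread the failing buffers can be sketched as follows; the values are illustrative assumptions for this setup, not verified fixes:

```python
# Hypothetical launch settings for llama_cpp.Llama; adjust to your setup.
llama_kwargs = {
    "model_path": "deepseek-v3-q2_k_l.gguf",   # placeholder path
    "n_ctx": 8192,             # cap context: the KV cache grows linearly with n_ctx
    "n_gpu_layers": -1,        # offload all layers to GPU
    "split_mode": 1,           # 1 = LLAMA_SPLIT_MODE_LAYER in current llama.cpp headers
    "tensor_split": [0.25, 0.25, 0.25, 0.25],  # spread weights across the 4 GPUs
    "use_mlock": False,        # avoid locking hundreds of GB of host pages
    "n_batch": 256,            # smaller batch shrinks the compute buffers
}
# The model would then be created with Llama(**llama_kwargs).
```

`use_mlock=False` is relevant to the mlock part of this issue: locking a ~200 GB model file into 503 GiB of RAM can itself fail if `ulimit -l` is restrictive.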

Request:

Please advise on any additional configuration or memory optimizations that could resolve this issue, or whether there are known compatibility problems with large models such as deepseek-v3 on multiple A100 GPUs.

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 204800.00 MiB on device 0: cudaMalloc failed: out of memory
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_init_from_model: llama_kv_cache_init() failed for self-attention cache
2025-02-24 15:41:42,552 xinference.core.worker 711329 ERROR    Failed to load model deepseek-v3-0
Traceback (most recent call last):
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/worker.py", line 926, in launch_builtin_model
    await model_ref.load()
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/model.py", line 464, in load
    self._model.load()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/model/llm/llama_cpp/core.py", line 144, in load
    self._llm = Llama(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/llama.py", line 393, in __init__
    internals.LlamaContext(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/_internals.py", line 255, in __init__
    raise ValueError("Failed to create llama_context")
    ^^^^^^^^^^^^^^^^^
ValueError: [address=0.0.0.0:33277, pid=711347] Failed to create llama_context
2025-02-24 15:41:43,116 xinference.core.worker 711329 ERROR    [request afd74ed2-f282-11ef-8afd-6cb3117bb150] Leave launch_builtin_model, error: [address=0.0.0.0:33277, pid=711347] Failed to create llama_context, elapsed time: 44 s
Traceback (most recent call last):
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/utils.py", line 93, in wrapped
    ret = await func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/worker.py", line 926, in launch_builtin_model
    await model_ref.load()
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/model.py", line 464, in load
    self._model.load()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/model/llm/llama_cpp/core.py", line 144, in load
    self._llm = Llama(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/llama.py", line 393, in __init__
    internals.LlamaContext(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/_internals.py", line 255, in __init__
    raise ValueError("Failed to create llama_context")
    ^^^^^^^^^^^^^^^^^
ValueError: [address=0.0.0.0:33277, pid=711347] Failed to create llama_context
2025-02-24 15:41:43,133 xinference.api.restful_api 711193 ERROR    [address=0.0.0.0:33277, pid=711347] Failed to create llama_context
Traceback (most recent call last):
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/api/restful_api.py", line 1002, in launch_model
    model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/supervisor.py", line 1190, in launch_builtin_model
    await _launch_model()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/supervisor.py", line 1125, in _launch_model
    subpool_address = await _launch_one_model(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/supervisor.py", line 1083, in _launch_one_model
    subpool_address = await worker_ref.launch_builtin_model(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/utils.py", line 93, in wrapped
    ret = await func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/worker.py", line 926, in launch_builtin_model
    await model_ref.load()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/model.py", line 464, in load
    self._model.load()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/model/llm/llama_cpp/core.py", line 144, in load
    self._llm = Llama(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/llama.py", line 393, in __init__
    internals.LlamaContext(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/_internals.py", line 255, in __init__
    raise ValueError("Failed to create llama_context")
    ^^^^^^^^^^^^^^^^^
ValueError: [address=0.0.0.0:33277, pid=711347] Failed to create llama_context

caiyuanhangDicp, Feb 24 '25 08:02