
Running DeepSeek-Coder-V2-Lite with 16 GB of GPU memory results in an out-of-memory error

Lambda14 opened this issue 9 months ago · 6 comments

Hello, I'm trying to run Tabby with the DeepSeek-Coder-V2-Lite model on Windows using the command: .\tabby.exe serve --model DeepSeek-Coder-V2-Lite --chat-model Qwen2-1.5B-Instruct --device cuda

and I get a memory allocation error: allocating 15712.47 MiB on device 0: cudaMalloc failed: out of memory. This happens on a server with a Tesla P100 GPU.

[screenshot of the out-of-memory error]

However, on another computer with an RTX 3070, the same model runs under Docker, although very slowly.

Why is this happening?

Lambda14 avatar Apr 15 '25 17:04 Lambda14

Hello @Lambda14, I have verified that DeepSeek-Coder-V2-Lite has 16B parameters, so 16 GB of GPU memory is not enough, which is what leads to the out-of-memory error.

This is working as expected; you may want to use a model with fewer parameters.
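
As a rough back-of-the-envelope check (my numbers, assuming the GGUF shipped for this model is quantized at roughly one byte per parameter):

```
~16e9 parameters × ~1 byte/parameter ≈ 16 GB ≈ 15,000+ MiB   (weights alone)
+ KV cache and CUDA compute buffers on top of that
```

This lines up with the 15712.47 MiB allocation that failed in your error message: the weights by themselves essentially fill a 16 GB P100, leaving no room for the KV cache or compute buffers.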

zwpaper avatar Apr 18 '25 02:04 zwpaper

@zwpaper Hello, thanks for the reply. Still, why does this error not occur when running the model via Docker?

Lambda14 avatar Apr 18 '25 06:04 Lambda14

OK, now I tried running the Qwen2.5-Coder-3B model without a chat model. It starts successfully, but when I send a request, I get an error:

2025-04-19T11:35:24.909336Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:124: llama-server <completion> exited with status code -1073740791, args: `Command { std: "C:\\Users\\leo\\Desktop\\tabby_x86_64-windows-msvc-cuda124\\llama-server.exe" "-m" "C:\\Users\\leo\\.tabby\\models\\TabbyML\\Qwen2.5-Coder-3B\\ggml\\model-00001-of-00001.gguf" "--cont-batching" "--port" "30889" "-np" "1" "--ctx-size" "4096" "-ngl" "9999", kill_on_drop: true }`
Recent llama-cpp errors:

load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors:        CUDA0 model buffer size =  3127.61 MiB
load_tensors:   CPU_Mapped model buffer size =   315.30 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   144.00 MiB
llama_init_from_model: KV self size  =  144.00 MiB, K (f16):   72.00 MiB, V (f16):   72.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_init_from_model:      CUDA0 compute buffer size =   300.75 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    12.01 MiB
llama_init_from_model: graph nodes  = 1266
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 4096
main: model loaded
main: chat template, chat_template: {%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0]['role'] == 'system' %}
        {{- messages[0]['content'] }}
    {%- else %}
        {{- 'You are a helpful assistant.' }}
    {%- endif %}
    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0]['role'] == 'system' %}
        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
    {%- else %}
        {{- '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- for message in messages %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {{- '<|im_start|>' + message.role }}
        {%- if message.content %}
            {{- '\n' + message.content }}
        {%- endif %}
        {%- for tool_call in message.tool_calls %}
            {%- if tool_call.function is defined %}
                {%- set tool_call = tool_call.function %}
            {%- endif %}
            {{- '\n<tool_call>\n{"name": "' }}
            {{- tool_call.name }}
            {{- '", "arguments": ' }}
            {{- tool_call.arguments | tojson }}
            {{- '}\n</tool_call>' }}
        {%- endfor %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- message.content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://127.0.0.1:30889 - starting the main loop
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 4
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 4, n_tokens = 4, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 4, n_tokens = 4
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:73: CUDA error
ggml_cuda_compute_forward: MUL_MAT failed
CUDA error: unspecified launch failure
2025-04-19T11:35:24.947975Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:164: Attempting to restart the llama-server...

Here is my nvidia-smi output with Tabby running:

[nvidia-smi screenshot]

Lambda14 avatar Apr 19 '25 11:04 Lambda14

The MUL_MAT failed error is likely caused by an issue in upstream llama.cpp: https://github.com/ggml-org/llama.cpp/issues/13252
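
To confirm it is independent of Tabby, one option (a rough sketch reusing the arguments from your supervisor log, not a verified fix) is to launch the bundled llama-server directly and compare a full GPU-offload run against a CPU-only run:

```shell
# Run the bundled llama-server with the same arguments Tabby passed
# (copied from the supervisor log above)
C:\Users\leo\Desktop\tabby_x86_64-windows-msvc-cuda124\llama-server.exe `
  -m C:\Users\leo\.tabby\models\TabbyML\Qwen2.5-Coder-3B\ggml\model-00001-of-00001.gguf `
  --cont-batching --port 30889 -np 1 --ctx-size 4096 -ngl 9999

# In another terminal, trigger a completion request; the crash in your log
# happened at inference time, not at model load time
curl.exe http://127.0.0.1:30889/completion -d '{"prompt": "def add(a, b):", "n_predict": 16}'

# Repeat with GPU offload disabled (-ngl 0) to take the CUDA backend out of the picture
C:\Users\leo\Desktop\tabby_x86_64-windows-msvc-cuda124\llama-server.exe `
  -m C:\Users\leo\.tabby\models\TabbyML\Qwen2.5-Coder-3B\ggml\model-00001-of-00001.gguf `
  --cont-batching --port 30889 -np 1 --ctx-size 4096 -ngl 0
```

If the -ngl 0 run answers the request without crashing, the failure is specific to the CUDA path on the P100, which would point at the upstream issue above.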

zwpaper avatar May 07 '25 04:05 zwpaper

Hi @Lambda14, what is the model of your CPU? We have encountered a failure that was caused by a CPU lacking support for certain AVX instructions.
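
In case it helps, a quick way to check (a sketch assuming Sysinternals Coreinfo is available on the Windows machine, or a Linux/WSL shell):

```shell
# Windows: Sysinternals Coreinfo prints the supported instruction sets
# (look for AVX / AVX2 in the feature list)
coreinfo.exe

# Linux / WSL: list the AVX-family flags the kernel reports for this CPU
grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u
```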

zwpaper avatar May 15 '25 09:05 zwpaper

@zwpaper Hello, the CPU is an E5-2690 v3.

Lambda14 avatar Jun 16 '25 10:06 Lambda14