
Crashes with DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL (maybe others)

Open · markussiebert opened this issue · 4 comments

Describe the bug

llama.cpp crashes with an assertion failure and repeated longjmp stack-frame errors when processing a large context with the DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL model on an Intel A770 GPU. The crash occurs in the SYCL scaled dot-product attention (SDP) XMX kernel after prompt processing completes.

How to reproduce

Steps to reproduce the error:

  1. Launch the DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL model using llama.cpp portable
  2. Load a large context (the exact size is not stated, but the log below suggests at least 185 tokens)
  3. Process the prompt through completion
  4. The crash occurs after prompt processing finishes, with an assertion failure at sdp_xmx_kernel.cpp:439

Screenshots

slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 185, n_tokens = 185, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 185, n_tokens = 185
llama-server-bin: /home/runner/_work/llm.cpp/llm.cpp/llm.cpp/bigdl-core-xe/llama_backend/sdp_xmx_kernel.cpp:439: auto ggml_sycl_op_sdp_xmx_casual(fp16 *, fp16 *, fp16 *, fp16 *, fp16 *, float *, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, float *, float, sycl::queue &)::(anonymous class)::operator()() const: Assertion `false' failed.
*** longjmp causes uninitialized stack frame ***: terminated
[Multiple longjmp errors repeated]

Environment information

GPU: Intel A770
Model: DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL
Runtime: llama.cpp portable

Additional context

The A770 performs well with llama.cpp portable in general, but this specific crash occurs consistently when processing large contexts with this model as the initial prompt. Interestingly, if the conversation is started with a simple message (e.g., "hi") and the large context is provided as a subsequent prompt, the model handles it without issues. This suggests the crash may be related to initial memory allocation, context initialization, or the KV cache setup when processing large contexts from a cold start. The error appears to be related to the SYCL implementation of the scaled dot-product attention kernel for Intel XMX (Xe Matrix Extensions), possibly involving uninitialized state when handling large initial contexts.
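The warm-up workaround described above can be sketched as follows. This is a minimal illustration, not part of the bug report: it assumes the server exposes the standard OpenAI-compatible /v1/chat/completions endpoint on port 5802 (the port shown in the logs below), and the model name and placeholder context are hypothetical.

```python
import json
import urllib.request

# Endpoint and model name are assumptions based on the logs in this thread.
BASE_URL = "http://127.0.0.1:5802/v1/chat/completions"

def chat_payload(messages):
    # Build an OpenAI-compatible chat-completions request body.
    return {"model": "jan-nano-4b-Q8_0", "messages": messages, "stream": False}

def send(payload):
    # POST the payload to the llama.cpp server and return the parsed response.
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Workaround: first warm the slot up with a trivial prompt...
warmup = chat_payload([{"role": "user", "content": "hi"}])

# ...then send the large context as a follow-up turn in the same conversation,
# which in the reporter's testing avoids the SDP XMX kernel assertion.
large = chat_payload([
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "Hello! How can I assist you today?"},
    {"role": "user", "content": "<large context here>"},
])
```

With the server running, `send(warmup)` followed by `send(large)` reproduces the two-step request order that avoided the crash; sending the large payload first reproduces the failing cold-start case.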

markussiebert avatar Jun 17 '25 19:06 markussiebert

Hi @markussiebert, thank you for the information. We will work on reproducing this error.

cyita avatar Jun 18 '25 02:06 cyita

Hi @markussiebert, could you please provide the output of lspci -nn | grep -Ei 'VGA|DISPLAY'?

cyita avatar Jun 18 '25 10:06 cyita

03:00.0 VGA compatible controller [0300]: Intel Corporation DG2 [Arc A770] [8086:56a0] (rev 08)

Another model: jan-nano-4b-Q8_0

main: server is listening on http://127.0.0.1:5802 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /health 127.0.0.1 200
[INFO] <jan-nano-4b-Q8_0> Health check passed on http://localhost:5802/health
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 128
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 128, n_tokens = 128, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 128, n_tokens = 128
srv  params_from_: Chat format: Content-only
llama-server-bin: /home/runner/_work/llm.cpp/llm.cpp/llm.cpp/bigdl-core-xe/llama_backend/sdp_xmx_kernel.cpp:439: auto ggml_sycl_op_sdp_xmx_casual(fp16 *, fp16 *, fp16 *, fp16 *, fp16 *, float *, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, size_t, float *, float, sycl::queue &)::(anonymous class)::operator()() const: Assertion `false' failed.
*** longjmp causes uninitialized stack frame ***: terminated
*** longjmp causes uninitialized stack frame ***: terminated
*** longjmp causes uninitialized stack frame ***: terminated
*** longjmp causes uninitialized stack frame ***: terminated
*** longjmp causes uninitialized stack frame ***: terminated
*** longjmp causes uninitialized stack frame ***: terminated
*** longjmp causes uninitialized stack frame ***: terminated

The same prompt, sent after a simple "hi" (answered with "Hello! How can I assist you today? 😊"), produces these logs:

main: server is listening on http://127.0.0.1:5802 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /health 127.0.0.1 200
[INFO] <jan-nano-4b-Q8_0> Health check passed on http://localhost:5802/health
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 9
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 9, n_tokens = 9, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 9, n_tokens = 9
slot      release: id  0 | task 0 | stop processing: n_past = 127, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =     149.40 ms /     9 tokens (   16.60 ms per token,    60.24 tokens per second)
       eval time =    5907.96 ms /   119 tokens (   49.65 ms per token,    20.14 tokens per second)
      total time =    6057.36 ms /   128 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
[INFO] Request 192.168.10.30 "POST /v1/chat/completions HTTP/1.1" 200 29513 "Python/3.11 aiohttp/3.11.11" 21.317153975s
[INFO] Request 192.168.10.30 "GET /v1/models HTTP/1.1" 200 325 "Python/3.11 aiohttp/3.11.11" 34.579µs
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 120 | processing task
slot update_slots: id  0 | task 120 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 228
slot update_slots: id  0 | task 120 | kv cache rm [3, end)
slot update_slots: id  0 | task 120 | prompt processing progress, n_past = 228, n_tokens = 225, progress = 0.986842
slot update_slots: id  0 | task 120 | prompt done, n_past = 228, n_tokens = 225
[INFO] Request 192.168.10.30 "GET /v1/models HTTP/1.1" 200 325 "Python/3.11 aiohttp/3.11.11" 42.165µs
[INFO] Request 192.168.10.30 "GET /v1/models HTTP/1.1" 200 325 "Python/3.11 aiohttp/3.11.11" 43.146µs
slot      release: id  0 | task 120 | stop processing: n_past = 428, truncated = 0
slot print_timing: id  0 | task 120 |
prompt eval time =     318.67 ms /   225 tokens (    1.42 ms per token,   706.06 tokens per second)
       eval time =   10080.44 ms /   201 tokens (   50.15 ms per token,    19.94 tokens per second)
      total time =   10399.11 ms /   426 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
[INFO] Request 192.168.10.30 "POST /v1/chat/completions HTTP/1.1" 200 1553 "Python/3.11 aiohttp/3.11.11" 10.403606692s
[INFO] Request 192.168.10.30 "GET /v1/models HTTP/1.1" 200 325 "Python/3.11 aiohttp/3.11.11" 39.151µs
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 322 | processing task
slot update_slots: id  0 | task 322 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 298
slot update_slots: id  0 | task 322 | kv cache rm [6, end)
slot update_slots: id  0 | task 322 | prompt processing progress, n_past = 298, n_tokens = 292, progress = 0.979866
slot update_slots: id  0 | task 322 | prompt done, n_past = 298, n_tokens = 292
[INFO] Request 192.168.10.30 "GET /v1/models HTTP/1.1" 200 325 "Python/3.11 aiohttp/3.11.11" 50.779µs
srv  params_from_: Chat format: Content-only
slot      release: id  0 | task 322 | stop processing: n_past = 1089, truncated = 0
slot print_timing: id  0 | task 322 |
prompt eval time =     324.00 ms /   292 tokens (    1.11 ms per token,   901.23 tokens per second)
       eval time =   40354.81 ms /   792 tokens (   50.95 ms per token,    19.63 tokens per second)
      total time =   40678.81 ms /  1084 tokens
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
[INFO] Request 192.168.10.30 "POST /v1/chat/completions HTTP/1.1" 200 3920 "Python/3.11 aiohttp/3.11.11" 40.682091653s
slot launch_slot_: id  0 | task 993 | processing task
slot update_slots: id  0 | task 993 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 898
slot update_slots: id  0 | task 993 | kv cache rm [3, end)
slot update_slots: id  0 | task 993 | prompt processing progress, n_past = 898, n_tokens = 895, progress = 0.996659
slot update_slots: id  0 | task 993 | prompt done, n_past = 898, n_tokens = 895
[INFO] Request 192.168.10.30 "GET /v1/models HTTP/1.1" 200 325 "Python/3.11 aiohttp/3.11.11" 80.119µs
srv  params_from_: Chat format: Content-only
slot      release: id  0 | task 993 | stop processing: n_past = 1886, truncated = 0
slot print_timing: id  0 | task 993 |
prompt eval time =     379.18 ms /   895 tokens (    0.42 ms per token,  2360.34 tokens per second)
       eval time =   51371.58 ms /   989 tokens (   51.94 ms per token,    19.25 tokens per second)
      total time =   51750.76 ms /  1884 tokens
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
[INFO] Request 192.168.10.30 "POST /v1/chat/completions HTTP/1.1" 200 242992 "Python/3.11 aiohttp/3.11.11" 58.074999176s
slot launch_slot_: id  0 | task 1117 | processing task
slot update_slots: id  0 | task 1117 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 189
slot update_slots: id  0 | task 1117 | kv cache rm [3, end)
slot update_slots: id  0 | task 1117 | prompt processing progress, n_past = 189, n_tokens = 186, progress = 0.984127
slot update_slots: id  0 | task 1117 | prompt done, n_past = 189, n_tokens = 186
[INFO] Request 192.168.10.30 "GET /v1/models HTTP/1.1" 200 325 "Python/3.11 aiohttp/3.11.11" 50.512µs
srv  params_from_: Chat format: Content-only
slot      release: id  0 | task 1117 | stop processing: n_past = 430, truncated = 0
slot print_timing: id  0 | task 1117 |
prompt eval time =     267.31 ms /   186 tokens (    1.44 ms per token,   695.82 tokens per second)
       eval time =   12192.91 ms /   242 tokens (   50.38 ms per token,    19.85 tokens per second)
      total time =   12460.22 ms /   428 tokens
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
slot launch_slot_: id  0 | task 2107 | processing task
slot update_slots: id  0 | task 2107 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 1202
[INFO] Request 192.168.10.30 "POST /v1/chat/completions HTTP/1.1" 200 1737 "Python/3.11 aiohttp/3.11.11" 1m4.19400526s
slot update_slots: id  0 | task 2107 | kv cache rm [6, end)
slot update_slots: id  0 | task 2107 | prompt processing progress, n_past = 1202, n_tokens = 1196, progress = 0.995008
slot update_slots: id  0 | task 2107 | prompt done, n_past = 1202, n_tokens = 1196
slot      release: id  0 | task 2107 | stop processing: n_past = 1694, truncated = 0
slot print_timing: id  0 | task 2107 |
prompt eval time =     473.18 ms /  1196 tokens (    0.40 ms per token,  2527.61 tokens per second)
       eval time =   25477.64 ms /   493 tokens (   51.68 ms per token,    19.35 tokens per second)
      total time =   25950.82 ms /  1689 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
[INFO] Request 192.168.10.30 "POST /v1/chat/completions HTTP/1.1" 200 2914 "Python/3.11 aiohttp/3.11.11" 38.389451498s

markussiebert avatar Jun 18 '25 22:06 markussiebert

Hi @markussiebert, sadly we cannot reproduce this error on our A770 machine with https://github.com/ipex-llm/ipex-llm/releases/download/v2.3.0-nightly/llama-cpp-ipex-llm-2.3.0b20250612-ubuntu-core.tgz. Below are my log and command.

export ONEAPI_DEVICE_SELECTOR=level_zero:0
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
./llama-server -m /mnt/disk1/models/DeepSeek-R1-0528-Qwen3-8B-UD-Q8_K_XL.gguf -ngl 99 -t 8 -c 1024 -np 1 --no-mmap
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 1024, n_keep = 0, n_prompt_tokens = 128
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 128, n_tokens = 128, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 128, n_tokens = 128
slot      release: id  0 | task 0 | stop processing: n_past = 427, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =     268.75 ms /   128 tokens (    2.10 ms per token,   476.27 tokens per second)
       eval time =    9881.13 ms /   300 tokens (   32.94 ms per token,    30.36 tokens per second)
      total time =   10149.88 ms /   428 tokens
srv  update_slots: all slots are idle

rnwang04 avatar Jun 19 '25 08:06 rnwang04

I've confirmed that the issue was with my previous mainboard and CPU combination (H97 and Xeon E3-1230V3). After upgrading to a 10th generation Intel i5 and a B650 mainboard, the performance has significantly improved and everything is functioning smoothly.

markussiebert avatar Jun 24 '25 07:06 markussiebert