[Bug] Qwen2 Eagle serving error
Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [ ] 5. Please use English, otherwise it will be closed.
Describe the bug
I launch a Qwen2 7B model endpoint with EAGLE speculative decoding:
python -m sglang.launch_server --model-path /demo-huabei2/common-models/DeepSeek-R1-Distill-Qwen-7B --disable-radix-cache --host 127.0.0.1 --port 1235 --tensor-parallel-size 1 --speculative-algo EAGLE --speculative-draft /demo-huabei2/common-models/EAGLE/EAGLE-Qwen2-7B-Instruct --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --disable-cuda-graph
After the server launches successfully, I run a benchmark client:
python3 -m sglang.bench_serving --backend sglang --dataset-name random --request-rate 4 --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --random-input 1024 --random-output 1024 --port 1235 --num-prompts 100 --random-range-ratio 1.0
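For completeness, a single request against the server's native /generate endpoint can confirm the endpoint responds before starting the full benchmark. This is only a sanity-check sketch (the prompt and sampling parameters are illustrative), reusing the host and port from the launch command above:

```bash
# Hypothetical sanity check (not part of the original report): send one request
# to the native /generate endpoint before running the full benchmark.
curl -s http://127.0.0.1:1235/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, my name is", "sampling_params": {"max_new_tokens": 32, "temperature": 0}}'
```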
[2025-02-10 09:02:45 TP0] Decode batch. #running-req: 100, #token: 189024, token usage: 0.19, accept len: 1.17, gen throughput (token/s): 138.60, #queue-req: 0
[2025-02-10 09:02:53 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/managers/scheduler.py", line 1798, in run_scheduler_process
scheduler.event_loop_normal()
File "/root/miniconda3/envs/eagle/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/managers/scheduler.py", line 478, in event_loop_normal
result = self.run_batch(batch)
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/managers/scheduler.py", line 1088, in run_batch
) = self.draft_worker.forward_batch_speculative_generation(batch)
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/speculative/eagle_worker.py", line 105, in forward_batch_speculative_generation
) = self.verify(batch, spec_info)
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/speculative/eagle_worker.py", line 253, in verify
res = spec_info.verify(batch, logits_output)
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/speculative/eagle_utils.py", line 371, in verify
accept_index_cpu = accept_index.tolist()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
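As the error message notes, the stack trace for an asynchronous CUDA failure may point at the wrong line (here, accept_index.tolist() is only where the error surfaces). A minimal debugging sketch, reusing the launch command above, is to set CUDA_LAUNCH_BLOCKING=1 so kernel launches are synchronous and the traceback points at the kernel that actually faults:

```bash
# Re-launch with synchronous CUDA kernel launches (much slower, debugging only)
# so the traceback points at the kernel that triggers the illegal memory access.
CUDA_LAUNCH_BLOCKING=1 python -m sglang.launch_server \
  --model-path /demo-huabei2/common-models/DeepSeek-R1-Distill-Qwen-7B \
  --disable-radix-cache --host 127.0.0.1 --port 1235 --tensor-parallel-size 1 \
  --speculative-algo EAGLE \
  --speculative-draft /demo-huabei2/common-models/EAGLE/EAGLE-Qwen2-7B-Instruct \
  --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 \
  --mem-fraction 0.7 --disable-cuda-graph
```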
Reproduction
python -m sglang.launch_server --model-path /demo-huabei2/common-models/DeepSeek-R1-Distill-Qwen-7B --disable-radix-cache --host 127.0.0.1 --port 1235 --tensor-parallel-size 1 --speculative-algo EAGLE --speculative-draft /demo-huabei2/common-models/EAGLE/EAGLE-Qwen2-7B-Instruct --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --disable-cuda-graph
python3 -m sglang.bench_serving --backend sglang --dataset-name random --request-rate 4 --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --random-input 1024 --random-output 1024 --port 1235 --num-prompts 100 --random-range-ratio 1.0
Environment
INFO 02-10 09:10:04 __init__.py:190] Automatically detected platform cuda.
Python: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA H20
GPU 0,1,2,3 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.161.08
PyTorch: 2.5.1+cu124
sglang: 0.4.2.post4
sgl_kernel: 0.0.3.post3
flashinfer: 0.2.0.post2+cu124torch2.5
triton: 3.1.0
transformers: 4.48.3
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.61.1
tiktoken: 0.8.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    NV18  NV18  NV18  PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   0-47          0              N/A
GPU1  NV18   X    NV18  NV18  NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   0-47          0              N/A
GPU2  NV18  NV18   X    NV18  NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   0-47          0              N/A
GPU3  NV18  NV18  NV18   X    NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   0-47          0              N/A
NIC0  PIX   NODE  NODE  NODE   X    NODE  NODE  NODE  SYS   SYS   SYS   SYS
NIC1  NODE  PIX   NODE  NODE  NODE   X    NODE  NODE  SYS   SYS   SYS   SYS
NIC2  NODE  NODE  PIX   NODE  NODE  NODE   X    NODE  SYS   SYS   SYS   SYS
NIC3  NODE  NODE  NODE  PIX   NODE  NODE  NODE   X    SYS   SYS   SYS   SYS
NIC4  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS    X    NODE  NODE  NODE
NIC5  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   NODE   X    NODE  NODE
NIC6  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE   X    NODE
NIC7  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE  NODE   X
Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_1
NIC1: mlx5_2
NIC2: mlx5_3
NIC3: mlx5_4
NIC4: mlx5_5
NIC5: mlx5_6
NIC6: mlx5_7
NIC7: mlx5_8
ulimit soft: 1048576
I pulled the latest main branch and tried the Triton attention backend:
python -m sglang.launch_server --model-path /demo-huabei2/common-models/DeepSeek-R1-Distill-Qwen-7B --disable-radix-cache --host 127.0.0.1 --port 1235 --tensor-parallel-size 1 --speculative-algo EAGLE --speculative-draft /demo-huabei2/common-models/EAGLE/EAGLE-Qwen2-7B-Instruct --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --disable-cuda-graph **--attention-backend triton**
I still encounter the following error (see https://github.com/sgl-project/sglang/pull/3466):
[2025-02-11 02:25:21 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/managers/scheduler.py", line 1827, in run_scheduler_process
scheduler.event_loop_normal()
File "/root/miniconda3/envs/eagle/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/managers/scheduler.py", line 478, in event_loop_normal
result = self.run_batch(batch)
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/managers/scheduler.py", line 1089, in run_batch
) = self.draft_worker.forward_batch_speculative_generation(batch)
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/speculative/eagle_worker.py", line 111, in forward_batch_speculative_generation
spec_info: EagleVerifyInput = self.draft(batch)
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/speculative/eagle_worker.py", line 194, in draft
ret = EagleVerifyInput.create(
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/speculative/eagle_utils.py", line 194, in create
build_tree_kernel(
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/speculative/build_eagle_tree.py", line 168, in build_tree_kernel
retrive_index = retrive_index[index]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[2025-02-11 02:25:21] Received sigquit from a child proces. It usually means the child failed.
It seems that DeepSeek-R1-Distill-Qwen-7B is based on Qwen2.5, but you are using a draft model trained for Qwen2. Could this be the reason? Have you looked into it?
Qwen2.5 and Qwen2 use the same model architecture.
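One way to check this is to compare the structural fields in the two models' config.json files. Below is a minimal sketch assuming the local paths from the original report and the standard Hugging Face config keys; the EAGLE draft head normally differs in layer count, but hidden size, attention heads, and vocab size should match the target:

```bash
# Print the key structural fields from the target and draft configs for comparison.
for m in /demo-huabei2/common-models/DeepSeek-R1-Distill-Qwen-7B \
         /demo-huabei2/common-models/EAGLE/EAGLE-Qwen2-7B-Instruct; do
  echo "== $m"
  grep -E '"(architectures|hidden_size|num_attention_heads|num_key_value_heads|vocab_size)"' "$m/config.json"
done
```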
I start the service using the following command:
python3 -m sglang.launch_server --model-path /path/to/Qwen2.5-Coder-7B-Instruct --context-length 16384 --tp 1 --speculative-algorithm EAGLE --speculative-draft-model-path /path/to/EAGLE-Qwen2-7B-Instruct --mem-fraction-static 0.5 --cuda-graph-max-bs 8 --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64
The server gets stuck during generation:
[2025-04-01 11:46:50] ERROR: Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
File "uvloop/handles/poll.pyx", line 216, in uvloop.loop.__on_uvpoll_event
File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
File "uvloop/cbhandles.pyx", line 66, in uvloop.loop.Handle._run
File "uvloop/loop.pyx", line 399, in uvloop.loop.Loop._read_from_self
File "uvloop/loop.pyx", line 404, in uvloop.loop.Loop._invoke_signals
File "uvloop/loop.pyx", line 379, in uvloop.loop.Loop._ceval_process_signals
File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 447, in sigquit_handler
kill_process_tree(os.getpid())
File "/sgl-workspace/sglang/python/sglang/srt/utils.py", line 658, in kill_process_tree
sys.exit(0)
SystemExit: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 699, in lifespan
await receive()
File "/usr/local/lib/python3.10/dist-packages/uvicorn/lifespan/on.py", line 137, in receive
return await self.receive_queue.get()
File "/usr/lib/python3.10/asyncio/queues.py", line 159, in get
await getter
asyncio.exceptions.CancelledError
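If the server appears stuck during generation, a stack dump of the remaining sglang processes can show where they are blocked. A minimal sketch using py-spy, a third-party tool that is not part of sglang (the PID placeholder is hypothetical):

```bash
# Dump the Python stacks of the hung sglang processes to see where they block.
pip install py-spy
# Replace <PID> with the server/scheduler process id (e.g. from `ps aux | grep sglang`).
py-spy dump --pid <PID>
```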
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.