[Bug] Qwen2 Eagle serving error
Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [ ] 5. Please use English, otherwise it will be closed.
Describe the bug
I launch a Qwen2 7B model endpoint with EAGLE speculative decoding:
python -m sglang.launch_server --model-path /demo-huabei2/common-models/DeepSeek-R1-Distill-Qwen-7B --disable-radix-cache --host 127.0.0.1 --port 1235 --tensor-parallel-size 1 --speculative-algo EAGLE --speculative-draft /demo-huabei2/common-models/EAGLE/EAGLE-Qwen2-7B-Instruct --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --disable-cuda-graph
After the server launches successfully, I run a benchmark client:
python3 -m sglang.bench_serving --backend sglang --dataset-name random --request-rate 4 --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --random-input 1024 --random-output 1024 --port 1235 --num-prompts 100 --random-range-ratio 1.0
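For completeness, a single request against the server's native /generate endpoint can confirm the endpoint responds before starting the full benchmark. This is only a sanity-check sketch (the prompt and sampling parameters are illustrative), reusing the host and port from the launch command above:

```bash
# Hypothetical sanity check (not part of the original report): send one request
# to the native /generate endpoint before running the full benchmark.
curl -s http://127.0.0.1:1235/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, my name is", "sampling_params": {"max_new_tokens": 32, "temperature": 0}}'
```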
[2025-02-10 09:02:45 TP0] Decode batch. #running-req: 100, #token: 189024, token usage: 0.19, accept len: 1.17, gen throughput (token/s): 138.60, #queue-req: 0
[2025-02-10 09:02:53 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/managers/scheduler.py", line 1798, in run_scheduler_process
scheduler.event_loop_normal()
File "/root/miniconda3/envs/eagle/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/managers/scheduler.py", line 478, in event_loop_normal
result = self.run_batch(batch)
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/managers/scheduler.py", line 1088, in run_batch
) = self.draft_worker.forward_batch_speculative_generation(batch)
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/speculative/eagle_worker.py", line 105, in forward_batch_speculative_generation
) = self.verify(batch, spec_info)
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/speculative/eagle_worker.py", line 253, in verify
res = spec_info.verify(batch, logits_output)
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/speculative/eagle_utils.py", line 371, in verify
accept_index_cpu = accept_index.tolist()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
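As the error message notes, the stack trace for an asynchronous CUDA failure may point at the wrong line (here, accept_index.tolist() is only where the error surfaces). A minimal debugging sketch, reusing the launch command above, is to set CUDA_LAUNCH_BLOCKING=1 so kernel launches are synchronous and the traceback points at the kernel that actually faults:

```bash
# Re-launch with synchronous CUDA kernel launches (much slower, debugging only)
# so the traceback points at the kernel that triggers the illegal memory access.
CUDA_LAUNCH_BLOCKING=1 python -m sglang.launch_server \
  --model-path /demo-huabei2/common-models/DeepSeek-R1-Distill-Qwen-7B \
  --disable-radix-cache --host 127.0.0.1 --port 1235 --tensor-parallel-size 1 \
  --speculative-algo EAGLE \
  --speculative-draft /demo-huabei2/common-models/EAGLE/EAGLE-Qwen2-7B-Instruct \
  --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 \
  --mem-fraction 0.7 --disable-cuda-graph
```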
Reproduction
python -m sglang.launch_server --model-path /demo-huabei2/common-models/DeepSeek-R1-Distill-Qwen-7B --disable-radix-cache --host 127.0.0.1 --port 1235 --tensor-parallel-size 1 --speculative-algo EAGLE --speculative-draft /demo-huabei2/common-models/EAGLE/EAGLE-Qwen2-7B-Instruct --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --disable-cuda-graph
python3 -m sglang.bench_serving --backend sglang --dataset-name random --request-rate 4 --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --random-input 1024 --random-output 1024 --port 1235 --num-prompts 100 --random-range-ratio 1.0
Environment
INFO 02-10 09:10:04 __init__.py:190] Automatically detected platform cuda.
Python: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA H20
GPU 0,1,2,3 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.161.08
PyTorch: 2.5.1+cu124
sglang: 0.4.2.post4
sgl_kernel: 0.0.3.post3
flashinfer: 0.2.0.post2+cu124torch2.5
triton: 3.1.0
transformers: 4.48.3
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.61.1
tiktoken: 0.8.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    NV18  NV18  NV18  PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   0-47          0              N/A
GPU1  NV18   X    NV18  NV18  NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   0-47          0              N/A
GPU2  NV18  NV18   X    NV18  NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   0-47          0              N/A
GPU3  NV18  NV18  NV18   X    NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   0-47          0              N/A
NIC0  PIX   NODE  NODE  NODE   X    NODE  NODE  NODE  SYS   SYS   SYS   SYS
NIC1  NODE  PIX   NODE  NODE  NODE   X    NODE  NODE  SYS   SYS   SYS   SYS
NIC2  NODE  NODE  PIX   NODE  NODE  NODE   X    NODE  SYS   SYS   SYS   SYS
NIC3  NODE  NODE  NODE  PIX   NODE  NODE  NODE   X    SYS   SYS   SYS   SYS
NIC4  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS    X    NODE  NODE  NODE
NIC5  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   NODE   X    NODE  NODE
NIC6  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE   X    NODE
NIC7  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   NODE  NODE  NODE   X
Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_1
NIC1: mlx5_2
NIC2: mlx5_3
NIC3: mlx5_4
NIC4: mlx5_5
NIC5: mlx5_6
NIC6: mlx5_7
NIC7: mlx5_8
ulimit soft: 1048576
I pulled the latest main branch and tried the Triton attention backend:
python -m sglang.launch_server --model-path /demo-huabei2/common-models/DeepSeek-R1-Distill-Qwen-7B --disable-radix-cache --host 127.0.0.1 --port 1235 --tensor-parallel-size 1 --speculative-algo EAGLE --speculative-draft /demo-huabei2/common-models/EAGLE/EAGLE-Qwen2-7B-Instruct --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --disable-cuda-graph **--attention-backend triton**
I still encounter the following error (see https://github.com/sgl-project/sglang/pull/3466):
[2025-02-11 02:25:21 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/managers/scheduler.py", line 1827, in run_scheduler_process
scheduler.event_loop_normal()
File "/root/miniconda3/envs/eagle/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/managers/scheduler.py", line 478, in event_loop_normal
result = self.run_batch(batch)
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/managers/scheduler.py", line 1089, in run_batch
) = self.draft_worker.forward_batch_speculative_generation(batch)
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/speculative/eagle_worker.py", line 111, in forward_batch_speculative_generation
spec_info: EagleVerifyInput = self.draft(batch)
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/speculative/eagle_worker.py", line 194, in draft
ret = EagleVerifyInput.create(
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/speculative/eagle_utils.py", line 194, in create
build_tree_kernel(
File "/demo-huabei2/fjr/code/dpskv3/download/sglang/python/sglang/srt/speculative/build_eagle_tree.py", line 168, in build_tree_kernel
retrive_index = retrive_index[index]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[2025-02-11 02:25:21] Received sigquit from a child proces. It usually means the child failed.
It seems that DeepSeek-R1-Distill-Qwen-7B is based on Qwen2.5, but you are using a draft model trained for Qwen2. Could this be the reason? Have you looked into it?
Qwen2.5 and Qwen2 use the same model architecture.
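One way to check this is to compare the structural fields in the two models' config.json files. Below is a minimal sketch assuming the local paths from the original report and the standard Hugging Face config keys; the EAGLE draft head normally differs in layer count, but hidden size, attention heads, and vocab size should match the target:

```bash
# Print the key structural fields from the target and draft configs for comparison.
for m in /demo-huabei2/common-models/DeepSeek-R1-Distill-Qwen-7B \
         /demo-huabei2/common-models/EAGLE/EAGLE-Qwen2-7B-Instruct; do
  echo "== $m"
  grep -E '"(architectures|hidden_size|num_attention_heads|num_key_value_heads|vocab_size)"' "$m/config.json"
done
```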
I start the service using the following command:
python3 -m sglang.launch_server --model-path /path/to/Qwen2.5-Coder-7B-Instruct --context-length 16384 --tp 1 --speculative-algorithm EAGLE --speculative-draft-model-path /path/to/EAGLE-Qwen2-7B-Instruct --mem-fraction-static 0.5 --cuda-graph-max-bs 8 --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64
The server gets stuck during generation:
[2025-04-01 11:46:50] ERROR: Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
File "uvloop/handles/poll.pyx", line 216, in uvloop.loop.__on_uvpoll_event
File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
File "uvloop/cbhandles.pyx", line 66, in uvloop.loop.Handle._run
File "uvloop/loop.pyx", line 399, in uvloop.loop.Loop._read_from_self
File "uvloop/loop.pyx", line 404, in uvloop.loop.Loop._invoke_signals
File "uvloop/loop.pyx", line 379, in uvloop.loop.Loop._ceval_process_signals
File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 447, in sigquit_handler
kill_process_tree(os.getpid())
File "/sgl-workspace/sglang/python/sglang/srt/utils.py", line 658, in kill_process_tree
sys.exit(0)
SystemExit: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 699, in lifespan
await receive()
File "/usr/local/lib/python3.10/dist-packages/uvicorn/lifespan/on.py", line 137, in receive
return await self.receive_queue.get()
File "/usr/lib/python3.10/asyncio/queues.py", line 159, in get
await getter
asyncio.exceptions.CancelledError
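If the server appears stuck during generation, a stack dump of the remaining sglang processes can show where they are blocked. A minimal sketch using py-spy, a third-party tool that is not part of sglang (the PID placeholder is hypothetical):

```bash
# Dump the Python stacks of the hung sglang processes to see where they block.
pip install py-spy
# Replace <PID> with the server/scheduler process id (e.g. from `ps aux | grep sglang`).
py-spy dump --pid <PID>
```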
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.