EAGLE3 on Qwen2.5-VL / Qwen3-VL shows extremely low accept length (accept_len ≈ 1)
Hi, thanks for your great work on SGLang and SpecForge!
I am trying to test https://huggingface.co/Rayzl/qwen2.5-vl-7b-eagle3-sgl on Qwen2.5-VL using the reference configs from: https://github.com/sgl-project/SpecForge/pull/102 , but the speculative decoding performance is far below expectations.
Below is a detailed report of my setup, logs, and results.
1. My SGLang server command
python -m sglang.launch_server \
--model-path /ch/pretrained_models/Qwen2.5-VL-7B-Instruct \
--speculative-draft-model-path /ch/pretrained_models/qwen2.5-vl-7b-eagle3-sgl \
--speculative-algorithm EAGLE3 \
--speculative-num-steps 4 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 24 \
--trust-remote-code \
--chunked-prefill-size -1 \
--cuda-graph-max-bs 1 \
--tp 1 \
--mem-fraction-static 0.7 \
--host 0.0.0.0 \
--port 8080
Client benchmark:
python run_mmstar.py --host http://0.0.0.0 --port 8080 --parallel 1 --num-questions 50
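For a quick sanity check outside the benchmark script, a single multimodal request against the server's OpenAI-compatible endpoint looks roughly like the sketch below; the model name and image URL are placeholders, and run_mmstar.py is what produces the numbers reported here.

```python
import requests

# Placeholder model name and image URL; adjust to whatever the server reports under /v1/models.
resp = requests.post(
    "http://0.0.0.0:8080/v1/chat/completions",
    json={
        "model": "Qwen2.5-VL-7B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```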
2. Results (Qwen2.5-VL with EAGLE3)
Average Latency: 92.421 s
Average Output throughput: 41.960 token/s
Average Accept length: 1.037
3. SGLang logs (accept length always ≈ 1)
Below are several captured decode logs:
[2025-11-19 15:25:25] Decode batch, #running-req: 1, #token: 353, token usage: 0.04, accept len: 1.02, accept rate: 0.20, cuda graph: True, gen throughput (token/s): 44.13, #queue-req: 0,
[2025-11-19 15:25:26] Decode batch, #running-req: 1, #token: 393, token usage: 0.05, accept len: 1.00, accept rate: 0.20, cuda graph: True, gen throughput (token/s): 43.13, #queue-req: 0,
[2025-11-19 15:25:27] Decode batch, #running-req: 1, #token: 433, token usage: 0.05, accept len: 1.00, accept rate: 0.20, cuda graph: True, gen throughput (token/s): 43.12, #queue-req: 0,
[2025-11-19 15:25:28] Decode batch, #running-req: 1, #token: 474, token usage: 0.06, accept len: 1.02, accept rate: 0.20, cuda graph: True, gen throughput (token/s): 44.20, #queue-req: 0,
[2025-11-19 15:25:29] Decode batch, #running-req: 1, #token: 514, token usage: 0.06, accept len: 1.00, accept rate: 0.20, cuda graph: True, gen throughput (token/s): 43.01, #queue-req: 0,
[2025-11-19 15:25:30] Decode batch, #running-req: 1, #token: 557, token usage: 0.06, accept len: 1.07, accept rate: 0.21, cuda graph: True, gen throughput (token/s): 46.21, #queue-req: 0,
This suggests that the draft model’s predictions are almost always rejected.
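For reference, the accept length in these logs can be averaged with a plain regex over the lines above (nothing SGLang-specific); a minimal sketch:

```python
import re

# Paste the decode log lines here; two of the lines above are abbreviated as an example.
log_text = """
[2025-11-19 15:25:25] Decode batch, ... accept len: 1.02, accept rate: 0.20, ...
[2025-11-19 15:25:26] Decode batch, ... accept len: 1.00, accept rate: 0.20, ...
"""

accept_lens = [float(x) for x in re.findall(r"accept len: ([\d.]+)", log_text)]
print(f"mean accept length over {len(accept_lens)} decode batches: "
      f"{sum(accept_lens) / len(accept_lens):.3f}")
```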
4. Similar behavior on Qwen3-VL
The result is essentially the same: accept_len ≈ 1.
5. However: Llama-3.1-8B + EAGLE3 works correctly
Using the same speculative settings:
- speculative-num-steps=4
- speculative-eagle-topk=6
- speculative-num-draft-tokens=24
with https://huggingface.co/lmsys/sglang-EAGLE-LLaMA3-Instruct-8B on gsm8k I get expected results:
Average Latency: 52.161 s
Average Output throughput: 86.099 token/s
Average Accept length: 2.313
So the EAGLE3 pipeline works normally on Llama models.
6. VLLM results: Qwen2.5-VL EAGLE3 behaves correctly
I also tested Qwen2.5-VL EAGLE3 in vLLM, using configs from https://github.com/vllm-project/vllm/pull/22872
Example command:
vllm serve \
/ch/pretrained_models/Qwen2.5-VL-7B-Instruct \
--port 5580 --host 0.0.0.0 \
--max-num-seqs 128 --dtype bfloat16 --max-model-len=8192 \
--no-enable-prefix-caching --trust-remote-code -tp 1 \
--speculative-config '{"method": "eagle3", "model": "/ch/pretrained_models/qwen2.5-vl-7b-eagle3-sgl", "prefill_token_shift": false, "num_speculative_tokens": 3, "draft_tensor_parallel_size": 1, "max_model_len": 8192}' \
--num-lookahead-slots=3 \
--gpu-memory-utilization=0.93
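Since the --speculative-config value is a JSON string embedded in the shell command, it is easy to break the quoting; one option is to build it with json.dumps first (the keys below are just the ones from the command above):

```python
import json

spec_config = {
    "method": "eagle3",
    "model": "/ch/pretrained_models/qwen2.5-vl-7b-eagle3-sgl",
    "prefill_token_shift": False,
    "num_speculative_tokens": 3,
    "draft_tensor_parallel_size": 1,
    "max_model_len": 8192,
}
# Paste the printed string after --speculative-config (single-quoted in the shell).
print(json.dumps(spec_config))
```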
Results:
- with EAGLE3: Output token throughput (tok/s) = 135.67
- without EAGLE3: Output token throughput (tok/s) = 97.92
- end-to-end speedup ≈ 1.385× → ✔ expected behavior
Meaning: the Qwen2.5-VL EAGLE3 draft model itself is fine, but SGLang's integration leads to an extremely low accept_len.
7. My question
- Is my configuration missing anything specific for multimodal models?
- Are additional modifications needed beyond PR #8801 to fully support Qwen VL EAGLE3?
Any guidance or hints would be greatly appreciated. Thank you very much for your help!
Thank you for your feedback. I'll test it today.
Hi, it seems this issue is related to the SGLang-side integration. Could you help test this PR: https://github.com/sgl-project/SpecForge/pull/279 ? It lets you evaluate the accept length of Qwen2.5-VL without relying on SGLang. Currently, tree decoding is not available, so the batch size should be set to 1 with top-k = 1, i.e., only a single decoding path.
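For anyone doing that standalone check, below is a minimal sketch of the single-path accept-length measurement (batch size 1, top-k = 1). `target_step` and `draft_step` are hypothetical stand-ins for greedy next-token calls on the verify model and the EAGLE3 draft head; the real EAGLE3 draft also conditions on the target model's hidden states, which this sketch glosses over.

```python
from typing import Callable, List

def mean_accept_length(
    target_step: Callable[[List[int]], int],  # prefix ids -> greedy next token (verify/base model)
    draft_step: Callable[[List[int]], int],   # prefix ids -> greedy next token (draft model)
    prompt_ids: List[int],
    num_steps: int = 4,                       # draft tokens proposed per verification round
    max_new_tokens: int = 64,
) -> float:
    tokens = list(prompt_ids)
    per_round = []
    while len(tokens) - len(prompt_ids) < max_new_tokens:
        # 1) draft a single chain of num_steps tokens (top-k = 1, one decoding path)
        ctx, chain = list(tokens), []
        for _ in range(num_steps):
            t = draft_step(ctx)
            chain.append(t)
            ctx.append(t)
        # 2) verify greedily with the target model; keep the longest matching prefix
        ctx, accepted = list(tokens), 0
        for t in chain:
            if target_step(ctx) != t:
                break
            accepted += 1
            ctx.append(t)
        # 3) the target model always contributes one more (bonus/correction) token
        ctx.append(target_step(ctx))
        tokens = ctx
        # >= 1 per round, matching the "accept len" convention in the logs above,
        # where 1.0 means nothing from the draft chain was accepted
        per_round.append(accepted + 1)
    return sum(per_round) / len(per_round)
```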
I have looked into your result. I think it's a bug in the latest SGLang Qwen2.5-VL EAGLE3 implementation, and I will try to fix it.
I tested with sglang==0.5.4. The result below looks OK:
Created temporary image directory: .cache/mmstar_specforge
Loaded 100 questions.
100%|██████████| 100/100 [00:44<00:00, 2.24it/s]
Warning: 15 predictions could not be extracted.
==================================================
MMSTAR Evaluation Results
==================================================
Number of questions: 100
Average Accuracy: 0.3400 (34.00%)
Average Latency: 44.846 s
Average Output throughput: 147.304 token/s
Average Accept length: 2.195
==================================================
Deleted temporary directory: .cache/mmstar_specforge
But if I use sglang==0.5.5.post3, the average accept length is about 1.0, which is not correct.
Thank you so much for your reply!
You are right: everything worked fine after I installed "sglang[all]==0.5.4", but later versions seem to have bugs in the EAGLE3 path for VLMs.
When I test Qwen3-VL with Eagle3 (https://huggingface.co/collections/taobao-mnn/eagle3) using sglang, I have to downgrade my environment to:
sglang[all] == 0.5.3
and make simple modifications to the code in sglang/python/sglang/srt/models/qwen3_vl.py, as described in https://github.com/sgl-project/sglang/pull/8801
On sglang[all]==0.5.4, Qwen2.5-VL + Eagle3 works correctly, but Qwen3-VL + Eagle3 does not; the evaluation only works if I roll back to 0.5.3.
I’m not sure what the underlying version differences are for VLM + Eagle3 support, but this version dependency makes it very difficult for users to reproduce results unless they already know the exact working combination in advance.
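To make that dependency explicit, a minimal guard like the one below can be run before any benchmark; the pins are only the combinations reported in this thread, not an official compatibility matrix:

```python
from importlib.metadata import version

# Combinations reported to work in this thread (not an official compatibility matrix).
KNOWN_GOOD = {
    "Qwen2.5-VL + EAGLE3": {"0.5.4"},
    "Qwen3-VL + EAGLE3 (with PR #8801 changes)": {"0.5.3"},
}

installed = version("sglang")
for setup, good_versions in KNOWN_GOOD.items():
    status = "known good" if installed in good_versions else "not verified in this thread"
    print(f"sglang {installed}: {setup} -> {status}")
```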
SGLang is designing and implementing spec v2, which will handle this issue.
@C3236455482 for qwen2.5-vl, you can try https://github.com/sgl-project/sglang/pull/13904