
EAGLE3 on Qwen2.5-VL / Qwen3-VL shows extremely low accept length (accept_len ≈ 1)

Open C3236455482 opened this issue 1 month ago • 8 comments

Hi, thanks for your great work on SGLang and SpecForge!

I am trying to test https://huggingface.co/Rayzl/qwen2.5-vl-7b-eagle3-sgl on Qwen2.5-VL using the reference configs from: https://github.com/sgl-project/SpecForge/pull/102 , but the speculative decoding performance is far below expectations.

Below is a detailed report of my setup, logs, and results.

1. My SGLang server command

python -m sglang.launch_server \
    --model-path /ch/pretrained_models/Qwen2.5-VL-7B-Instruct \
    --speculative-draft-model-path /ch/pretrained_models/qwen2.5-vl-7b-eagle3-sgl \
    --speculative-algorithm EAGLE3 \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 24 \
    --trust-remote-code \
    --chunked-prefill-size -1 \
    --cuda-graph-max-bs 1 \
    --tp 1 \
    --mem-fraction-static 0.7 \
    --host 0.0.0.0 \
    --port 8080

Client benchmark:

python run_mmstar.py --host http://0.0.0.0 --port 8080 --parallel 1 --num-questions 50
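
For a quick sanity check of the multimodal path, the server can also be queried directly, independent of run_mmstar.py. A minimal sketch in Python (the image URL is a placeholder, and the payload assumes SGLang's OpenAI-compatible /v1/chat/completions vision endpoint):

import requests

# Send one multimodal request to the server started above.
# Placeholder image URL; adjust the model name to whatever the server reports.
payload = {
    "model": "Qwen2.5-VL-7B-Instruct",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }],
    "max_tokens": 64,
}
resp = requests.post("http://0.0.0.0:8080/v1/chat/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])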

2. Results (Qwen2.5-VL with EAGLE3)

Average Latency: 92.421 s
Average Output throughput: 41.960 token/s
Average Accept length: 1.037

3. SGLang logs (accept length always ≈ 1)

Below are several captured decode logs:

[2025-11-19 15:25:25] Decode batch, #running-req: 1, #token: 353, token usage: 0.04, accept len: 1.02, accept rate: 0.20, cuda graph: True, gen throughput (token/s): 44.13, #queue-req: 0, 
[2025-11-19 15:25:26] Decode batch, #running-req: 1, #token: 393, token usage: 0.05, accept len: 1.00, accept rate: 0.20, cuda graph: True, gen throughput (token/s): 43.13, #queue-req: 0, 
[2025-11-19 15:25:27] Decode batch, #running-req: 1, #token: 433, token usage: 0.05, accept len: 1.00, accept rate: 0.20, cuda graph: True, gen throughput (token/s): 43.12, #queue-req: 0, 
[2025-11-19 15:25:28] Decode batch, #running-req: 1, #token: 474, token usage: 0.06, accept len: 1.02, accept rate: 0.20, cuda graph: True, gen throughput (token/s): 44.20, #queue-req: 0, 
[2025-11-19 15:25:29] Decode batch, #running-req: 1, #token: 514, token usage: 0.06, accept len: 1.00, accept rate: 0.20, cuda graph: True, gen throughput (token/s): 43.01, #queue-req: 0, 
[2025-11-19 15:25:30] Decode batch, #running-req: 1, #token: 557, token usage: 0.06, accept len: 1.07, accept rate: 0.21, cuda graph: True, gen throughput (token/s): 46.21, #queue-req: 0, 

This suggests that the draft model’s predictions are almost always rejected.

4. Similar behavior on Qwen3-VL

The result is essentially the same: accept_len ≈ 1.

5. However: Llama-3.1-8B + EAGLE3 works correctly

Using the same speculative settings:

  • speculative-num-steps=4
  • speculative-eagle-topk=6
  • speculative-num-draft-tokens=24

with https://huggingface.co/lmsys/sglang-EAGLE-LLaMA3-Instruct-8B on gsm8k I get expected results:

Average Latency: 52.161 s
Average Output throughput: 86.099 token/s
Average Accept length: 2.313

So the EAGLE3 pipeline works normally on Llama models.
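
For intuition, the two accept lengths alone roughly explain the throughput gap. A back-of-the-envelope sketch (the cost model is deliberately crude, and draft_overhead is a hypothetical placeholder rather than a measured value):

# Crude estimate: tokens committed per verification round, discounted by an
# assumed draft cost relative to one target forward pass. Not a real profile.
def naive_speedup(accept_len: float, draft_overhead: float = 0.25) -> float:
    return accept_len / (1.0 + draft_overhead)

print(f"Qwen2.5-VL   (accept_len 1.037): {naive_speedup(1.037):.2f}x")  # ~0.83x, slower than plain decoding
print(f"Llama-3.1-8B (accept_len 2.313): {naive_speedup(2.313):.2f}x")  # ~1.85x, the expected speedup regime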

6. VLLM results: Qwen2.5-VL EAGLE3 behaves correctly

I also tested Qwen2.5-VL EAGLE3 in vLLM, using configs from https://github.com/vllm-project/vllm/pull/22872

Example command:

vllm serve \
    /ch/pretrained_models/Qwen2.5-VL-7B-Instruct \
    --port 5580 --host 0.0.0.0 \
    --max-num-seqs 128 --dtype bfloat16 --max-model-len=8192  \
    --no-enable-prefix-caching --trust-remote-code -tp 1 \
    --speculative-config '{"method": "eagle3", "model": "/ch/pretrained_models/qwen2.5-vl-7b-eagle3-sgl", "prefill_token_shift": false, "num_speculative_tokens": 3, "draft_tensor_parallel_size": 1, "max_model_len": 8192}' \
    --num-lookahead-slots=3 \
    --gpu-memory-utilization=0.93

Results:

  • with EAGLE3: Output token throughput (tok/s) = 135.67
  • without EAGLE3: Output token throughput (tok/s) = 97.92
  • end-to-end speedup ≈ 1.385× → ✔ expected behavior

Meaning: the Qwen2.5-VL EAGLE3 draft model itself is fine, but SGLang's integration leads to an extremely low accept_len.

7. My question

  • Is my configuration missing anything specific for multimodal models?
  • Are additional modifications needed beyond PR #8801 to fully support Qwen VL EAGLE3?

Any guidance or hints would be greatly appreciated. Thank you very much for your help!

C3236455482 avatar Nov 19 '25 07:11 C3236455482

Thank you for your feedback. I'll test it today.

jiapingW avatar Nov 20 '25 00:11 jiapingW

https://github.com/sgl-project/SpecForge/pull/279 Hi, it seems this issue is related to the SGLang-side integration. Could you help test this PR? You can evaluate the accept length of Qwen2.5-VL without relying on SGLang. Currently, tree decoding is not available, and the batch size should be set to 1, with top-k = 1, i.e., only a single decoding path.
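
For context, a minimal sketch of what measuring the accept length with a single decoding path looks like (illustrative only; target_step/draft_step are hypothetical greedy next-token callables, not SpecForge's actual API, and the loop re-runs the target per token purely for clarity):

# Greedy chain verification with batch size 1 and top-k = 1 (single path):
# the draft proposes num_steps tokens, the target checks them in order, and
# the average of (accepted + 1 bonus token) per round is the accept length.
def measure_accept_length(target_step, draft_step, prompt_ids, num_steps=4, max_new=128):
    ids = list(prompt_ids)
    committed_per_round = []
    while sum(committed_per_round) < max_new:
        ctx, proposal = list(ids), []
        for _ in range(num_steps):            # draft a single chain of tokens
            tok = draft_step(ctx)
            proposal.append(tok)
            ctx.append(tok)
        accepted = 0
        for tok in proposal:                  # verify greedily, stop at the first mismatch
            if target_step(ids) == tok:
                ids.append(tok)
                accepted += 1
            else:
                break
        ids.append(target_step(ids))          # the target's own token always advances by one
        committed_per_round.append(accepted + 1)
    return sum(committed_per_round) / len(committed_per_round)

# Toy usage with dummy "models": the draft always guesses 0, the target cycles 0..4.
toy_target = lambda ids: len(ids) % 5
toy_draft = lambda ids: 0
print(measure_accept_length(toy_target, toy_draft, prompt_ids=[1, 2, 3], max_new=64))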

Lihui-Gu avatar Nov 20 '25 03:11 Lihui-Gu

I have reproduced your result. I think it's a bug in the latest SGLang Qwen2.5-VL EAGLE3 implementation. I will try to fix it.

jiapingW avatar Nov 20 '25 06:11 jiapingW

I tested with sglang==0.5.4. The result below looks OK.

Created temporary image directory: .cache/mmstar_specforge
Loaded 100 questions.
100%|██████████| 100/100 [00:44<00:00,  2.24it/s]
Warning: 15 predictions could not be extracted.

==================================================
MMSTAR Evaluation Results
==================================================
Number of questions: 100
Average Accuracy: 0.3400 (34.00%)
Average Latency: 44.846 s
Average Output throughput: 147.304 token/s
Average Accept length: 2.195
==================================================

Deleted temporary directory: .cache/mmstar_specforge

But if I use sglang==0.5.5.post3, the average accept length is about 1.0, which is not correct.

jiapingW avatar Nov 20 '25 09:11 jiapingW

Thank you so much for your reply!

You are right: everything worked fine after I installed "sglang[all]==0.5.4", but later versions seem to have some bugs in the EAGLE3 method for VLMs.

C3236455482 avatar Nov 20 '25 10:11 C3236455482

When I test Qwen3-VL with EAGLE3 (https://huggingface.co/collections/taobao-mnn/eagle3) using SGLang, I have to downgrade my environment to:

sglang[all] == 0.5.3

and make simple modifications to the code in sglang/python/sglang/srt/models/qwen3_vl.py, as described in https://github.com/sgl-project/sglang/pull/8801

On sglang[all] == 0.5.4, Qwen2.5-VL + EAGLE3 works correctly, but Qwen3-VL + EAGLE3 does not; the evaluation only works if I roll back to 0.5.3.

I’m not sure what the underlying version differences are for VLM + Eagle3 support, but this version dependency makes it very difficult for users to reproduce results unless they already know the exact working combination in advance.

C3236455482 avatar Nov 20 '25 10:11 C3236455482

The SGLang team is designing and implementing spec v2, which will handle this issue.

jiapingW avatar Nov 20 '25 12:11 jiapingW

@C3236455482 for qwen2.5-vl, you can try https://github.com/sgl-project/sglang/pull/13904

Lzhang-hub avatar Nov 25 '25 08:11 Lzhang-hub