
EAGLE3 on Qwen2.5-VL / Qwen3-VL shows extremely low accept length (accept_len ≈ 1)

Open C3236455482 opened this issue 1 month ago • 8 comments

Hi, thanks for your great work on SGLang and SpecForge!

I am trying to test https://huggingface.co/Rayzl/qwen2.5-vl-7b-eagle3-sgl on Qwen2.5-VL using the reference configs from: https://github.com/sgl-project/SpecForge/pull/102 , but the speculative decoding performance is far below expectations.

Below is a detailed report of my setup, logs, and results.

1. My SGLang server command

python -m sglang.launch_server \
    --model-path /ch/pretrained_models/Qwen2.5-VL-7B-Instruct \
    --speculative-draft-model-path /ch/pretrained_models/qwen2.5-vl-7b-eagle3-sgl \
    --speculative-algorithm EAGLE3 \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 24 \
    --trust-remote-code \
    --chunked-prefill-size -1 \
    --cuda-graph-max-bs 1 \
    --tp 1 \
    --mem-fraction-static 0.7 \
    --host 0.0.0.0 \
    --port 8080

Client benchmark:

python run_mmstar.py --host http://0.0.0.0 --port 8080 --parallel 1 --num-questions 50
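
For a quick sanity check of the multimodal path, the server can also be queried directly, independent of run_mmstar.py. A minimal sketch in Python (the image URL is a placeholder, and the payload assumes SGLang's OpenAI-compatible /v1/chat/completions vision endpoint):

import requests

# Send one multimodal request to the server started above.
# Placeholder image URL; adjust the model name to whatever the server reports.
payload = {
    "model": "Qwen2.5-VL-7B-Instruct",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }],
    "max_tokens": 64,
}
resp = requests.post("http://0.0.0.0:8080/v1/chat/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])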

2. Results (Qwen2.5-VL with EAGLE3)

Average Latency: 92.421 s
Average Output throughput: 41.960 token/s
Average Accept length: 1.037

3. SGLang logs (accept length always ≈ 1)

Below are several captured decode logs:

[2025-11-19 15:25:25] Decode batch, #running-req: 1, #token: 353, token usage: 0.04, accept len: 1.02, accept rate: 0.20, cuda graph: True, gen throughput (token/s): 44.13, #queue-req: 0, 
[2025-11-19 15:25:26] Decode batch, #running-req: 1, #token: 393, token usage: 0.05, accept len: 1.00, accept rate: 0.20, cuda graph: True, gen throughput (token/s): 43.13, #queue-req: 0, 
[2025-11-19 15:25:27] Decode batch, #running-req: 1, #token: 433, token usage: 0.05, accept len: 1.00, accept rate: 0.20, cuda graph: True, gen throughput (token/s): 43.12, #queue-req: 0, 
[2025-11-19 15:25:28] Decode batch, #running-req: 1, #token: 474, token usage: 0.06, accept len: 1.02, accept rate: 0.20, cuda graph: True, gen throughput (token/s): 44.20, #queue-req: 0, 
[2025-11-19 15:25:29] Decode batch, #running-req: 1, #token: 514, token usage: 0.06, accept len: 1.00, accept rate: 0.20, cuda graph: True, gen throughput (token/s): 43.01, #queue-req: 0, 
[2025-11-19 15:25:30] Decode batch, #running-req: 1, #token: 557, token usage: 0.06, accept len: 1.07, accept rate: 0.21, cuda graph: True, gen throughput (token/s): 46.21, #queue-req: 0, 

This suggests that the draft model’s predictions are almost always rejected.

4. Similar behavior on Qwen3-VL

The result is essentially the same: accept_len ≈ 1.

5. However: Llama-3.1-8B + EAGLE3 works correctly

Using the same speculative settings:

  • speculative-num-steps=4
  • speculative-eagle-topk=6
  • speculative-num-draft-tokens=24

with https://huggingface.co/lmsys/sglang-EAGLE-LLaMA3-Instruct-8B on gsm8k I get expected results:

Average Latency: 52.161 s
Average Output throughput: 86.099 token/s
Average Accept length: 2.313

So the EAGLE3 pipeline works normally on Llama models.
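
For intuition, the two accept lengths alone roughly explain the throughput gap. A back-of-the-envelope sketch (the cost model is deliberately crude, and draft_overhead is a hypothetical placeholder rather than a measured value):

# Crude estimate: tokens committed per verification round, discounted by an
# assumed draft cost relative to one target forward pass. Not a real profile.
def naive_speedup(accept_len: float, draft_overhead: float = 0.25) -> float:
    return accept_len / (1.0 + draft_overhead)

print(f"Qwen2.5-VL   (accept_len 1.037): {naive_speedup(1.037):.2f}x")  # ~0.83x, slower than plain decoding
print(f"Llama-3.1-8B (accept_len 2.313): {naive_speedup(2.313):.2f}x")  # ~1.85x, the expected speedup regime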

6. VLLM results: Qwen2.5-VL EAGLE3 behaves correctly

I also tested Qwen2.5-VL EAGLE3 in vLLM, using configs from https://github.com/vllm-project/vllm/pull/22872

Example command:

vllm serve \
    /ch/pretrained_models/Qwen2.5-VL-7B-Instruct \
    --port 5580 --host 0.0.0.0 \
    --max-num-seqs 128 --dtype bfloat16 --max-model-len=8192  \
    --no-enable-prefix-caching --trust-remote-code -tp 1 \
    --speculative-config '{"method": "eagle3", "model": "/ch/pretrained_models/qwen2.5-vl-7b-eagle3-sgl", "prefill_token_shift": false, "num_speculative_tokens": 3, "draft_tensor_parallel_size": 1, "max_model_len": 8192}' \
    --num-lookahead-slots=3 \
    --gpu-memory-utilization=0.93

Results:

  • with EAGLE3: Output token throughput (tok/s) = 135.67
  • without EAGLE3: Output token throughput (tok/s) = 97.92
  • end-to-end speedup ≈ 1.385× → ✔ expected behavior

Meaning: the Qwen2.5-VL EAGLE3 draft model itself is fine, but SGLang's integration leads to an extremely low accept_len.

7. My question

  • Is my configuration missing anything specific for multimodal models?
  • Are additional modifications needed beyond PR #8801 to fully support Qwen VL EAGLE3?

Any guidance or hints would be greatly appreciated. Thank you very much for your help!

C3236455482 avatar Nov 19 '25 07:11 C3236455482

Thank you for your feedback. I'll test it today.

jiapingW avatar Nov 20 '25 00:11 jiapingW

https://github.com/sgl-project/SpecForge/pull/279 Hi, it seems this issue is related to the SGLang-side integration. Could you help test this PR? You can evaluate the accept length of Qwen2.5-VL without relying on SGLang. Currently, tree decoding is not available, and the batch size should be set to 1, with top-k = 1, i.e., only a single decoding path.
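
For context, a minimal sketch of what measuring the accept length with a single decoding path looks like (illustrative only; target_step/draft_step are hypothetical greedy next-token callables, not SpecForge's actual API, and the loop re-runs the target per token purely for clarity):

# Greedy chain verification with batch size 1 and top-k = 1 (single path):
# the draft proposes num_steps tokens, the target checks them in order, and
# the average of (accepted + 1 bonus token) per round is the accept length.
def measure_accept_length(target_step, draft_step, prompt_ids, num_steps=4, max_new=128):
    ids = list(prompt_ids)
    committed_per_round = []
    while sum(committed_per_round) < max_new:
        ctx, proposal = list(ids), []
        for _ in range(num_steps):            # draft a single chain of tokens
            tok = draft_step(ctx)
            proposal.append(tok)
            ctx.append(tok)
        accepted = 0
        for tok in proposal:                  # verify greedily, stop at the first mismatch
            if target_step(ids) == tok:
                ids.append(tok)
                accepted += 1
            else:
                break
        ids.append(target_step(ids))          # the target's own token always advances by one
        committed_per_round.append(accepted + 1)
    return sum(committed_per_round) / len(committed_per_round)

# Toy usage with dummy "models": the draft always guesses 0, the target cycles 0..4.
toy_target = lambda ids: len(ids) % 5
toy_draft = lambda ids: 0
print(measure_accept_length(toy_target, toy_draft, prompt_ids=[1, 2, 3], max_new=64))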

Lihui-Gu avatar Nov 20 '25 03:11 Lihui-Gu

I have reproduced your result. I think it's a bug in the latest SGLang Qwen2.5-VL EAGLE3 implementation. I will try to fix it.

jiapingW avatar Nov 20 '25 06:11 jiapingW

I tested with sglang==0.5.4. The result below looks OK.

Created temporary image directory: .cache/mmstar_specforge
Loaded 100 questions.
100%|██████████| 100/100 [00:44<00:00,  2.24it/s]
Warning: 15 predictions could not be extracted.

==================================================
MMSTAR Evaluation Results
==================================================
Number of questions: 100
Average Accuracy: 0.3400 (34.00%)
Average Latency: 44.846 s
Average Output throughput: 147.304 token/s
Average Accept length: 2.195
==================================================

Deleted temporary directory: .cache/mmstar_specforge

But if I use sglang==0.5.5.post3, the average accept length is about 1.0, which is not correct.

jiapingW avatar Nov 20 '25 09:11 jiapingW

Thank you so much for your reply!

You are right: everything worked fine after I installed "sglang[all]==0.5.4", but later versions seem to have some bugs in the EAGLE3 method for VLMs.

C3236455482 avatar Nov 20 '25 10:11 C3236455482

When I test Qwen3-VL with EAGLE3 (https://huggingface.co/collections/taobao-mnn/eagle3) using SGLang, I have to downgrade my environment to:

sglang[all] == 0.5.3

and make simple modifications to the code in sglang/python/sglang/srt/models/qwen3_vl.py, as described in https://github.com/sgl-project/sglang/pull/8801

On sglang[all] == 0.5.4, Qwen2.5-VL + EAGLE3 works correctly, but Qwen3-VL + EAGLE3 does not; the evaluation only works if I roll back to 0.5.3.

I’m not sure what the underlying version differences are for VLM + Eagle3 support, but this version dependency makes it very difficult for users to reproduce results unless they already know the exact working combination in advance.

C3236455482 avatar Nov 20 '25 10:11 C3236455482

The SGLang team is designing and implementing spec v2, which will handle this issue.

jiapingW avatar Nov 20 '25 12:11 jiapingW

@C3236455482 for qwen2.5-vl, you can try https://github.com/sgl-project/sglang/pull/13904

Lzhang-hub avatar Nov 25 '25 08:11 Lzhang-hub