SGLang Launch error with Qwen2.5-14B draft model, need help
My target model is Qwen2.5-14B.
I used the default config to train a draft model. `train_eagle3_online.py` generated an `eagle3-config.json` with this content:
```json
{
  "architectures": ["LlamaForCausalLMEagle3"],
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 13824,
  "max_position_embeddings": 131072,
  "model_type": "llama",
  "num_attention_heads": 40,
  "num_key_value_heads": 8,
  "num_hidden_layers": 1,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.28.1",
  "use_cache": true,
  "vocab_size": 152064,
  "draft_vocab_size": 32000
}
```
After training, I used SGLang to load the draft model with this command:
```shell
python -m sglang.launch_server \
  --model-path /home/Models/Qwen2.5-14B-Instruct \
  --host 0.0.0.0 --port 30000 \
  --tp-size 2 \
  --served-model-name qwen2 \
  --context-length 2048 \
  --speculative-algorithm EAGLE3 \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 8 \
  --mem-fraction 0.6 \
  --cuda-graph-max-bs 2 \
  --dtype float16 \
  --speculative-draft-model-path /home/Models/epoch_2
```
The error stack is:
```
Capturing batches (bs=2 avail_mem=14.40 GB):   0%| | 0/2 [00:00<?, ?it/s]
[2025-09-13 18:14:09 TP1] Registering 0 cuda graph addresses
Capturing batches (bs=2 avail_mem=14.40 GB):   0%| | 0/2 [00:01<?, ?it/s]
[2025-09-13 18:14:09 TP0] Registering 0 cuda graph addresses
[2025-09-13 18:14:09 TP1] Scheduler hit an exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2587, in run_scheduler_process
    scheduler = Scheduler(
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 329, in __init__
    self.tp_worker = TpWorkerClass(
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 93, in __init__
    self.model_runner = ModelRunner(
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 250, in __init__
    self.initialize(min_per_gpu_memory)
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 386, in initialize
    self.init_device_graphs()
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 1761, in init_device_graphs
    self.graph_runner = graph_runners[self.device](self)
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 389, in __init__
    self.capture()
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 497, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 668, in capture_one_batch_size
    run_once()
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 657, in run_once
    logits_output_or_pp_proxy_tensors = forward(
  File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.12/site-packages/sglang/srt/models/qwen2.py", line 489, in forward
    hidden_states, aux_hidden_states = hidden_states
ValueError: too many values to unpack (expected 2)
```
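For what it's worth, the `ValueError` at the bottom is ordinary Python unpacking behavior: `qwen2.py` expects `hidden_states` to be a 2-tuple of `(hidden_states, aux_hidden_states)`, but during CUDA graph capture it apparently receives a plain batched tensor, so unpacking iterates over the first dimension instead. A minimal sketch (the shapes here are illustrative, not taken from the actual model):

```python
# Stand-in for a batched hidden-states value: 4 row vectors instead of
# the (hidden_states, aux_hidden_states) 2-tuple the caller expects.
hidden_states = [[0.0] * 5120 for _ in range(4)]

try:
    # Unpacking a 4-element sequence into 2 names fails the same way
    # as the traceback above.
    hs, aux = hidden_states
except ValueError as e:
    print(e)  # → too many values to unpack (expected 2)
```

So the bug is about *what* the Qwen2 forward pass returns when EAGLE3 aux hidden states are enabled, not about the draft model weights themselves.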
I'm new to this and could use some help!
The SGLang version is:

```python
>>> import sglang
>>> sglang.__version__
'0.5.2'
```
The draft model's `config.json` is:
```json
{
  "architectures": ["LlamaForCausalLMEagle3"],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "draft_vocab_size": 32000,
  "eos_token_id": 151645,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 13824,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 40,
  "num_hidden_layers": 1,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.55.2",
  "use_cache": true,
  "vocab_size": 152064
}
```
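As a quick sanity check before digging into SGLang itself, the draft config's dimensions can be compared against the target model's. The expected values below are the published Qwen2.5-14B dimensions (an assumption here; verify against your local target `config.json`):

```python
import json

# The draft model's config as posted above, trimmed to the checked fields.
draft_cfg = json.loads("""{
  "hidden_size": 5120,
  "num_attention_heads": 40,
  "num_key_value_heads": 8,
  "vocab_size": 152064
}""")

# Qwen2.5-14B target dimensions (assumed from the model card).
target_dims = {
    "hidden_size": 5120,
    "num_attention_heads": 40,
    "num_key_value_heads": 8,
    "vocab_size": 152064,
}

for key, expected in target_dims.items():
    assert draft_cfg[key] == expected, f"mismatch on {key}"
print("draft config dimensions match the target model")
```

The check passes for the posted config, which suggests the failure is in the serving code path rather than a dimension mismatch between draft and target.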
Could you post your draft model? Then we can reproduce this bug.
I have fixed it.
@justadogistaken Thank you for your concern. However, my draft model is quite large — even after compression it’s about 800MB, so uploading and downloading might not be very convenient. Could you please share with me the approach and direction for troubleshooting? I can try to resolve the issue myself first, and if it really doesn’t work, I’ll upload the model file later.
@jiapingW Let me take a look into it.
You can follow the code here: https://github.com/sgl-project/sglang/pull/10517
@jiapingW It works, thanks a lot!