[Bug] Training Eagle3 for gpt-oss-120b fails with `AttributeError: 'NoneType' object has no attribute 'evictable_size'`

Open gopalsarda opened this issue 2 months ago • 1 comment

Checklist

[x] 1. I have searched related issues but cannot get the expected help.
[x] 2. The bug has not been fixed in the latest version.
[ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
[ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/SpecForge/discussions/new/choose Otherwise, it will be closed.
[ ] 5. Please use English, otherwise it will be closed.

Describe the bug

Hi team, I am trying to train Eagle3 for gpt-oss-120b by following the example at run_gpt_oss_120b_eagle3_sgl_online.sh.

I am using docker.io/lmsysorg/sglang:dev as the base image, and I installed SpecForge by running pip install -e . in the SpecForge git directory.

Training currently fails with the error below. Could someone please help me understand what might be happening here? Thanks!

[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/git/SpecForge/scripts/train_eagle3_sgl_online.py", line 775, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/git/SpecForge/scripts/train_eagle3_sgl_online.py", line 771, in main
[rank0]:     trainer.train()
[rank0]:   File "/mnt/git/SpecForge/scripts/train_eagle3_sgl_online.py", line 699, in train
[rank0]:     data_for_draft = self.target_model.forward(
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/git/SpecForge/specforge/modeling/target/sgl_model_wrapper.py", line 253, in forward
[rank0]:     hidden_states_list, aux_hidden_states_list = self.extend(reqs)
[rank0]:                                                  ^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/git/SpecForge/specforge/modeling/target/sgl_model_wrapper.py", line 200, in extend
[rank0]:     return _extend(
[rank0]:            ^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/git/SpecForge/specforge/modeling/target/sgl_model_wrapper.py", line 81, in _extend
[rank0]:     batch.prepare_for_extend()
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/schedule_batch.py", line 1266, in prepare_for_extend
[rank0]:     out_cache_loc = self.alloc_token_slots(extend_num_tokens)
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/schedule_batch.py", line 988, in alloc_token_slots
[rank0]:     f"{self._available_and_evictable_str()}"
[rank0]:        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/schedule_batch.py", line 1843, in _available_and_evictable_str
[rank0]:     evictable_size = self.tree_cache.evictable_size()
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AttributeError: 'NoneType' object has no attribute 'evictable_size'
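
One reading of the traceback: the last two frames are inside an f-string in alloc_token_slots, which suggests the AttributeError is raised while an error message is being formatted, so it may be hiding a different underlying failure (for example, not enough free token slots for the extend batch). The sketch below only illustrates that masking pattern. FakeBatch, its fields, and the message text are hypothetical stand-ins and are not taken from the sglang source; only the two method names are borrowed from the traceback.

```python
class FakeBatch:
    """Hypothetical stand-in for a scheduler batch; not the sglang class."""

    def __init__(self, free_slots, tree_cache=None):
        self.free_slots = free_slots
        # None mirrors the state implied by the traceback.
        self.tree_cache = tree_cache

    def _available_and_evictable_str(self):
        # Dereferences tree_cache without a None check, like the last frame.
        evictable_size = self.tree_cache.evictable_size()
        return f"available={self.free_slots}, evictable={evictable_size}"

    def alloc_token_slots(self, num_tokens):
        if num_tokens > self.free_slots:
            # The message is built before the intended error is raised, so a
            # failure inside the helper replaces the intended RuntimeError.
            raise RuntimeError(
                f"Out of token slots: need {num_tokens}. "
                f"{self._available_and_evictable_str()}"
            )
        self.free_slots -= num_tokens
        return num_tokens


def try_alloc(batch, num_tokens):
    """Return 'ok' on success, otherwise the name of the exception raised."""
    try:
        batch.alloc_token_slots(num_tokens)
        return "ok"
    except Exception as exc:
        return type(exc).__name__
```

With tree_cache left as None, a request that exceeds the free slots surfaces as AttributeError instead of the intended RuntimeError. If the same thing is happening here, the underlying condition would be the allocation failure rather than the missing cache itself.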

Reproduction

TARGET_MODEL_PATH=/mnt/models/gpt-oss-120b
EXP_PATH=/mnt/git/SpecForge/exp/2025-10-24
NUM_GPUS=8
MAX_LENGTH=8192
CHAT_TEMPLATE=gpt-oss-naive

torchrun \
    --standalone \
    --nproc_per_node $NUM_GPUS \
    scripts/train_eagle3_sgl_online.py \
    --target-model-path $TARGET_MODEL_PATH \
    --model-path $TARGET_MODEL_PATH \
    --draft-model-config ./configs/gpt-oss-120B-eagle3.json \
    --train-data-path $EXP_PATH/dataset/all_train.jsonl \
    --tp-size $NUM_GPUS \
    --output-dir $EXP_PATH/outputs \
    --num-epochs 2 \
    --batch-size 1 \
    --learning-rate 7e-5 \
    --draft-attention-backend sdpa \
    --draft-global-batch-size 32 \
    --max-length $MAX_LENGTH \
    --chat-template $CHAT_TEMPLATE \
    --cache-dir $EXP_PATH/cache/ \
    --mem-frac=0.4 \
    --total-steps=800000 \
    --warmup-ratio=0.015 \
    --dist-timeout=10 \
    --save-interval 40000 \
    --resume

Environment

I am using docker.io/lmsysorg/sglang:dev as the base image, and I installed SpecForge by running pip install -e . in the SpecForge git directory.

Oct 26 '25 00:10 gopalsarda

@zyksir Sorry for tagging you directly. Did you notice anything like this while developing the feature in https://github.com/sgl-project/SpecForge/pull/239?

Oct 28 '25 16:10 gopalsarda