[Bug] DeepSeek-R1 IndexError: index 2383 is out of bounds for dimension 0 with size 2383
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
```
[2025-02-16 19:29:38 TP4] Scheduler hit an exception: Traceback (most recent call last):
  File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 1827, in run_scheduler_process
    scheduler.event_loop_normal()
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 479, in event_loop_normal
    self.process_batch_result(batch, result)
  File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 1120, in process_batch_result
    self.process_batch_result_prefill(batch, result)
  File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 1181, in process_batch_result_prefill
    self.tree_cache.cache_unfinished_req(req)
  File "/opt/tiger/sglang/python/sglang/srt/mem_cache/chunk_cache.py", line 62, in cache_unfinished_req
    kv_indices = self.req_to_token_pool.req_to_token[
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: index 2383 is out of bounds for dimension 0 with size 2383
```

TP7 reports an identical traceback at the same timestamp (2025-02-16 19:29:38).
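For context on the failure mode: the slice in `chunk_cache.py` indexes the preallocated `req_to_token` tensor by request-slot index, and the error says the slot index (2383) equals the tensor's first-dimension size (2383), i.e. one past the last valid slot. A minimal sketch of that mechanism follows; the shapes and the `req_pool_idx` name are assumptions for illustration, not sglang's exact code:

```python
import torch

# Hypothetical stand-in for the req_to_token pool: one row per request slot,
# so valid slot indices are 0 .. max_running_requests - 1.
max_running_requests = 2383  # value reported by the server in this issue
max_context_len = 8          # tiny placeholder; the real value is much larger

req_to_token = torch.zeros((max_running_requests, max_context_len), dtype=torch.int32)

# A slot index equal to the pool size is one past the end and reproduces the
# same message as the traceback above.
req_pool_idx = 2383  # assumption: the failing index is the request's pool slot
try:
    kv_indices = req_to_token[req_pool_idx, :4]
except IndexError as e:
    print(e)  # index 2383 is out of bounds for dimension 0 with size 2383
```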
Reproduction
Two H20 nodes, launched with the following commands (they differ only in `--node-rank`):
```bash
# node 0
python -m sglang.launch_server --model-path /opt/tiger/DeepSeek-R1 --tp 16 \
  --dist-init-addr ip:3894 --nnodes 2 --node-rank 0 \
  --kv-cache-dtype fp8_e5m2 --trust-remote-code --port $PORT0 \
  --speculative-algo NEXTN --speculative-draft /opt/tiger/DeepSeek-R1-NextN \
  --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 \
  --disable-radix --mem-fraction-static 0.7

# node 1
python -m sglang.launch_server --model-path /opt/tiger/DeepSeek-R1 --tp 16 \
  --dist-init-addr ip:3894 --nnodes 2 --node-rank 1 \
  --kv-cache-dtype fp8_e5m2 --trust-remote-code --port $PORT0 \
  --speculative-algo NEXTN --speculative-draft /opt/tiger/DeepSeek-R1-NextN \
  --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 \
  --disable-radix --mem-fraction-static 0.7
```
Environment
```
Python: 3.11.2 (main, Jul 23 2024, 17:09:09) [GCC 10.2.1 20210110]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.183.06
PyTorch: 2.5.1+cu124
sglang: 0.4.3
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post1
triton: 3.1.0
transformers: 4.48.3
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.0
orjson: 3.10.14
packaging: 24.2
psutil: 7.0.0
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.63.0
tiktoken: 0.9.0
anthropic: 0.42.0
decord: 0.6.0

NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  NIC8  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  SYS   PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   1,4-89        0              N/A
GPU1  NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  SYS   NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   1,4-89        0              N/A
GPU2  NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  SYS   NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   1,4-89        0              N/A
GPU3  NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  SYS   NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   1,4-89        0              N/A
GPU4  NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  91,94-179     1              N/A
GPU5  NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   SYS   SYS   SYS   SYS   NODE  PIX   NODE  NODE  91,94-179     1              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   SYS   SYS   SYS   SYS   NODE  NODE  PIX   NODE  91,94-179     1              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   SYS   SYS   SYS   SYS   NODE  NODE  NODE  PIX   91,94-179     1              N/A
NIC0  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   X     SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS
NIC1  PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS   X     NODE  NODE  NODE  SYS   SYS   SYS   SYS
NIC2  NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   SYS   NODE  X     NODE  NODE  SYS   SYS   SYS   SYS
NIC3  NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   SYS   NODE  NODE  X     NODE  SYS   SYS   SYS   SYS
NIC4  NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   SYS   NODE  NODE  NODE  X     SYS   SYS   SYS   SYS
NIC5  SYS   SYS   SYS   SYS   PIX   NODE  NODE  NODE  SYS   SYS   SYS   SYS   SYS   X     NODE  NODE  NODE
NIC6  SYS   SYS   SYS   SYS   NODE  PIX   NODE  NODE  SYS   SYS   SYS   SYS   SYS   NODE  X     NODE  NODE
NIC7  SYS   SYS   SYS   SYS   NODE  NODE  PIX   NODE  SYS   SYS   SYS   SYS   SYS   NODE  NODE  X     NODE
NIC8  SYS   SYS   SYS   SYS   NODE  NODE  NODE  PIX   SYS   SYS   SYS   SYS   SYS   NODE  NODE  NODE  X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8

Hypervisor vendor: KVM
ulimit soft: 1024768
```
max_running_requests=2383 (the same value as the out-of-bounds index and the dimension-0 size in the traceback).
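Reading the two numbers together (an observation, not a confirmed root cause): with `max_running_requests=2383`, valid request-slot indices are 0..2382, so the failing request was assigned slot 2383, one slot beyond the pool. That points at request-slot accounting while NEXTN/EAGLE drafts are in flight rather than at the slice itself. A hypothetical defensive check, sketched below, would surface the mismatch with a clearer message; the attribute names mirror the traceback but are otherwise assumptions and this is not the actual sglang fix:

```python
# Hypothetical guard before the out-of-bounds slice in cache_unfinished_req().
def check_req_slot(req_pool_idx: int, req_to_token) -> None:
    pool_size = req_to_token.shape[0]  # == max_running_requests
    if not (0 <= req_pool_idx < pool_size):
        raise RuntimeError(
            f"Request slot {req_pool_idx} is outside the req_to_token pool "
            f"(size {pool_size}); more request slots were consumed than "
            f"max_running_requests allows."
        )
```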
CC @ispobock
I hit the same bug when NEXTN is enabled.
Please disable the radix cache when NEXTN is on.
> Please disable the radix cache when NEXTN is on.

When NEXTN is enabled, it seems that the radix cache is already disabled by default.