[Bug] EAGLE fails on Llama3-8b
## Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [ ] 5. Please use English, otherwise it will be closed.
## Describe the bug
Hi team. I'm testing EAGLE speculative decoding. The base model is Meta-Llama-3-8B-Instruct and the EAGLE draft model is sglang-EAGLE-LLaMA3-Instruct-8B. The issue relates to `max_position_embeddings`, which is 2048 in the EAGLE draft config, but in my case the context length will be larger than that. The sglang server starts fine but crashes when processing requests.
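For context, the mismatch can be illustrated with the `max_position_embeddings` values involved. A minimal sketch (the 2048 value is from the draft config as reported; 8192 for the base Llama-3-8B config is my assumption, and `context_fits` is a hypothetical helper, not an sglang API):

```python
# Hypothetical minimal configs mirroring the two models in this report:
base_config = {"max_position_embeddings": 8192}   # Meta-Llama-3-8B-Instruct (assumed)
draft_config = {"max_position_embeddings": 2048}  # sglang-EAGLE-LLaMA3-Instruct-8B

requested_context = 6000  # the --context-length passed at launch

def context_fits(config, ctx):
    """True if the requested context fits inside the model's positional range."""
    return ctx <= config["max_position_embeddings"]

print(context_fits(base_config, requested_context))   # True
print(context_fits(draft_config, requested_context))  # False: draft tops out at 2048
```

The requested 6000-token context fits the base model but exceeds the draft model's positional range, which is consistent with the crash appearing only on long requests.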
GPU: A100 80G
Docker image: lmsysorg/sglang:v0.4.2.post4-cu124-srt
start script:

```shell
SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python3 -m sglang.launch_server \
    --host 0.0.0.0 \
    --port 80 \
    --served-model-name llama \
    --model ./Meta-Llama-3-8B-Instruct/ \
    --trust-remote-code \
    --dtype float16 \
    --mem-fraction 0.5 \
    --max-running-requests 16 \
    --speculative-algo EAGLE \
    --speculative-draft ./sglang-EAGLE-LLaMA3-Instruct-8B/ \
    --disable-cuda-graph \
    --context-length 6000
```
crash log:

```
[2025-02-14 06:41:27 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1798, in run_scheduler_process
    scheduler.event_loop_normal()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 479, in event_loop_normal
    self.process_batch_result(batch, result)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1119, in process_batch_result
    self.process_batch_result_prefill(batch, result)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1150, in process_batch_result_prefill
    next_token_ids = next_token_ids.tolist()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
Is there any workaround, or do I have to train an EAGLE draft model myself with a proper context length? Thanks!
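One workaround I considered (untested, and whether the draft model's weights actually generalize beyond the 2048 positions they were trained on is a separate question) is raising `max_position_embeddings` in the draft model's `config.json` before launching. A sketch of such a patch, demonstrated on a temporary copy rather than the real draft directory:

```python
import json
import os
import shutil
import tempfile

def patch_max_pos(config_path, new_max):
    """Rewrite max_position_embeddings in a HF-style config.json in place."""
    with open(config_path) as f:
        cfg = json.load(f)
    cfg["max_position_embeddings"] = new_max
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg

# Demo on a throwaway copy; for the real run this would target
# ./sglang-EAGLE-LLaMA3-Instruct-8B/config.json before launch_server.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "config.json")
with open(path, "w") as f:
    json.dump({"max_position_embeddings": 2048}, f)

patched = patch_max_pos(path, 6000)
print(patched["max_position_embeddings"])  # 6000
shutil.rmtree(tmpdir)
```

Even if this stops the crash, output quality past 2048 tokens would need to be verified separately.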
## Reproduction
start server:

```shell
SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python3 -m sglang.launch_server \
    --host 0.0.0.0 \
    --port 80 \
    --served-model-name llama \
    --model ./Meta-Llama-3-8B-Instruct/ \
    --trust-remote-code \
    --dtype float16 \
    --mem-fraction 0.5 \
    --max-running-requests 16 \
    --speculative-algo EAGLE \
    --speculative-draft ./sglang-EAGLE-LLaMA3-Instruct-8B/ \
    --disable-cuda-graph \
    --context-length 6000
```
Call the server with a prompt longer than 2k tokens.
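For concreteness, a request of the failing shape can be built like this (the payload shape assumes sglang's native `/generate` endpoint; the exact prompt text is arbitrary, only its length matters):

```python
import json
from urllib import request

SERVER = "http://0.0.0.0:80/generate"  # port from the launch command above

# Build a prompt comfortably longer than the draft model's 2048-position limit.
prompt = "tell me a story. " * 1000

payload = {
    "text": prompt,
    "sampling_params": {"max_new_tokens": 64, "temperature": 0},
}
body = json.dumps(payload).encode()

# Sending this is what triggers the illegal-memory-access crash on my setup:
# req = request.Request(SERVER, data=body,
#                       headers={"Content-Type": "application/json"})
# print(request.urlopen(req).read())
```

Short prompts (well under 2048 tokens) are served without any error.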
## Environment
GPU: A100 80G
Docker image: lmsysorg/sglang:v0.4.2.post4-cu124-srt