[Bug] EAGLE fails on Llama3-8b
## Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [ ] 5. Please use English, otherwise it will be closed.
## Describe the bug
Hi team. I'm testing EAGLE speculative decoding. The base model is Meta-Llama-3-8B-Instruct and the EAGLE draft model is sglang-EAGLE-LLaMA3-Instruct-8B. The issue relates to `max_position_embeddings`, which is 2048 in the EAGLE draft config, but in my case the context length will be larger than that. The sglang server starts fine but crashes when processing requests.
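For context, the mismatch can be illustrated with the `max_position_embeddings` values involved. A minimal sketch (the 2048 value is from the draft config as reported; 8192 for the base Llama-3-8B config is my assumption, and `context_fits` is a hypothetical helper, not an sglang API):

```python
# Hypothetical minimal configs mirroring the two models in this report:
base_config = {"max_position_embeddings": 8192}   # Meta-Llama-3-8B-Instruct (assumed)
draft_config = {"max_position_embeddings": 2048}  # sglang-EAGLE-LLaMA3-Instruct-8B

requested_context = 6000  # the --context-length passed at launch

def context_fits(config, ctx):
    """True if the requested context fits inside the model's positional range."""
    return ctx <= config["max_position_embeddings"]

print(context_fits(base_config, requested_context))   # True
print(context_fits(draft_config, requested_context))  # False: draft tops out at 2048
```

The requested 6000-token context fits the base model but exceeds the draft model's positional range, which is consistent with the crash appearing only on long requests.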
GPU: A100 80G
Docker image: lmsysorg/sglang:v0.4.2.post4-cu124-srt
start script:

```shell
SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python3 -m sglang.launch_server \
    --host 0.0.0.0 \
    --port 80 \
    --served-model-name llama \
    --model ./Meta-Llama-3-8B-Instruct/ \
    --trust-remote-code \
    --dtype float16 \
    --mem-fraction 0.5 \
    --max-running-requests 16 \
    --speculative-algo EAGLE \
    --speculative-draft ./sglang-EAGLE-LLaMA3-Instruct-8B/ \
    --disable-cuda-graph \
    --context-length 6000
```
crash log:

```
[2025-02-14 06:41:27 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1798, in run_scheduler_process
    scheduler.event_loop_normal()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 479, in event_loop_normal
    self.process_batch_result(batch, result)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1119, in process_batch_result
    self.process_batch_result_prefill(batch, result)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1150, in process_batch_result_prefill
    next_token_ids = next_token_ids.tolist()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
Is there any workaround, or do I have to train an EAGLE draft model myself with a proper context length? Thanks!
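One workaround I considered (untested, and whether the draft model's weights actually generalize beyond the 2048 positions they were trained on is a separate question) is raising `max_position_embeddings` in the draft model's `config.json` before launching. A sketch of such a patch, demonstrated on a temporary copy rather than the real draft directory:

```python
import json
import os
import shutil
import tempfile

def patch_max_pos(config_path, new_max):
    """Rewrite max_position_embeddings in a HF-style config.json in place."""
    with open(config_path) as f:
        cfg = json.load(f)
    cfg["max_position_embeddings"] = new_max
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg

# Demo on a throwaway copy; for the real run this would target
# ./sglang-EAGLE-LLaMA3-Instruct-8B/config.json before launch_server.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "config.json")
with open(path, "w") as f:
    json.dump({"max_position_embeddings": 2048}, f)

patched = patch_max_pos(path, 6000)
print(patched["max_position_embeddings"])  # 6000
shutil.rmtree(tmpdir)
```

Even if this stops the crash, output quality past 2048 tokens would need to be verified separately.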
## Reproduction
start server:

```shell
SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python3 -m sglang.launch_server \
    --host 0.0.0.0 \
    --port 80 \
    --served-model-name llama \
    --model ./Meta-Llama-3-8B-Instruct/ \
    --trust-remote-code \
    --dtype float16 \
    --mem-fraction 0.5 \
    --max-running-requests 16 \
    --speculative-algo EAGLE \
    --speculative-draft ./sglang-EAGLE-LLaMA3-Instruct-8B/ \
    --disable-cuda-graph \
    --context-length 6000
```
Call the server with a prompt longer than 2k tokens.
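For concreteness, a request of the failing shape can be built like this (the payload shape assumes sglang's native `/generate` endpoint; the exact prompt text is arbitrary, only its length matters):

```python
import json
from urllib import request

SERVER = "http://0.0.0.0:80/generate"  # port from the launch command above

# Build a prompt comfortably longer than the draft model's 2048-position limit.
prompt = "tell me a story. " * 1000

payload = {
    "text": prompt,
    "sampling_params": {"max_new_tokens": 64, "temperature": 0},
}
body = json.dumps(payload).encode()

# Sending this is what triggers the illegal-memory-access crash on my setup:
# req = request.Request(SERVER, data=body,
#                       headers={"Content-Type": "application/json"})
# print(request.urlopen(req).read())
```

Short prompts (well under 2048 tokens) are served without any error.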
## Environment
GPU: A100 80G
Docker image: lmsysorg/sglang:v0.4.2.post4-cu124-srt