[Bug] Decode OOM with spec
Describe the bug
The bug can be reproduced on the branch from #13740:
python -m unittest test_eagle_infer_b.py
Anyone interested can take a look and get assigned. It is a good issue for a deep dive into SGLang's memory management and advanced speculative decoding.
I have experience with Spec, can I look into it?
@adityakamat24 Sure, please do it
Hey @hnyls2002
Did some digging. I think the OOM comes from kv_allocated_len not getting updated in the paged allocation paths.
In eagle_info.py's prepare_for_verify(), the page_size == 1 branch updates kv_allocated_len after allocation. But the paged branch (the else case) just calls alloc_paged_token_slots_extend() and never touches it. The same issue is in eagle_worker.py's _draft_preprocess_decode(). When release_kv_cache() runs, it uses kv_allocated_len to know what to free, so if that value is never updated, the slots allocated in those paths never get freed. With 400 requests doing multiple decode iterations, memory just leaks until OOM.
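To make the suspected mechanism concrete, here is a toy model of it. This is only a sketch, not SGLang's code: ToyAllocator, ToyReq, decode_step, run, and all the sizes are made up for illustration, and release_kv_cache here is a simplified stand-in that only borrows the name. It assumes the release path frees exactly kv_allocated_len slots and nothing else.

```python
from dataclasses import dataclass


@dataclass
class ToyAllocator:
    """Hypothetical slot pool; stands in for the real KV cache allocator."""
    free_slots: int

    def alloc(self, n: int) -> None:
        if n > self.free_slots:
            raise MemoryError("out of KV cache slots")
        self.free_slots -= n

    def free(self, n: int) -> None:
        self.free_slots += n


@dataclass
class ToyReq:
    # What the release path believes was allocated for this request.
    kv_allocated_len: int = 0


def decode_step(alloc: ToyAllocator, req: ToyReq, n_tokens: int, update_len: bool) -> None:
    """Allocate slots for one decode iteration.

    update_len=True models the page_size == 1 branch (bookkeeping updated);
    update_len=False models the suspected paged branch (bookkeeping skipped).
    """
    alloc.alloc(n_tokens)
    if update_len:
        req.kv_allocated_len += n_tokens


def release_kv_cache(alloc: ToyAllocator, req: ToyReq) -> None:
    """Simplified stand-in: free only what kv_allocated_len records."""
    alloc.free(req.kv_allocated_len)
    req.kv_allocated_len = 0


def run(update_len: bool, num_reqs: int = 400, steps: int = 4,
        tokens_per_step: int = 8, pool: int = 10_000) -> int:
    """Process requests one after another and return the free slots left."""
    alloc = ToyAllocator(free_slots=pool)
    for _ in range(num_reqs):
        req = ToyReq()
        for _ in range(steps):
            decode_step(alloc, req, tokens_per_step, update_len)
        release_kv_cache(alloc, req)
    return alloc.free_slots
```

In this model, run(update_len=True) gives every slot back, while run(update_len=False) leaks steps * tokens_per_step slots per request and raises MemoryError before all 400 requests finish. If the real paged paths behave like the update_len=False case, updating kv_allocated_len right after the paged allocation should be the fix.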
What do you think?
@hnyls2002 can I also try this?