nano-vllm
Prevented fully cached sequences from being processed in prefill phase
Thank you for this amazing open-source project!
Problem
Fixes the crash reported in #114, which occurs when the prompt length exactly equals kvcache_block_size (256 tokens). Additionally, this PR optimizes scheduler performance by preventing fully cached sequences from entering the prefill computation.
Root Cause Analysis
- Crash Issue: When seq.num_cached_tokens == len(seq), all tokens are cached, resulting in tokens_to_compute = 0
- Empty Tensor Problem: Zero-length tensors are created and passed to CUDA kernels, causing an "invalid configuration argument" error
- Resource Waste: Fully cached sequences were still being processed in prefill phase, wasting computational resources
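To make the failure mode concrete, here is a minimal sketch of the cache-hit arithmetic. The Sequence class below is a simplified stand-in for the project's sequence object, not its actual implementation; only the names num_cached_tokens and tokens_to_compute come from the description above.
```python
# Minimal sketch, assuming a prompt whose length is exactly one KV-cache block
# and a full prefix-cache hit (the scenario reported in #114).
KVCACHE_BLOCK_SIZE = 256


class Sequence:
    """Simplified stand-in for the scheduler's sequence object."""

    def __init__(self, token_ids, num_cached_tokens):
        self.token_ids = token_ids
        self.num_cached_tokens = num_cached_tokens

    def __len__(self):
        return len(self.token_ids)


# A 256-token prompt that hits the prefix cache in its entirety.
prompt = list(range(KVCACHE_BLOCK_SIZE))
seq = Sequence(prompt, num_cached_tokens=KVCACHE_BLOCK_SIZE)

tokens_to_compute = len(seq) - seq.num_cached_tokens
assert tokens_to_compute == 0  # zero tokens left -> empty tensors reach the CUDA kernels
```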
Solution
1. Scheduler Optimization (scheduler.py)
Before: All sequences were added to scheduled_seqs regardless of cache status
```python
num_seqs += 1
scheduled_seqs.append(seq)
num_batched_tokens += len(seq) - seq.num_cached_tokens
```
After: Only sequences with tokens_to_compute > 0 are scheduled for computation
```python
tokens_to_compute = len(seq) - seq.num_cached_tokens
if tokens_to_compute > 0:
    num_seqs += 1
    scheduled_seqs.append(seq)
    num_batched_tokens += tokens_to_compute
```
Benefits:
- ✅ Prevents empty tensor creation
- ✅ Eliminates unnecessary CUDA kernel launches
- ✅ Improves batch efficiency in high cache-hit scenarios
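For context, here is a condensed sketch of how this guard might sit inside the scheduling loop. The function name, the waiting/running queues, and the max_num_seqs / max_num_batched_tokens budget checks are simplified assumptions for illustration, not the project's exact code:
```python
def schedule_prefill(waiting, running, max_num_seqs, max_num_batched_tokens):
    """Simplified prefill scheduling pass with the tokens_to_compute guard.

    `waiting` and `running` are collections.deque instances holding
    Sequence-like objects (see the stand-in class sketched earlier).
    """
    scheduled_seqs = []
    num_seqs = 0
    num_batched_tokens = 0
    while waiting and num_seqs < max_num_seqs:
        seq = waiting[0]
        tokens_to_compute = len(seq) - seq.num_cached_tokens
        if num_batched_tokens + tokens_to_compute > max_num_batched_tokens:
            break
        waiting.popleft()
        if tokens_to_compute > 0:
            # Only sequences that still have uncached tokens enter prefill,
            # so no zero-length tensors ever reach the CUDA kernels.
            num_seqs += 1
            scheduled_seqs.append(seq)
            num_batched_tokens += tokens_to_compute
        # Fully cached sequences skip prefill but still join the running
        # queue, so the decode phase picks them up (see the discussion below).
        running.append(seq)
    return scheduled_seqs
```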
2. Block Manager (block_manager.py)
Before: An unconditional assert last_block.hash == -1 fails in edge cases where the last block's hash has already been computed
```python
assert last_block.hash == -1
# ... hash computation code
```
After: A conditional check handles this edge case gracefully
```python
if last_block.hash == -1:
    # ... hash computation code
```
Purpose: This change ensures compatibility with fully cached sequences, where the existing hash value is already correct and doesn't need to be recalculated.
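Below is a hedged sketch of what the changed code path might look like in context. The function name, the compute_hash helper, and the hash_to_block_id mapping are illustrative assumptions; only the assert-to-if change itself comes from this PR:
```python
def finalize_last_block(last_block, token_ids, prefix_hash, compute_hash, hash_to_block_id):
    """Finalize the hash of a sequence's last (now full) KV-cache block.

    Fully cached sequences reuse a block whose hash was computed when it was
    first cached, so the old `assert last_block.hash == -1` would fire here.
    """
    if last_block.hash == -1:
        # Freshly filled block: compute its hash and register it so later
        # sequences with the same prefix can reuse this block.
        last_block.hash = compute_hash(token_ids, prefix_hash)
        hash_to_block_id[last_block.hash] = last_block.block_id
    # Otherwise the existing hash is already correct; nothing to recompute.
```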
Closes #114
I look forward to your review and would appreciate any feedback.
I think that even if the entire seq is cached, it should still be scheduled; in that case the input only needs to be the last token.
@GentleCold You're absolutely right, and this doesn't conflict with my change. You're referring to the decode phase, and that's exactly what my implementation does: fully cached sequences are placed directly into the running queue and processed in the decode phase.
The key point is that fully cached sequences skip the prefill computation (where tokens_to_compute == 0), but they are still scheduled for decode operations.
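As a simplified illustration (the prepare_decode name and the field access below are assumptions, not the project's actual decode path):
```python
def prepare_decode(running):
    """Build decode-phase model inputs: one token per running sequence.

    A fully cached sequence skipped prefill entirely, but it sits in the
    running queue like any other sequence, so only its last token is fed to
    the model here and generation continues normally.
    """
    input_ids = [seq.token_ids[-1] for seq in running]  # last token of each sequence
    positions = [len(seq) - 1 for seq in running]       # position index of that token
    return input_ids, positions
```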
I see, thank you for your explanation :)