nano-vllm

Prevented fully cached sequences from being processed in prefill phase

Open codingliuyg opened this issue 1 month ago • 3 comments

Thank you for this amazing open-source project!

Problem

Fixes the crash reported in #114, which occurs when the prompt length exactly equals kvcache_block_size (256 tokens). Additionally, this change optimizes scheduler performance by preventing fully cached sequences from entering prefill computation.

Root Cause Analysis

  1. Crash Issue: When seq.num_cached_tokens == len(seq), every token is already cached, so tokens_to_compute = 0 (see the sketch below)
  2. Empty Tensor Problem: Zero-length tensors are created and passed to CUDA kernels, causing an "invalid configuration argument" error
  3. Resource Waste: Fully cached sequences were still processed in the prefill phase, wasting compute
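For illustration, the failing condition can be reproduced with a minimal sketch; the Seq class below is a hypothetical stand-in for nano-vllm's Sequence, and the block size is taken from the report:

from dataclasses import dataclass, field

KVCACHE_BLOCK_SIZE = 256  # assumed block size from the report in #114

@dataclass
class Seq:
    # Hypothetical stand-in for nano-vllm's Sequence class.
    token_ids: list[int] = field(default_factory=list)
    num_cached_tokens: int = 0

    def __len__(self) -> int:
        return len(self.token_ids)

# The prompt length exactly equals the block size and the whole prompt
# hits the prefix cache, so nothing is left to compute in prefill.
seq = Seq(token_ids=list(range(KVCACHE_BLOCK_SIZE)),
          num_cached_tokens=KVCACHE_BLOCK_SIZE)
tokens_to_compute = len(seq) - seq.num_cached_tokens
print(tokens_to_compute)  # 0 -> zero-length tensors are built and handed to the CUDA kernels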

Solution

1. Scheduler Optimization (scheduler.py)

Before: All sequences were added to scheduled_seqs regardless of cache status

num_seqs += 1
scheduled_seqs.append(seq)
num_batched_tokens += len(seq) - seq.num_cached_tokens

After: Only sequences with tokens_to_compute > 0 are scheduled for computation

tokens_to_compute = len(seq) - seq.num_cached_tokens
if tokens_to_compute > 0:
    num_seqs += 1
    scheduled_seqs.append(seq)
    num_batched_tokens += tokens_to_compute

Benefits:

  • ✅ Prevents empty tensor creation
  • ✅ Eliminates unnecessary CUDA kernel launches
  • ✅ Improves batch efficiency in high cache-hit scenarios
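For context, here is a simplified sketch of where the guard sits in a prefill scheduling pass. The queue names and limits (waiting, running, max_num_seqs, max_num_batched_tokens) are assumptions and may not match the actual scheduler exactly:

from collections import deque

def schedule_prefill(waiting: deque, running: list,
                     max_num_seqs: int, max_num_batched_tokens: int):
    # Returns the sequences that actually need prefill computation.
    scheduled_seqs = []
    num_seqs = 0
    num_batched_tokens = 0
    while waiting and num_seqs < max_num_seqs:
        seq = waiting[0]
        tokens_to_compute = len(seq) - seq.num_cached_tokens
        if num_batched_tokens + tokens_to_compute > max_num_batched_tokens:
            break
        waiting.popleft()
        if tokens_to_compute > 0:
            # Still has uncached tokens: include it in this prefill batch.
            num_seqs += 1
            scheduled_seqs.append(seq)
            num_batched_tokens += tokens_to_compute
        # Fully cached or not, the sequence moves to the running queue so
        # the decode phase picks it up on the next step.
        running.append(seq)
    return scheduled_seqs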

2. Block Manager (block_manager.py)

Before: An assertion failed when last_block.hash != -1 (i.e., the last block already had a computed hash) in certain edge cases

assert last_block.hash == -1
# ... hash computation code

After: Conditional check to handle edge cases gracefully

if last_block.hash == -1:
    # ... hash computation code

Purpose: This ensures compatibility with fully cached sequences, where the last block's hash has already been computed correctly and does not need to be recalculated.
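A minimal sketch of where that conditional fits; the Block layout and compute_hash below are assumptions standing in for the real block_manager internals:

from dataclasses import dataclass
import hashlib

@dataclass
class Block:
    # Toy stand-in for a KV-cache block; -1 marks an uncomputed hash.
    token_ids: list
    hash: int = -1

def compute_hash(token_ids, prefix_hash: int = -1) -> int:
    # Illustrative hash function; the real implementation may differ.
    digest = hashlib.sha256(repr((prefix_hash, token_ids)).encode()).digest()
    return int.from_bytes(digest[:8], "little")

def finalize_last_block(last_block: Block, prefix_hash: int) -> None:
    # Previously this asserted last_block.hash == -1, which fired for
    # fully cached sequences whose last block already carried a valid hash.
    if last_block.hash == -1:
        last_block.hash = compute_hash(last_block.token_ids, prefix_hash)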

Closes #114

I look forward to your review and would appreciate any feedback.

codingliuyg avatar Oct 16 '25 09:10 codingliuyg

I think even if the entire seq is cached, it should still be scheduled, and the input at this time only needs to be the last token.

GentleCold avatar Oct 28 '25 13:10 GentleCold

I think even if the entire seq is cached, it should still be scheduled, and the input at this time only needs to be the last token.

@GentleCold You're absolutely right, and this doesn't conflict with my code. You're referring to the decode phase, and that's exactly what my implementation does: fully cached sequences are put directly into the running queue to be processed in the decode phase.

The key point is that fully cached sequences skip the prefill computation (where tokens_to_compute == 0), but they are still scheduled for decode operations.
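For reference, a rough illustration of what that decode step consumes; the function and names are illustrative, not the actual engine code:

def prepare_decode_inputs(running_seqs):
    # In the decode phase each running sequence contributes exactly one
    # input token (its last token), no matter how much of its prefix
    # was served from the cache.
    input_ids = [seq.token_ids[-1] for seq in running_seqs]
    positions = [len(seq) - 1 for seq in running_seqs]
    return input_ids, positions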

codingliuyg avatar Nov 03 '25 03:11 codingliuyg

I see now, thank you for your explanation :)

GentleCold avatar Nov 04 '25 12:11 GentleCold