nano-vllm

Prevented fully cached sequences from being processed in prefill phase

Open codingliuyg opened this issue 1 month ago • 3 comments

Thank you for this amazing open-source project!

Problem

Fixes the crash reported in #114, which occurs when the prompt length exactly equals kvcache_block_size (256 tokens). Additionally, this change optimizes scheduler performance by preventing fully cached sequences from entering prefill computation.

Root Cause Analysis

  1. Crash Issue: When seq.num_cached_tokens == len(seq), every token is already cached, so tokens_to_compute = 0 (see the sketch below)
  2. Empty Tensor Problem: Zero-length tensors are created and passed to CUDA kernels, causing an "invalid configuration argument" error
  3. Resource Waste: Fully cached sequences were still processed in the prefill phase, wasting compute
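For illustration, the failing condition can be reproduced with a minimal sketch; the Seq class below is a hypothetical stand-in for nano-vllm's Sequence, and the block size is taken from the report:

from dataclasses import dataclass, field

KVCACHE_BLOCK_SIZE = 256  # assumed block size from the report in #114

@dataclass
class Seq:
    # Hypothetical stand-in for nano-vllm's Sequence class.
    token_ids: list[int] = field(default_factory=list)
    num_cached_tokens: int = 0

    def __len__(self) -> int:
        return len(self.token_ids)

# The prompt length exactly equals the block size and the whole prompt
# hits the prefix cache, so nothing is left to compute in prefill.
seq = Seq(token_ids=list(range(KVCACHE_BLOCK_SIZE)),
          num_cached_tokens=KVCACHE_BLOCK_SIZE)
tokens_to_compute = len(seq) - seq.num_cached_tokens
print(tokens_to_compute)  # 0 -> zero-length tensors are built and handed to the CUDA kernels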

Solution

1. Scheduler Optimization (scheduler.py)

Before: All sequences were added to scheduled_seqs regardless of cache status

num_seqs += 1
scheduled_seqs.append(seq)
num_batched_tokens += len(seq) - seq.num_cached_tokens

After: Only sequences with tokens_to_compute > 0 are scheduled for computation

tokens_to_compute = len(seq) - seq.num_cached_tokens
if tokens_to_compute > 0:
    num_seqs += 1
    scheduled_seqs.append(seq)
    num_batched_tokens += tokens_to_compute

Benefits:

  • ✅ Prevents empty tensor creation
  • ✅ Eliminates unnecessary CUDA kernel launches
  • ✅ Improves batch efficiency in high cache-hit scenarios
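For context, here is a simplified sketch of where the guard sits in a prefill scheduling pass. The queue names and limits (waiting, running, max_num_seqs, max_num_batched_tokens) are assumptions and may not match the actual scheduler exactly:

from collections import deque

def schedule_prefill(waiting: deque, running: list,
                     max_num_seqs: int, max_num_batched_tokens: int):
    # Returns the sequences that actually need prefill computation.
    scheduled_seqs = []
    num_seqs = 0
    num_batched_tokens = 0
    while waiting and num_seqs < max_num_seqs:
        seq = waiting[0]
        tokens_to_compute = len(seq) - seq.num_cached_tokens
        if num_batched_tokens + tokens_to_compute > max_num_batched_tokens:
            break
        waiting.popleft()
        if tokens_to_compute > 0:
            # Still has uncached tokens: include it in this prefill batch.
            num_seqs += 1
            scheduled_seqs.append(seq)
            num_batched_tokens += tokens_to_compute
        # Fully cached or not, the sequence moves to the running queue so
        # the decode phase picks it up on the next step.
        running.append(seq)
    return scheduled_seqs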

2. Block Manager (block_manager.py)

Before: An assertion failed when last_block.hash != -1 (i.e., the last block already had a computed hash) in certain edge cases

assert last_block.hash == -1
# ... hash computation code

After: Conditional check to handle edge cases gracefully

if last_block.hash == -1:
    # ... hash computation code

Purpose: This ensures compatibility with fully cached sequences, where the last block's hash has already been computed correctly and does not need to be recalculated.
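A minimal sketch of where that conditional fits; the Block layout and compute_hash below are assumptions standing in for the real block_manager internals:

from dataclasses import dataclass
import hashlib

@dataclass
class Block:
    # Toy stand-in for a KV-cache block; -1 marks an uncomputed hash.
    token_ids: list
    hash: int = -1

def compute_hash(token_ids, prefix_hash: int = -1) -> int:
    # Illustrative hash function; the real implementation may differ.
    digest = hashlib.sha256(repr((prefix_hash, token_ids)).encode()).digest()
    return int.from_bytes(digest[:8], "little")

def finalize_last_block(last_block: Block, prefix_hash: int) -> None:
    # Previously this asserted last_block.hash == -1, which fired for
    # fully cached sequences whose last block already carried a valid hash.
    if last_block.hash == -1:
        last_block.hash = compute_hash(last_block.token_ids, prefix_hash)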

Closes #114

I look forward to your review and would appreciate any feedback.

codingliuyg avatar Oct 16 '25 09:10 codingliuyg

I think even if the entire seq is cached, it should still be scheduled, and the input at this time only needs to be the last token.

GentleCold avatar Oct 28 '25 13:10 GentleCold

I think even if the entire seq is cached, it should still be scheduled, and the input at this time only needs to be the last token.

@GentleCold You're absolutely right, and this doesn't conflict with my code. You're referring to the decode phase, and that's exactly what my implementation does: fully cached sequences are put directly into the running queue to be processed in the decode phase.

The key point is that fully cached sequences skip the prefill computation (where tokens_to_compute == 0), but they are still scheduled for decode operations.
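For reference, a rough illustration of what that decode step consumes; the function and names are illustrative, not the actual engine code:

def prepare_decode_inputs(running_seqs):
    # In the decode phase each running sequence contributes exactly one
    # input token (its last token), no matter how much of its prefix
    # was served from the cache.
    input_ids = [seq.token_ids[-1] for seq in running_seqs]
    positions = [len(seq) - 1 for seq in running_seqs]
    return input_ids, positions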

codingliuyg avatar Nov 03 '25 03:11 codingliuyg

I see now, thank you for your explanation :)

GentleCold avatar Nov 04 '25 12:11 GentleCold