nano-vllm icon indicating copy to clipboard operation
nano-vllm copied to clipboard

fix: update can_append and may_append logic for block allocation

Open skyloevil opened this issue 5 months ago • 5 comments

Fix: Correct off-by-one error in KV-Cache block allocation

This pull request addresses a critical off-by-one error in the BlockManager's logic for allocating new KV-Cache blocks during the decoding phase. The fix ensures that block allocation is timely and atomic, preventing potential errors and ensuring the stability of the generation process.

The Problem

The previous implementation had a logical flaw in the can_append and may_append methods, which manifested in two ways:

  1. Delayed Allocation Check: The condition len(seq) % self.block_size == 1 in can_append was incorrect. It checked for the need to allocate a new block after the first token of that new block had already been generated. This is too late and represents a classic off-by-one error in state management. The check should occur precisely when the last block has just been filled.

  2. Fragmented Logic: The hashing of a completed block and the allocation of a new block were handled at different times. The hash was computed when len(seq) % self.block_size == 0, but the new block was only allocated when len(seq) % self.block_size == 1. These two actions should be part of a single, atomic operation that occurs when a block is finalized.

This flawed logic could lead to incorrect state or errors when a sequence's length crossed a block_size boundary during decoding.

The Solution

This PR corrects the logic to be proactive and atomic:

  1. Corrected Condition: The condition in can_append and may_append is now len(seq) > 0 and len(seq) % self.block_size == 0. This accurately identifies the exact moment a block has been completely filled.

  2. Atomic Operation: The logic within may_append has been consolidated. Now, when the condition is met, the following actions are performed in a single step:

    • The just-filled block has its hash computed and stored for prefix caching.
    • A new, empty block is immediately allocated from the free list.
    • This new block is appended to the sequence's block_table.

This ensures that a sequence always has a ready and available block before the next token's KV-Cache needs to be written.

Verification Process

The correctness of this change was verified through the following steps:

  1. Logical Analysis: A thorough review of the BlockManager's state machine was conducted, focusing on the lifecycle of a Sequence as it grows during decoding. The analysis confirmed that the original logic failed to handle the boundary condition of a block becoming full.

  2. Critical Scenario Simulation: The primary test case considered was a sequence growing to a length that is an exact multiple of block_size (e.g., len(seq) == 256).

    • Before Fix: In this scenario, the original code would have proceeded to the next generation step without a new block allocated, leading to a state where there is no valid memory location to write the new token's KV-Cache.
    • After Fix: The corrected logic correctly identifies this boundary condition. It finalizes the full block by computing its hash and immediately allocates a new block. This ensures the decoding process can continue seamlessly and correctly.

This change is a crucial bug fix that enhances the robustness and correctness of the KV-Cache management system. I kindly request a review and approval for this pull request.

skyloevil avatar Jul 06 '25 14:07 skyloevil