nano-vllm
[BUG] Crashes when the prompt length exactly equals kvcache_block_size
```python
prompts = [
    "Hello" * 248,
] * 513
```
I ran example.py with the above prompts, and it crashed with the following error:
```
[rank0]: torch.AcceleratorError: CUDA error: invalid configuration argument
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
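For reference, a self-contained version of the reproduction is sketched below. It assumes the vLLM-style API from the project README (`LLM`, `SamplingParams`, `llm.generate`); the model path and sampling settings are placeholders.

```python
# Standalone reproduction sketch, modeled on example.py.
# The model path is a placeholder; any supported model should do.
from nanovllm import LLM, SamplingParams

llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True)
sampling_params = SamplingParams(temperature=0.6, max_tokens=64)

# 513 identical prompts: later sequences hit the prefix cache populated
# by earlier ones, and the prompt is sized so its token count lands
# exactly on a kvcache block boundary.
prompts = ["Hello" * 248] * 513
outputs = llm.generate(prompts, sampling_params)
```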
The crash seems to be caused by the prefix cache: when the prompt's token count is an exact multiple of kvcache_block_size, every block hits the cache, leaving a model input tensor of length 0.
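If that diagnosis is right, the failure mode can be sketched in isolation (illustrative numbers and variable names only, not nano-vllm's actual internals):

```python
# Illustrative only: hypothetical names, not nano-vllm's code.
block_size = 256             # assume kvcache_block_size == 256
prompt_len = block_size      # prompt tokenizes to exactly one block

# With the prefix cache warm, every full block of the prompt is cached:
num_cached_tokens = (prompt_len // block_size) * block_size
num_input_tokens = prompt_len - num_cached_tokens
assert num_input_tokens == 0     # empty input tensor -> CUDA kernel
                                 # launch with an invalid configuration

# A common guard (vLLM takes a similar approach) is to always recompute
# at least the last prompt token, so prefill never sees zero tokens:
if num_input_tokens == 0:
    num_cached_tokens -= 1
    num_input_tokens = 1
```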