nano-vllm
[BUG] Crashes when the prompt length exactly equals kvcache_block_size
```python
prompts = [
    "Hello" * 248,
] * 513
```
I ran example.py with the above prompts, and it crashed with the following error:
```
[rank0]: torch.AcceleratorError: CUDA error: invalid configuration argument
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
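For reference, a self-contained version of the reproduction is sketched below. It assumes the vLLM-style API from the project README (`LLM`, `SamplingParams`, `llm.generate`); the model path and sampling settings are placeholders.

```python
# Standalone reproduction sketch, modeled on example.py.
# The model path is a placeholder; any supported model should do.
from nanovllm import LLM, SamplingParams

llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True)
sampling_params = SamplingParams(temperature=0.6, max_tokens=64)

# 513 identical prompts: later sequences hit the prefix cache populated
# by earlier ones, and the prompt is sized so its token count lands
# exactly on a kvcache block boundary.
prompts = ["Hello" * 248] * 513
outputs = llm.generate(prompts, sampling_params)
```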
The crash seems to be caused by the prefix cache: when the prompt's token count is an exact multiple of kvcache_block_size, every block hits the cache, leaving a model input tensor of length 0.
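If that diagnosis is right, the failure mode can be sketched in isolation (illustrative numbers and variable names only, not nano-vllm's actual internals):

```python
# Illustrative only: hypothetical names, not nano-vllm's code.
block_size = 256             # assume kvcache_block_size == 256
prompt_len = block_size      # prompt tokenizes to exactly one block

# With the prefix cache warm, every full block of the prompt is cached:
num_cached_tokens = (prompt_len // block_size) * block_size
num_input_tokens = prompt_len - num_cached_tokens
assert num_input_tokens == 0     # empty input tensor -> CUDA kernel
                                 # launch with an invalid configuration

# A common guard (vLLM takes a similar approach) is to always recompute
# at least the last prompt token, so prefill never sees zero tokens:
if num_input_tokens == 0:
    num_cached_tokens -= 1
    num_input_tokens = 1
```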