[Performance]: Benchmarking vLLM copy kernel and PyTorch index copy
Proposal to improve performance
I opened this issue to track a random idea:
Currently we have a copy kernel:
https://github.com/vllm-project/vllm/blob/e288df0632d5bdde76c20bed8310b46d35b8e5ac/csrc/cache_kernels.cu#L214-L220
Essentially, this kernel performs the following vectorized copy:
```python
key_cache_view = key_cache.reshape(-1, num_heads * head_size)
value_cache_view = value_cache.reshape(-1, num_heads * head_size)
key_view = key.reshape(-1, num_heads * head_size)
value_view = value.reshape(-1, num_heads * head_size)

key_cache_view[slot_mapping] = key_view
value_cache_view[slot_mapping] = value_view
```
The caveat is that slot_mapping can contain a special value: -1 means skip copying that token.
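For illustration, a pure-PyTorch version that respects that convention would have to mask those entries out first (a sketch reusing the variable names from the snippet above):

```python
# Sketch: emulating the -1 "skip" semantics with plain PyTorch indexing.
valid = slot_mapping >= 0
key_cache_view[slot_mapping[valid]] = key_view[valid]
value_cache_view[slot_mapping[valid]] = value_view[valid]
```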
If possible, we could reserve a slot in the block manager for the padded KV entries; then we could just use PyTorch's index copy, without maintaining a separate copy kernel ourselves.
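A minimal sketch of that idea, continuing the snippet above and assuming the block manager reserves one dedicated slot (the name NULL_SLOT and the choice of slot 0 are illustrative, not vLLM identifiers):

```python
# Sketch: if the block manager reserves a null slot, the -1 entries can be
# redirected there and the copy becomes a plain index assignment.
NULL_SLOT = 0  # hypothetical slot reserved by the block manager
safe_mapping = slot_mapping.masked_fill(slot_mapping < 0, NULL_SLOT)
key_cache_view[safe_mapping] = key_view
value_cache_view[safe_mapping] = value_view
# The null slot receives garbage writes, but its contents are never read back.
```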
Two TODOs:
- [ ] What is the overhead of reserving a slot for padded KV in the block manager?
- [ ] Does the PyTorch copy kernel outperform the current hand-written one?
cc @cadedaniel, who knows a lot about the block manager, and @WoosukKwon, who knows a lot about CUDA kernels.
Idea seems good to me. The block manager v2 will soon support the notion of a null block; we can extend it to allocate such a null block even when sliding window is disabled.
https://github.com/vllm-project/vllm/pull/4545/files#diff-c5d7846ef0a9ff5a745d767aa28fea36bd34a97b1ef4c31ad9f8f48bcd9730b4R127
Looking forward to that!
Willing to help with benchmarking the PyTorch copy kernel vs. the current hand-written one.
I benchmarked the performance of the vLLM KV cache copy kernel vs. PyTorch index copy. It seems that the vLLM kernel is faster:
- vLLM kernel: 4.4006 ms
- Torch kernel: 11.9144 ms
The benchmarking script I used: benchmark_kernel.txt
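Since the attachment is not reproduced here, a minimal sketch of what timing the PyTorch side might look like (shapes, dtypes, and iteration counts are assumptions; the vLLM side would call the compiled cache op instead and is omitted):

```python
import torch

num_slots, num_tokens = 262_144, 32_768
hidden = 16 * 128  # num_heads * head_size, arbitrary for the sketch

key_cache_view = torch.zeros(num_slots, hidden, device="cuda", dtype=torch.half)
key_view = torch.randn(num_tokens, hidden, device="cuda", dtype=torch.half)
slot_mapping = torch.randperm(num_slots, device="cuda")[:num_tokens]

# Warm up so first-launch overhead is excluded from the measurement.
for _ in range(10):
    key_cache_view[slot_mapping] = key_view

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    key_cache_view[slot_mapping] = key_view
end.record()
torch.cuda.synchronize()
print(f"Torch index copy: {start.elapsed_time(end) / 100:.3f} ms")
```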