
[Performance]: benchmarking vllm copy kernel and pytorch index copy

Open · youkaichao opened this issue · 3 comments

Proposal to improve performance

I opened this issue to track a random idea:

Currently we have a copy kernel:

https://github.com/vllm-project/vllm/blob/e288df0632d5bdde76c20bed8310b46d35b8e5ac/csrc/cache_kernels.cu#L214-L220

Essentially this does the following vector copy:

    # Flatten the caches and the new keys/values into 2-D views of shape
    # (num_slots_or_tokens, num_heads * head_size), then scatter each token's
    # key/value into the cache row given by slot_mapping.
    key_cache_view = key_cache.reshape(-1, num_heads * head_size)
    value_cache_view = value_cache.reshape(-1, num_heads * head_size)
    key_view = key.reshape(-1, num_heads * head_size)
    value_view = value.reshape(-1, num_heads * head_size)
    key_cache_view[slot_mapping] = key_view
    value_cache_view[slot_mapping] = value_view

The caveat is that slot_mapping contains a special value: -1 means the token's KV should not be copied (e.g., for padding).
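
For reference, a minimal sketch (function name hypothetical) of how the -1 skip could be emulated in plain PyTorch with a boolean mask; the extra mask and gather are part of what a fused kernel avoids:

    import torch

    def copy_with_skip(cache_view: torch.Tensor,
                       src_view: torch.Tensor,
                       slot_mapping: torch.Tensor) -> None:
        # Entries of -1 in slot_mapping mean "skip this token" (padding).
        valid = slot_mapping >= 0
        cache_view[slot_mapping[valid]] = src_view[valid]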

If possible, we could instead reserve a slot in the block manager for padded KV; then we could use PyTorch's index copy directly, without maintaining a separate copy kernel ourselves.
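
A minimal sketch of the proposed alternative, assuming the block manager reserves one extra cache row as a null slot (null_slot below is a hypothetical name) that all padded tokens are mapped to, so the copy degenerates to a plain index_copy_:

    import torch

    num_slots, num_heads, head_size = 1024, 8, 64
    hidden = num_heads * head_size
    null_slot = num_slots  # hypothetical: one reserved row that is never read

    # The cache gets one extra row; padded tokens write into it harmlessly.
    key_cache_view = torch.zeros(num_slots + 1, hidden)
    key_view = torch.randn(16, hidden)

    # Padding maps to null_slot instead of -1, so no masking is needed.
    slot_mapping = torch.randint(0, num_slots, (16,))
    slot_mapping[::4] = null_slot  # pretend every 4th token is padding

    # On GPU this dispatches to PyTorch's own CUDA index-copy kernel.
    key_cache_view.index_copy_(0, slot_mapping, key_view)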

Two TODOs:

  • [ ] What is the overhead of reserving a slot for padded kv in the block manager?
  • [ ] Does PyTorch copy kernel outperform the current hand-written one?

cc @cadedaniel who knows a lot about the block manager, and @WoosukKwon who knows a lot about CUDA kernels.


youkaichao · May 09 '24 02:05

Idea seems good to me. Block manager v2 will soon support the notion of a null block; we can extend it to allocate such a null block even when sliding window is disabled.

https://github.com/vllm-project/vllm/pull/4545/files#diff-c5d7846ef0a9ff5a745d767aa28fea36bd34a97b1ef4c31ad9f8f48bcd9730b4R127

cadedaniel · May 10 '24 04:05

Looking forward to that!

youkaichao · May 10 '24 05:05

Willing to help with benchmarking the PyTorch copy kernel vs. the current hand-written one.

KuntaiDu · May 14 '24 05:05

I benchmarked the vLLM KV cache copy kernel vs. PyTorch index copy. The vLLM kernel appears to be faster:

  • vLLM kernel: 4.4006272315979 ms
  • Torch kernel: 11.914390563964844 ms

The benchmarking script that I used: benchmark_kernel.txt
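
The attached script is not inlined here, but a comparison of this kind could be timed with CUDA events along the following lines (shapes, dtypes, and iteration counts are assumptions, not the values from benchmark_kernel.txt):

    import torch

    def time_ms(fn, iters=100, warmup=10):
        # Warm up, then time with CUDA events to measure GPU time accurately.
        for _ in range(warmup):
            fn()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters

    num_slots, num_tokens, hidden = 2 ** 20, 4096, 8 * 128  # assumed sizes
    cache = torch.zeros(num_slots, hidden, device="cuda", dtype=torch.half)
    src = torch.randn(num_tokens, hidden, device="cuda", dtype=torch.half)
    slots = torch.randint(0, num_slots, (num_tokens,), device="cuda")

    print("index_copy_:", time_ms(lambda: cache.index_copy_(0, slots, src)), "ms")
    # The vLLM kernel would be timed the same way through its cache_ops
    # binding; the exact entry point varies across vLLM versions.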

KuntaiDu · May 20 '24 17:05

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] · Oct 27 '24 02:10