[Misc]: Improving vLLM KV Cache Transfer Efficiency with NCCL P2P Communication
Anything you want to discuss about vllm.
I want to use NCCL point-to-point (P2P) communication to transfer the vLLM KV cache from the prefill node to the decode node for decoupled (disaggregated) inference. Because the KV cache in vLLM is stored as a list of tensors, transferring 128 blocks requires sending roughly 16,384 block slices. These slices are not contiguous in GPU memory, so looping over per-slice sends is inefficient and cannot saturate the communication bandwidth.
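For reference, this is roughly what the per-slice loop looks like today. The tensor layout and function names below are hypothetical (the actual vLLM KV cache layout may differ); the point is only that the host-side loop issues thousands of small, strided sends.

```python
import torch
import torch.distributed as dist

def send_kv_blocks_naive(kv_caches, block_ids, dst_rank):
    """Naive per-slice transfer: one P2P send per (layer, K/V, block) slice.

    Hypothetical layout: kv_caches is a list of per-layer tensors of shape
    [2, num_blocks, block_size, num_heads, head_dim] (K and V stacked).
    The slice count multiplies out to thousands of sends per 128-block
    transfer, each on a small strided slice, so bandwidth is underused.
    """
    for layer_cache in kv_caches:          # one tensor per layer
        for kv in range(2):                # K and V planes
            for block_id in block_ids:     # e.g. 128 blocks per transfer
                # .contiguous() forces a copy because the slice is strided
                dist.send(layer_cache[kv, block_id].contiguous(), dst=dst_rank)
```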
Therefore, I am considering packing these slices into a single large contiguous tensor before transmission. On the receiving side, the node would split this large tensor and write the data back to the corresponding block positions based on the slice indices. However, packing and unpacking up to 16,384 slices is itself quite time-consuming. Can CUDA operations be used to parallelize this gather/scatter step and improve performance? A rough sketch of the idea is below.
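A minimal sketch of the pack-and-scatter approach, assuming the same hypothetical layout as above. `index_select` and `index_copy_` each launch a single CUDA gather/scatter kernel per layer, so the block copies run in parallel on the GPU instead of in a Python loop, and only one NCCL send/recv pair is issued per transfer. This is only an illustration, not the intended final design.

```python
import torch
import torch.distributed as dist

def send_kv_blocks_packed(kv_caches, block_ids, dst_rank):
    """Pack all requested blocks into one contiguous buffer and send once."""
    idx = torch.as_tensor(block_ids, device=kv_caches[0].device)
    # Gather into [num_layers, 2, len(block_ids), ...] with one CUDA
    # gather kernel per layer; the stacked result is contiguous.
    packed = torch.stack([layer.index_select(1, idx) for layer in kv_caches])
    dist.send(packed, dst=dst_rank)
    return packed.shape  # the receiver needs shape/dtype out of band

def recv_kv_blocks_packed(kv_caches, block_ids, src_rank, packed_shape):
    """Receive the packed buffer and scatter it back into the local cache."""
    idx = torch.as_tensor(block_ids, device=kv_caches[0].device)
    packed = torch.empty(packed_shape, dtype=kv_caches[0].dtype,
                         device=kv_caches[0].device)
    dist.recv(packed, src=src_rank)
    for layer_id, layer in enumerate(kv_caches):
        # index_copy_ writes all blocks for this layer in one CUDA kernel.
        layer.index_copy_(1, idx, packed[layer_id])
```

A custom fused CUDA (or Triton) kernel could go further and gather across all layers in a single launch, but even the stock indexing ops above avoid the 16,384-iteration host loop.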
Additionally, does this decoupled KV cache transfer approach have any design flaws? Do you have any better suggestions?