
Migrate KV cache to PyTorch

Open zyf-gh opened this issue 10 months ago • 3 comments

I would like to ask whether the KV cache generated during the prefilling stage can be used as PyTorch's KV cache, so that PyTorch can perform the subsequent decoding on another device.

zyf-gh avatar Feb 27 '25 06:02 zyf-gh

Our KVCache uses a different layout from PyTorch's KV cache. Since you need to transfer the data to another device, you can reorganize the memory layout of our KVCache before transferring it.

chenghuaWang avatar Feb 27 '25 06:02 chenghuaWang


Where can I find the specific differences between their layouts? I called key.printData<mllm_fp16_t>() in modeling_phonelm.hpp and found that the data is printed as (batch, head, sequence, dimension), which is the same layout PyTorch uses.

zyf-gh avatar Feb 27 '25 08:02 zyf-gh

The sequence length of the KVCache in MLLM is set to cache_limit. If I remember correctly, the PyTorch implementation of the KV cache concatenates a list of tensors as decoding proceeds. To obtain a properly formatted KVCache compatible with PyTorch, you may need to slice the MLLM KVCache down to the sequence length actually filled during prefill before it can be used within PyTorch.
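
A minimal sketch of that slicing step, assuming each layer's K and V have been exported from MLLM as raw fp16 buffers laid out as (batch, n_kv_heads, cache_limit, head_dim); the function name, buffer arguments, and seq_len below are placeholders for whatever your export step actually produces:

```python
import numpy as np
import torch

def mllm_cache_to_torch(raw_bytes, batch, n_kv_heads, cache_limit, head_dim, seq_len):
    """Reinterpret an exported MLLM cache buffer and slice off the unused tail."""
    full = torch.from_numpy(
        np.frombuffer(raw_bytes, dtype=np.float16)
        .reshape(batch, n_kv_heads, cache_limit, head_dim)
        .copy()  # own the memory before handing it to torch
    )
    # Keep only the positions written during prefill; the rest of the
    # cache_limit-sized buffer is padding that PyTorch must not see.
    return full[:, :, :seq_len, :].contiguous()

# Hypothetical usage: build the legacy Hugging Face-style cache, a tuple of
# per-layer (key, value) pairs of shape (batch, num_heads, seq_len, head_dim).
# past_key_values = tuple(
#     (mllm_cache_to_torch(k_buf[i], B, H, LIMIT, D, seq_len),
#      mllm_cache_to_torch(v_buf[i], B, H, LIMIT, D, seq_len))
#     for i in range(num_layers)
# )
```

From there, the sliced per-layer (key, value) pairs can be handed to a PyTorch decoding loop as its initial cache state on the other device.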

chenghuaWang avatar Feb 27 '25 09:02 chenghuaWang