
Migrate KV cache to PyTorch

Open zyf-gh opened this issue 10 months ago • 3 comments

I would like to ask whether the KV cache generated during the prefilling stage can be used as PyTorch's KV cache, so that PyTorch can perform the subsequent decoding on another device.

zyf-gh avatar Feb 27 '25 06:02 zyf-gh

Our KVCache uses a different layout from PyTorch's KV cache. Since you need to transfer the data to another device, you can reorganize the memory layout of our KVCache before transferring it.

chenghuaWang avatar Feb 27 '25 06:02 chenghuaWang


Where can I find the specific differences between their layouts? I called key.printData<mllm_fp16_t>() in modeling_phonelm.hpp and found that the data is printed as (batch, head, sequence, dimension), which is the same layout PyTorch uses.

zyf-gh avatar Feb 27 '25 08:02 zyf-gh

The sequence length of the KVCache in MLLM is set to cache_limit. If I remember correctly, the PyTorch implementation of the KV cache concatenates a list of tensors as decoding proceeds. To obtain a properly formatted KVCache compatible with PyTorch, you may need to slice the MLLM KVCache down to the sequence length actually filled during prefill before it can be used within PyTorch.
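
A minimal sketch of that slicing step, assuming each layer's K and V have been exported from MLLM as raw fp16 buffers laid out as (batch, n_kv_heads, cache_limit, head_dim); the function name, buffer arguments, and seq_len below are placeholders for whatever your export step actually produces:

```python
import numpy as np
import torch

def mllm_cache_to_torch(raw_bytes, batch, n_kv_heads, cache_limit, head_dim, seq_len):
    """Reinterpret an exported MLLM cache buffer and slice off the unused tail."""
    full = torch.from_numpy(
        np.frombuffer(raw_bytes, dtype=np.float16)
        .reshape(batch, n_kv_heads, cache_limit, head_dim)
        .copy()  # own the memory before handing it to torch
    )
    # Keep only the positions written during prefill; the rest of the
    # cache_limit-sized buffer is padding that PyTorch must not see.
    return full[:, :, :seq_len, :].contiguous()

# Hypothetical usage: build the legacy Hugging Face-style cache, a tuple of
# per-layer (key, value) pairs of shape (batch, num_heads, seq_len, head_dim).
# past_key_values = tuple(
#     (mllm_cache_to_torch(k_buf[i], B, H, LIMIT, D, seq_len),
#      mllm_cache_to_torch(v_buf[i], B, H, LIMIT, D, seq_len))
#     for i in range(num_layers)
# )
```

From there, the sliced per-layer (key, value) pairs can be handed to a PyTorch decoding loop as its initial cache state on the other device.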

chenghuaWang avatar Feb 27 '25 09:02 chenghuaWang