Migrate KV cache to PyTorch
I would like to ask whether the KV cache generated during the prefill stage can be reused as PyTorch's KV cache, so that PyTorch can perform the subsequent decoding on another device.
Our KVCache layout differs from PyTorch's KV cache layout. Since you need to transfer the data to another device anyway, you can reorganize the memory layout of our KVCache before transferring it.
Where can I find the specific differences between their layouts? I called key.printData<mllm_fp16_t>() in modeling_phonelm.hpp and found that the data is printed in (batch, head, sequence, dimension) order, which is the same as PyTorch's.
In MLLM, the sequence dimension of the KVCache is allocated up front at cache_limit. If I remember correctly, PyTorch's KV cache is built by concatenating a list of per-step tensors along the sequence dimension. To obtain a properly formatted KVCache compatible with PyTorch, you may need to slice the MLLM KVCache down to the actual sequence length before it can be used within PyTorch.
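A minimal sketch of that slicing step, assuming the MLLM cache arrives on the PyTorch side as tensors of shape (batch, head, cache_limit, dim) with only the first seq_len positions filled by prefill (the function name and shapes here are illustrative assumptions, not MLLM API):

```python
import torch

def mllm_to_torch_kv(mllm_key, mllm_value, seq_len):
    """Slice an MLLM-style KV cache down to the tokens actually filled.

    Assumed layout (not confirmed by MLLM docs): (batch, head, cache_limit, dim),
    where only positions [0, seq_len) along the sequence axis hold valid data.
    """
    key = mllm_key[:, :, :seq_len, :].contiguous()
    value = mllm_value[:, :, :seq_len, :].contiguous()
    return key, value

# Toy example: cache_limit=8, but only 3 tokens were prefilled.
k = torch.zeros(1, 2, 8, 4)
v = torch.zeros(1, 2, 8, 4)
k_t, v_t = mllm_to_torch_kv(k, v, seq_len=3)
print(k_t.shape)  # torch.Size([1, 2, 3, 4])
```

The `.contiguous()` call matters if you transfer the raw buffer to another device: slicing alone produces a view whose underlying memory still has the cache_limit stride.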