
[Feature Request] MLX_lm: Store KV cache of computed prompts to disk to avoid re-compute in follow-up runs

Open mark-lord opened this issue 1 year ago • 1 comments

It would be great if MLX_lm supported a --cache_prompt flag like the one in llama.cpp (link to their discussion + eventual PR).

This would be a big benefit in reducing latency / start-up time for repeated runs that include the same prompts - e.g. chatbot applications with a running chat log, or long multi-shot examples, since these take up a lot of tokens (and are very useful in production environments).

I'm no expert, so I likely won't be able to assist with a PR on this :'( But looking at the discussion for the initial llama.cpp integration, I have a couple of observations:

  • It seems that the cache should store the model's memory_k and memory_v data, along with other necessary state information, including (but not limited to?) n_past, the RNG state, logits, and embedding vectors (a rough serialization sketch follows this list)
  • I think the llama.cpp team may have implemented some form of lightweight compression (zstd?) for the cached data, as apparently the memory tensors are often compressible, especially when the full context is not used
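
For illustration, here is a minimal sketch of what persisting that kind of state to disk could look like in MLX. The container names (`kv_layers`, `n_past`) and the file layout are assumptions made up for this example, not mlx_lm's actual data structures or format.

```python
# Illustrative only: one possible way to serialize per-layer KV tensors plus
# the bookkeeping needed to resume generation. `kv_layers` and `n_past` are
# assumptions for this sketch, not mlx_lm's actual data structures.
import json
import mlx.core as mx

def save_prompt_state(path, kv_layers, n_past, tokens):
    # kv_layers: list of (keys, values) mx.array pairs, one pair per layer.
    # path should end in ".safetensors" so mx.load can dispatch on it later.
    arrays = {}
    for i, (k, v) in enumerate(kv_layers):
        arrays[f"layers.{i}.keys"] = k
        arrays[f"layers.{i}.values"] = v
    # safetensors metadata values must be strings.
    metadata = {"n_past": str(n_past), "tokens": json.dumps(tokens)}
    mx.save_safetensors(path, arrays, metadata=metadata)

def load_prompt_state(path):
    arrays, metadata = mx.load(path, return_metadata=True)
    n_layers = len(arrays) // 2
    kv_layers = [
        (arrays[f"layers.{i}.keys"], arrays[f"layers.{i}.values"])
        for i in range(n_layers)
    ]
    return kv_layers, int(metadata["n_past"]), json.loads(metadata["tokens"])
```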

The default behaviour of a --cache-prompt flag could be to save the KV cache to the model folder, as is done with trained adapters; it would then be useful to have additional flags to clear or delete the cache afterwards.

mark-lord avatar Aug 02 '24 14:08 mark-lord

Take a look at this open source library / repo: https://github.com/otriscon/llm-structured-output - they have an implementation of a reusable KV cache for MLX. I've gotten it working, and it works surprisingly well!

Jckwind avatar Aug 03 '24 17:08 Jckwind

I think we can close this! 🚀

Documentation on usage.

awni avatar Aug 30 '24 13:08 awni
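
For anyone landing here later, below is a minimal sketch of how a saved prompt cache can be reused across runs with the mlx_lm Python API. The helper names (make_prompt_cache, save_prompt_cache, load_prompt_cache in mlx_lm.models.cache), the prompt_cache argument to generate, and the example model name reflect my understanding of the library around this time and may differ in current versions; the documentation linked above is authoritative.

```python
# Sketch of caching a long shared prefix once and reusing it in later runs.
# Helper names and arguments are assumptions based on my reading of mlx_lm
# around this time; see the linked documentation for the exact usage.
from mlx_lm import load, generate
from mlx_lm.models.cache import (
    make_prompt_cache,
    save_prompt_cache,
    load_prompt_cache,
)

# Example model; any mlx_lm-compatible model repo should work.
model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

long_prefix = "..."  # e.g. a system prompt plus multi-shot examples
user_turn = "..."    # the new question appended in a follow-up run

# First run: prefill the cache with the shared prefix and save it to disk.
cache = make_prompt_cache(model)
generate(model, tokenizer, prompt=long_prefix, max_tokens=1, prompt_cache=cache)
save_prompt_cache("prefix_cache.safetensors", cache)

# Follow-up run: load the cache so only the new tokens need prefill.
cache = load_prompt_cache("prefix_cache.safetensors")
response = generate(model, tokenizer, prompt=user_turn, max_tokens=256, prompt_cache=cache)
print(response)
```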