Implement Persistent Prompt Cache to Reduce Time-to-First-Token in Chat Contexts
Implement a persistent prompt caching mechanism similar to the one used in mlx-lm (reference: https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/generate.py#L317-L318) to improve efficiency in chat applications.

Motivation:
As chat conversations grow longer, time-to-first-token increases because all previous tokens must be recomputed on every request. A persistent cache would let us reuse those earlier computations, keeping response times roughly constant regardless of conversation length.

Implementation Notes:
- Review the mlx-lm implementation for insights on the caching approach
- Design a mechanism to store and reuse the KV cache between inference calls
- Ensure proper cache invalidation when the context changes
- Add configuration options for cache size limits and persistence behavior
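The reuse-and-invalidate logic in the notes above could look roughly like the following minimal sketch, which tracks token IDs only. The `PromptCache` class, its `prepare` method, and the `max_tokens` option are hypothetical names for illustration, not the mlx-lm API; a real implementation would carry per-layer key/value tensors alongside the token list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PromptCache:
    """Illustrative persistent prompt cache (names and structure are
    assumptions for this sketch, not the mlx-lm API). It remembers the
    token prefix the cached KV entries correspond to, so later calls
    only run the model over the unseen suffix."""
    max_tokens: int = 4096  # assumed config knob: cache size limit
    tokens: List[int] = field(default_factory=list)

    def _common_prefix_len(self, prompt: List[int]) -> int:
        n = 0
        for cached, new in zip(self.tokens, prompt):
            if cached != new:
                break
            n += 1
        return n

    def prepare(self, prompt: List[int]) -> List[int]:
        """Return the suffix of `prompt` that still needs a forward pass,
        trimming the cache when the context has diverged (invalidation)."""
        keep = self._common_prefix_len(prompt)
        if keep < len(self.tokens):
            # Context changed: discard the stale tail. A real backend
            # would also truncate the per-layer key/value tensors here.
            self.tokens = self.tokens[:keep]
        suffix = prompt[keep:]
        self.tokens.extend(suffix)
        if len(self.tokens) > self.max_tokens:
            # Simplest size-limit policy: drop everything. A real
            # implementation might use a rotating KV cache instead.
            self.tokens = []
        return suffix
```

In a chat loop, each turn appends to the previous prompt, so `prepare` returns only the new turn's tokens; editing earlier context trims the cache back to the shared prefix. Persisting across processes would then amount to serializing this state (plus the backend's KV tensors) to disk between runs.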
References:
- mlx-lm prompt cache usage: https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/generate.py#L317-L318