
Implement Persistent Prompt Cache to Reduce Time-to-First-Token in Chat Contexts

Open · Blaizzy opened this issue 9 months ago · 0 comments

Implement a persistent prompt caching mechanism similar to the one used in mlx-lm (reference: https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/generate.py#L317-L318) to improve efficiency in chat applications.

Motivation: as chat conversations grow longer, time-to-first-token increases because all previous tokens must be recomputed on every turn. A persistent cache would let us reuse those earlier computations, keeping response times consistent regardless of conversation length.

Implementation Notes:

- Review the mlx-lm implementation for insights on the caching approach
- Design a mechanism to store and reuse the KV cache between inference calls
- Ensure proper cache invalidation when the context changes
- Add configuration options for cache size limits and persistence behavior
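As a rough sketch of the store/reuse/invalidate flow described above, here is a pure-Python outline. The `PersistentPromptCache` class and all of its method names are hypothetical (nothing here is mlx-lm or mlx-vlm API), and the actual KV arrays are elided; it tracks only the processed token IDs, which is the part that drives prefix reuse and invalidation:

```python
import pickle
from pathlib import Path


def common_prefix_len(a, b):
    """Length of the shared token prefix between two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PersistentPromptCache:
    """Hypothetical sketch: records the tokens whose KV entries are cached,
    so only the unseen suffix of a new prompt needs a forward pass.
    A real implementation would also hold and trim the KV arrays."""

    def __init__(self, max_tokens=4096):
        self.max_tokens = max_tokens  # configurable cache size limit
        self.tokens = []              # token IDs already processed

    def reuse_and_invalidate(self, prompt_tokens):
        """Return (reused, suffix): how many cached tokens remain valid,
        and the tokens that still need computation. Trims the cache when
        the new prompt diverges from what was cached (context change)."""
        reused = common_prefix_len(self.tokens, prompt_tokens)
        self.tokens = list(prompt_tokens[:reused])  # drop invalidated tail
        suffix = list(prompt_tokens[reused:])
        return reused, suffix

    def extend(self, new_tokens):
        """Record tokens after they are processed, enforcing the limit."""
        self.tokens.extend(new_tokens)
        if len(self.tokens) > self.max_tokens:
            self.tokens = []  # simplest policy: reset when over the limit

    def save(self, path):
        """Persist between sessions (token IDs only in this sketch)."""
        Path(path).write_bytes(pickle.dumps(self.tokens))

    def load(self, path):
        p = Path(path)
        if p.exists():
            self.tokens = pickle.loads(p.read_bytes())
```

For example, after a first turn processes tokens `[1, 2, 3, 4]`, a follow-up prompt `[1, 2, 3, 4, 5, 6]` would report 4 reused tokens and only `[5, 6]` left to compute, while an edited prompt `[1, 2, 9]` would invalidate the cache back to the 2-token common prefix.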

References:

MLX-LM Implementation

Blaizzy · May 06 '25 22:05