
[RFC]: Add Local Cache Mechanism for Mooncake Store Client

[Open] Shichang-Zhang opened this issue 2 months ago · 7 comments

Changes proposed

Introduction

Hello Mooncake community! We are from the openFuyao community and are actively working on cloud-native LLM inference acceleration. We would like to share our initial proposal, which introduces a hot-data caching mechanism based on the LRU (Least Recently Used) eviction strategy for Mooncake Store. This optimization aims to improve memory utilization, reduce cross-node network access overhead, and enhance the local availability of hot data, thereby improving the performance of single-key and batch get operations in distributed KV Cache storage.

Motivation

The Master Service assigns data slices to storage nodes in the cluster at random. As a result, accessing a given piece of KV Cache data frequently requires cross-node network transfers to fetch each slice. Furthermore, the Mooncake paper reports that over 50% of data is accessed only once, while some hot data can be accessed tens of thousands of times.

This points to two clear optimization directions:

  • Improve local node hit rate for storage slices
  • Improve read efficiency for data slices stored on the local node

This proposal introduces a client-side caching mechanism based on the LRU eviction strategy. It accelerates KV Cache reads by caching hot data, thereby improving TTFT (time to first token).

Goals

  • Performance Improvement: With local hot caching enabled and a local slice hit rate of 30% or more, TTFT is expected to improve by about 10% over the disabled baseline
  • Local Hit Rate Improvement: Under typical workloads, raise the local slice hit rate to 30%, which is expected to reduce get/batchGet interface latency by 40% or more
  • Interface Compatibility: Keep Mooncake's get/batchGet interfaces unchanged, so that multiple inference engines (including vLLM and SGLang) can use KV Cache read acceleration

Non-Goals

  • No optimization of the SSD storage layer
  • No Master Service refactoring or changes to metadata descriptor structs

Proposal

Architecture

This section provides an overall architectural view of the Mooncake Store acceleration solution.

Logical Architecture View

This proposal enhances Mooncake's storage strategy:

[Figure: logical architecture view of the enhanced Mooncake Store]

Logical Architecture Description:

  • Inference Engine Layer: Maintain interface compatibility, no modifications required
  • Mooncake Client: Add the hot cache module (highlighted in green) and optimize the transfer-submission logic
  • Transport Layer: Process local and remote transfers in parallel
  • Storage Layer: Master Service continues to manage distributed storage; add local hot cache storage (highlighted in green)

Implementation

Feature Overview

The current Mooncake get/batchGet request processing flow is: query metadata → submit transfer task → wait for transfer completion → return data directly.

The optimization introduces a hot data caching mechanism on the Client side:

  • During Client initialization, allocate a dedicated memory region for storing hot slices
  • When a get/batchGet request arrives, check the local cache first
  • After the Transfer Engine completes remote slice reads, asynchronously copy the remote slices into the local cache

Sequence Diagram

[Figure: sequence diagram of the cache-aware Get/BatchGet flow]
  1. Initialization Phase: When the Client is created, initialize the hot cache based on the configuration parameter local_hot_cache_size_ (default 0, meaning disabled)
  2. Metadata Query: When a Get/BatchGet request arrives, first query the Master Service for the storage location information (replica descriptor) of the target KV Cache
  3. Cache Query: Iterate through all slices and query the local cache. On a cache hit, rewrite the replica descriptor to point at the local cache address
  4. Transfer Submission: Submit read requests to the Transfer Submitter. If the target address is on the local node, the LOCAL_MEMCPY transport strategy is selected automatically
  5. Cache Update: After the transfer completes, write the remotely transferred slices into the local hot cache through the asynchronous task handler
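
To make the flow concrete, below is a minimal sketch of how these five steps could fit together on the client side. Everything here is an illustrative stand-in rather than the real Mooncake client API: SliceRead, ClientSketch, and SubmitRemoteReads are hypothetical names, and the cache is reduced to a plain map (the LRU-managed LocalHotCache itself is sketched under Data Structures below).

```cpp
// Hypothetical sketch of the cache-aware Get path (stand-in types only).
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

struct SliceRead {                  // stand-in for a slice's replica descriptor
    std::string key;                // cache key: "{request_key}_{slice_index}"
    std::vector<char> buffer;       // destination buffer for this slice
    bool served_locally = false;    // true => LOCAL_MEMCPY instead of a remote read
};

class ClientSketch {
public:
    // Step 1: the hot cache is sized at construction; 0 disables it.
    explicit ClientSketch(std::size_t local_hot_cache_size)
        : local_hot_cache_size_(local_hot_cache_size) {}

    void Get(std::vector<SliceRead>& slices) {
        // Step 2 (metadata query against the Master Service) is unchanged, omitted.
        for (auto& s : slices) {
            // Step 3: consult the local cache; a hit redirects the read to local memory.
            auto it = cache_.find(s.key);
            if (local_hot_cache_size_ > 0 && it != cache_.end()) {
                s.buffer = it->second;   // served via the LOCAL_MEMCPY fast path
                s.served_locally = true;
            }
        }
        SubmitRemoteReads(slices);       // step 4: remote transfers for the misses
        for (auto& s : slices) {
            // Step 5: publish remotely fetched slices to the cache. In the proposal
            // this runs on LocalHotCacheHandler's asynchronous task queue so the
            // Get path never blocks on cache maintenance; it is inlined here.
            if (local_hot_cache_size_ > 0 && !s.served_locally) {
                cache_[s.key] = s.buffer;
            }
        }
    }

private:
    void SubmitRemoteReads(std::vector<SliceRead>& slices) {
        for (auto& s : slices) {
            if (!s.served_locally) { /* submit to the Transfer Engine */ }
        }
    }

    std::size_t local_hot_cache_size_;   // 0 => feature disabled
    std::unordered_map<std::string, std::vector<char>> cache_;  // placeholder store
};
```

Constructing the client with a size of 0 reproduces today's behavior, matching the RFC's default-disabled semantics.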

Data Structures

  • New Client class members:

    • hot_cache_: Hot cache manager instance
    • hot_cache_handler_: Hot cache asynchronous task handler instance
  • LocalHotCache class: Hot cache management class that owns the cache lifecycle; it contains:

    • total_size_bytes_: Hot cache size limit (in bytes)
    • blocks_: Hot cache memory block list storing the actual cached slices
    • lru_queue_: Doubly linked list maintaining LRU eviction order
    • key_to_lru_it_: Cache key-value mapping table, where key is {request_key}_{slice_index} and value is LRU list node iterator
    • lru_mutex_: Mutex for concurrent access protection
  • HotMemBlock struct: Hot cache data block structure

    • address_: Memory address pointer
    • size_: Data block size (in bytes)
  • LocalHotCacheHandler class: Hot cache asynchronous task handler

    • Asynchronously updates the hot cache with slices transferred by the Transfer Engine after a Get/BatchGet read completes
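
To tie the members above together, here is a minimal, self-contained C++ sketch of LocalHotCache. It is a hypothetical illustration under the RFC's naming, not the proposed implementation: for brevity it folds the blocks_ storage into the LRU nodes and copies slice data on both Put and Get.

```cpp
// Illustrative LocalHotCache sketch (hypothetical; simplified from the RFC's
// layout by storing HotMemBlock inside the LRU nodes rather than a separate
// blocks_ list).
#include <cstddef>
#include <cstring>
#include <list>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hot cache data block: owns a copy of one cached slice.
struct HotMemBlock {
    std::unique_ptr<char[]> address_;  // memory address (owned copy)
    std::size_t size_ = 0;             // data block size in bytes
};

class LocalHotCache {
public:
    explicit LocalHotCache(std::size_t total_size_bytes)
        : total_size_bytes_(total_size_bytes) {}

    // Cache a copy of `data` under `key` ("{request_key}_{slice_index}"),
    // evicting least-recently-used slices until the new one fits.
    bool Put(const std::string& key, const void* data, std::size_t size) {
        if (size == 0 || size > total_size_bytes_) return false;
        std::lock_guard<std::mutex> lock(lru_mutex_);
        if (key_to_lru_it_.count(key) > 0) return true;  // already cached
        while (used_bytes_ + size > total_size_bytes_) {
            const auto& victim = lru_queue_.back();      // least recently used
            used_bytes_ -= victim.second.size_;
            key_to_lru_it_.erase(victim.first);
            lru_queue_.pop_back();
        }
        HotMemBlock block{std::make_unique<char[]>(size), size};
        std::memcpy(block.address_.get(), data, size);
        lru_queue_.emplace_front(key, std::move(block));
        key_to_lru_it_[key] = lru_queue_.begin();
        used_bytes_ += size;
        return true;
    }

    // On a hit, copy the cached slice into `out` and mark the key most
    // recently used; on a miss, return false so the caller reads remotely.
    bool Get(const std::string& key, void* out, std::size_t out_capacity) {
        std::lock_guard<std::mutex> lock(lru_mutex_);
        auto it = key_to_lru_it_.find(key);
        if (it == key_to_lru_it_.end()) return false;
        const HotMemBlock& block = it->second->second;
        if (block.size_ > out_capacity) return false;
        std::memcpy(out, block.address_.get(), block.size_);
        lru_queue_.splice(lru_queue_.begin(), lru_queue_, it->second);  // touch
        return true;
    }

private:
    using Entry = std::pair<std::string, HotMemBlock>;

    std::size_t total_size_bytes_;   // hot cache size limit (bytes)
    std::size_t used_bytes_ = 0;     // bytes currently cached
    std::list<Entry> lru_queue_;     // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> key_to_lru_it_;
    std::mutex lru_mutex_;           // protects all of the above
};
```

In the actual proposal, Put would be invoked from LocalHotCacheHandler's asynchronous task queue after the Transfer Engine completes, so cache population never adds latency to the read path.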


Shichang-Zhang · Nov 15 '25 05:11

Good idea. LGTM. Looking forward to your PR! Local caching can improve performance, and we should consider making this feature optional for the generic client.

stmatengss · Nov 17 '25 16:11

Another idea: it seems we should support two kinds of clients. One is a standalone client without a master; the other is a distributed deployment with many clients and a master.

stmatengss · Nov 17 '25 16:11

Great idea! I have a few more questions:

  1. The configuration of the local_hot_cache_size_ value seems to strongly depend on personal experience and application scenarios. Is this configuration value necessary?
  2. Is the eviction policy for the local hot cache the same as that for the non-hot cache? @Shichang-Zhang

Keithwwa · Nov 18 '25 02:11

> Great idea! I have a few more questions:
>
>   1. The configuration of the local_hot_cache_size_ value seems to strongly depend on personal experience and application scenarios. Is this configuration value necessary?
>   2. Is the eviction policy for the local hot cache the same as that for the non-hot cache? @Shichang-Zhang

Thanks for your comments!

  1. Yes. The configuration of local_hot_cache_size_ strongly depends on the inference scenario. We expect to achieve an optimal balance, where the benefit from reduced cross-node data transfer outweighs the extra local cache memory cost. local_hot_cache_size_ is configurable, and users can disable the feature simply by setting it to 0. We are currently running tests across several scenarios and will detail the configuration recommendations, along with the test scenarios, in the PR.
  2. Yes. Currently, the eviction policy for the local hot cache is LRU. There are further optimizations for the non-hot-cache LRU eviction, such as soft-pin, but the two are not directly related.

Shichang-Zhang · Nov 18 '25 10:11

> Good idea. LGTM. Looking forward to your PR! Local caching can improve performance, and we should consider making this feature optional for the generic client.

Appreciate the affirmation! I've completed part of the implementation and performance verification. The PR will be raised in the next few days, and I'll link it to this issue once submitted.

Shichang-Zhang · Nov 18 '25 10:11

Interesting idea!

Mag-FelixFelicis · Nov 19 '25 06:11

> > Good idea. LGTM. Looking forward to your PR! Local caching can improve performance, and we should consider making this feature optional for the generic client.
>
> Appreciate the affirmation! I've completed part of the implementation and performance verification. The PR will be raised in the next few days, and I'll link it to this issue once submitted.

What good news!

NUABO · Dec 02 '25 08:12