
[RFC]: [Store] KVCache offloading to SSD in DFS

Open SgtPepperr opened this issue 5 months ago • 7 comments

Changes proposed

As mentioned in previous issues #171 and #333, offloading the KV cache to SSDs to support Mooncake's multi-level caching mechanism lets us further improve the KV cache reuse rate and address the limited DRAM space in certain scenarios.

We have now implemented Version 1 of KV cache offloading in #437, with the following mechanisms:

  • Client-side persistence: We offload and store the KV cache on DFS (3FS) to facilitate unified file synchronization across nodes. All read/write/query operations on KV cache objects are performed entirely on the client side, with the master node remaining unaware of them. The mapping from keys to KV cache objects in the file system is maintained by a fixed indexing scheme in which each file corresponds to one KV cache object (the filename serves as the key).
  • POSIX read/write: Currently, all file I/O operations use POSIX interfaces. For put/batchput, we submit a persistence request to the thread pool only after a successful in-memory write, without further verification that the persistent write succeeded. (If the write fails, the file is automatically deleted to prevent it from being indexed by other instances.) For get, synchronous reads are used, while batchget employs asynchronous batch reads to improve throughput.
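
To make the client-side flow above concrete, here is a minimal sketch in C++ of the filename-as-key mapping and the write-then-clean-up-on-failure path. The helper names are hypothetical, not Mooncake's actual code:

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <string_view>

namespace fs = std::filesystem;

// Hypothetical helper: each KV cache object maps to one file whose
// name encodes the key, so no separate index structure is needed.
fs::path KeyToPath(const fs::path& dfs_root, std::string_view key) {
    return dfs_root / std::string(key);  // filename serves as the key
}

// Sketch of the persistence step: the in-memory write has already
// succeeded, and this request was submitted to a thread pool.
void PersistObject(const fs::path& dfs_root, std::string_view key,
                   const std::string& payload) {
    const fs::path path = KeyToPath(dfs_root, key);
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
    if (!out) {
        // On failure, delete the file so other instances never index
        // a partially written object.
        out.close();
        fs::remove(path);
    }
}
```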

Future To-Do List

  1. Native 3FS Interface (Merged) Since the ultimate goal is to support this persistence feature on 3FS, and the current POSIX implementation (via FUSE) still impacts I/O performance, we plan to introduce a 3FS-native plugin interface to further optimize file read performance for get/batchget.

  2. Master-Managed KV Cache on SSD (Merged) The current implementation manages the SSD KV cache on the client side, with metadata synchronization handled by DFS (the master remains unaware). While this approach ensures loose coupling, the lack of centralized management introduces consistency and performance issues. Future plans include migrating KV cache metadata to the master, leveraging an extended replica mechanism to support both memory and disk modes (see the sketch after this list). Benefits include:

    • Reduced query latency: Currently, query/exist operations require filesystem access, incurring high overhead for large datasets. Moving metadata to the master enables single-RPC lookups for SSD/memory status.
    • Consistent behavior: Ensures alignment with memory semantics for operations like removeAll and tearDownAll.
    • Race condition mitigation: Resolves issues like "remove-before-write" through centralized coordination.
  3. File Eviction Mechanism (WIP) Currently, file deletion relies on manual user calls (remove/removeAll) or admin intervention. Without automatic eviction, long-running clusters risk storage bloat. Future versions will introduce monitoring and auto-eviction policies.

  4. Master-Triggered Eviction & Persistence (WIP) Presently, every successful put triggers persistence, effectively backing up KV cache entries. We aim to shift persistence to the master’s eviction phase, where evicted data is written to SSDs. Challenges include:

    • The master currently handles only metadata, not data flow.
    • Data distribution across nodes complicates persistence during eviction.
      A well-designed solution will be explored in future iterations.
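
To illustrate the replica extension described in item 2, here is a minimal sketch, with hypothetical names rather than Mooncake's actual types, of master-side metadata in which a replica may live in memory or on disk, so a single RPC can answer both statuses:

```cpp
#include <string>
#include <vector>

// Hypothetical metadata the master would hold once KV cache metadata
// migrates there: one replica list per key, where each replica is
// either in memory or on disk (DFS). Names are illustrative only.
enum class ReplicaType { kMemory, kDisk };

struct ReplicaDescriptor {
    ReplicaType type;
    std::string location;  // segment address for memory, file path for disk
};

struct ObjectMetadata {
    std::vector<ReplicaDescriptor> replicas;
};

// With the replica list on the master, exist/query becomes a single
// RPC instead of a filesystem access: the master simply inspects the
// descriptors it already holds.
bool HasDiskReplica(const ObjectMetadata& meta) {
    for (const auto& r : meta.replicas) {
        if (r.type == ReplicaType::kDisk) return true;
    }
    return false;
}
```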

We welcome feedback and suggestions on this design and implementation.

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues and read the documentation

SgtPepperr · Jul 02 '25 06:07

Regarding "leveraging an extended replica mechanism to support both memory and disk modes": could you describe concretely what this will look like in the future? Does it conflict with the current persistence scheme?

tianlang-wq · Sep 22 '25 01:09

Regarding "leveraging an extended replica mechanism to support both memory and disk modes": could you describe concretely what this will look like in the future? Does it conflict with the current persistence scheme?

This design has been implemented and merged into the main branch. Concretely, the replica type was extended so that it now supports both memory replicas and disk replicas. It does not conflict with the current persistence scheme.

SgtPepperr · Sep 25 '25 08:09

Suppose the replica count is 2. Can I control it so that one copy is kept in main memory and one on disk, where the disk is provided via the approach implemented in https://github.com/kvcache-ai/Mooncake/pull/793?

tianlang-wq · Sep 25 '25 08:09

DFS persistence and NVMe-oF are two separate kinds of support; everything I describe here ignores the NVMe-oF implementation in #793. If the persistence feature is enabled, one replica is stored on disk, and the number of in-memory replicas is determined by a parameter specified at initialization.
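
As a minimal sketch of the rule described above (names are assumptions, not the actual Mooncake API): the in-memory replica count is fixed at initialization, and enabling persistence adds exactly one disk replica on top of it.

```cpp
// Hypothetical client-side configuration reflecting the rule above:
// the in-memory replica count is chosen at init time, and enabling
// persistence adds exactly one disk (DFS) replica on top of it.
struct StoreConfig {
    int memory_replica_num = 1;       // specified at initialization
    bool enable_persistence = false;  // adds one DFS replica when true
};

int TotalReplicas(const StoreConfig& cfg) {
    return cfg.memory_replica_num + (cfg.enable_persistence ? 1 : 0);
}
```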

SgtPepperr · Sep 25 '25 08:09

Ah, understood. Looking forward to the release!

tianlang-wq · Sep 25 '25 08:09

DFS persistence and NVMe-oF are two separate kinds of support; everything I describe here ignores the NVMe-oF implementation in #793. If the persistence feature is enabled, one replica is stored on disk, and the number of in-memory replicas is determined by a parameter specified at initialization.

DFS and NVMe-oF are two separate implementations. My understanding is as follows: 1. The current KV cache offload is triggered on the master service. 2. For DFS, the write is executed directly on the master node; for NVMe-oF, the data has to be transferred to the corresponding client, which completes the write to its SSD. Is my understanding correct? @SgtPepperr

yejj710 · Oct 21 '25 03:10

DFS persistence and NVMe-oF are two separate kinds of support; everything I describe here ignores the NVMe-oF implementation in #793. If the persistence feature is enabled, one replica is stored on disk, and the number of in-memory replicas is determined by a parameter specified at initialization.

DFS and NVMe-oF are two separate implementations. My understanding is as follows: 1. The current KV cache offload is triggered on the master service. 2. For DFS, the write is executed directly on the master node; for NVMe-oF, the data has to be transferred to the corresponding client, which completes the write to its SSD. Is my understanding correct? @SgtPepperr

The current DFS tiered-cache implementation is write-through, not write-back. Concretely, when a client performs a put, it writes one copy of the data into the memory pool and asynchronously writes a backup to DFS, so file writes are performed on the client. However, the metadata for both the memory and file copies is kept in the master service. When the memory pool runs out of space, the master initiates eviction and frees the in-memory data, while still retaining the metadata for the KV cache on DFS.
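
A minimal sketch of the write-through bookkeeping described above, using hypothetical names rather than Mooncake's actual types: the master tracks metadata for both the memory and DFS copies, and eviction drops only the memory side.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical master-side bookkeeping. Names are illustrative only.
struct KvMetadata {
    bool in_memory = false;  // set when the client writes the memory pool
    bool on_dfs = false;     // set for the asynchronous DFS backup
};

class MasterIndex {
public:
    // Write-through put: the client wrote the memory pool and scheduled
    // the DFS backup, so both locations are recorded.
    void RecordPut(const std::string& key) {
        auto& meta = index_[key];
        meta.in_memory = true;
        meta.on_dfs = true;  // async backup, recorded optimistically
    }

    // Eviction under memory pressure: free the memory replica but keep
    // the DFS metadata so the object stays retrievable from disk.
    void Evict(const std::string& key) {
        auto it = index_.find(key);
        if (it != index_.end()) it->second.in_memory = false;
    }

private:
    std::unordered_map<std::string, KvMetadata> index_;
};
```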

SgtPepperr · Oct 21 '25 11:10