[RFC] [Store] support Mooncake file eviction in DFS/3FS
Motivation
As mentioned in previous issues #578 and #924, Mooncake already supports offloading KVCache to SSD via 3FS. However, there is currently no mechanism for file cleanup, so we would like to share our initial proposal for achieving it.
Proposed Change
The solution is mainly divided into two parts:
File Storage Monitor
Provide SSD monitoring-related metrics and distinguish them from memory metrics. Based on the deployment topology of mooncake_master and the DFS, two scenarios can be identified:
1. The node where mooncake_master is deployed has the DFS mounted, allowing direct access to the DFS to monitor its storage usage.
2. mooncake_master is deployed independently and cannot directly access the DFS. In this case, storage monitoring relies on client cooperation. I believe there are two main implementation approaches:
- Introduce a disk-specific allocator, similar to the memory-management mechanism, where storage usage is refreshed via the allocator (such as `CachelibAllocator` or `OffsetAllocator`) during `PutStart`.
- Rely on an event-driven approach during `PutStart` and `PutEnd` (a minimal sketch of this reporting path follows the list).
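To make the event-driven option more concrete, here is a minimal sketch under some assumptions: the client queries the usage of the DFS mount point with `statvfs` around `PutStart` / `PutEnd` and forwards it to mooncake_master. `MasterClient::ReportDfsUsage`, `OnPutEvent`, and the mount path are illustrative names, not existing Mooncake APIs.

```cpp
// Sketch: client-side DFS usage reporting piggybacked on Put events.
#include <sys/statvfs.h>
#include <cstdint>
#include <iostream>
#include <string>

struct DfsUsage {
    uint64_t total_bytes;
    uint64_t used_bytes;
};

// Query usage of the DFS mount point with statvfs(); a real client could
// cache this value and only refresh it around PutStart / PutEnd.
bool QueryDfsUsage(const std::string& mount_point, DfsUsage& out) {
    struct statvfs vfs{};
    if (statvfs(mount_point.c_str(), &vfs) != 0) return false;
    out.total_bytes = static_cast<uint64_t>(vfs.f_blocks) * vfs.f_frsize;
    out.used_bytes =
        out.total_bytes - static_cast<uint64_t>(vfs.f_bavail) * vfs.f_frsize;
    return true;
}

// Hypothetical master-side stub reached via RPC from the client.
class MasterClient {
public:
    void ReportDfsUsage(const DfsUsage& usage) {
        // Placeholder for the real RPC call.
        std::cout << "report used=" << usage.used_bytes
                  << " total=" << usage.total_bytes << "\n";
    }
};

// Hook invoked by the client inside PutStart / PutEnd.
void OnPutEvent(MasterClient& master, const std::string& dfs_mount) {
    DfsUsage usage{};
    if (QueryDfsUsage(dfs_mount, usage)) {
        master.ReportDfsUsage(usage);  // event-driven: rides on Put traffic
    }
}

int main() {
    MasterClient master;
    OnPutEvent(master, "/mnt/3fs");  // assumed mount point, for illustration
    return 0;
}
```

Because the check piggybacks on existing `Put` traffic, the master itself never needs access to the DFS mount in this scenario.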
Eviction Strategy
A simple lease mechanism can be used to approximate an LRU policy (removing the soft-pin mechanism), prioritizing the eviction of expired KVCache. The lease TTL can be configured to one hour or longer (via a user-configurable parameter). Leases should be extended in two scenarios:
- During a lookup, the required KVCache is found on DFS and loaded into memory.
- Upon the initial write of KVCache to disk (triggered by a client `Put` or by memory-driven eviction to DFS).
Given that SSD read/write throughput and latency are far worse (over 100x) than DRAM, it is necessary to reduce both the monitoring frequency and the eviction frequency. A feasible approach is to use a higher eviction ratio; for instance, each eviction cycle could free 30% of the target space.
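As an illustration of the lease mechanism and the batched eviction ratio described above, here is a minimal sketch. `DfsEvictor`, `ExtendLease`, and `EvictExpired` are hypothetical names, and the TTL and eviction ratio correspond to the user-configurable parameters.

```cpp
// Sketch: lease-based approximate LRU with batched eviction.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

struct FileLease {
    Clock::time_point expires_at;
    uint64_t size_bytes;
};

class DfsEvictor {
public:
    DfsEvictor(std::chrono::seconds ttl, double evict_ratio)
        : ttl_(ttl), evict_ratio_(evict_ratio) {}

    // Called when a key is first written to DFS, or when a lookup hits DFS
    // and the KVCache is loaded back into memory.
    void ExtendLease(const std::string& key, uint64_t size_bytes) {
        leases_[key] = {Clock::now() + ttl_, size_bytes};
    }

    // One eviction cycle: drop expired leases first (approximate LRU),
    // until roughly evict_ratio_ of target_bytes has been reclaimed.
    std::vector<std::string> EvictExpired(uint64_t target_bytes) {
        const uint64_t goal =
            static_cast<uint64_t>(target_bytes * evict_ratio_);
        const auto now = Clock::now();

        // Collect expired entries, oldest lease first.
        std::vector<std::pair<std::string, FileLease>> expired;
        for (const auto& [key, lease] : leases_)
            if (lease.expires_at <= now) expired.emplace_back(key, lease);
        std::sort(expired.begin(), expired.end(),
                  [](const auto& a, const auto& b) {
                      return a.second.expires_at < b.second.expires_at;
                  });

        std::vector<std::string> victims;
        uint64_t reclaimed = 0;
        for (const auto& [key, lease] : expired) {
            if (reclaimed >= goal) break;
            reclaimed += lease.size_bytes;
            victims.push_back(key);
            leases_.erase(key);
        }
        return victims;  // caller deletes these files from DFS
    }

private:
    std::chrono::seconds ttl_;
    double evict_ratio_;
    std::unordered_map<std::string, FileLease> leases_;
};
```

For example, the master could construct `DfsEvictor evictor(std::chrono::hours(1), 0.3);` and run `EvictExpired` only when the storage monitor reports usage above a high-water mark, keeping both monitoring and eviction infrequent.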
Further
The discussion above assumes a third-party distributed store; for more complex situations (such as the Local Store / Global Store mentioned in #171), a more comprehensive strategy needs to be designed.
Discussion is welcome!
Thank you very much for your thoughts and contributions regarding the mooncake DFS file-eviction implementation. Let me also share a few of my own ideas.
Besides the file-eviction mechanism you already described, there are two additional aspects that are relevant to the overall management of a tiered-cache file system:
- The client-side `remove` / `removeAll` APIs. Today these calls do not actually delete any file data in the DFS scenario; they only drop the metadata kept by the master. Future PRs will probably need to add real data deletion. (We will have to decide whether the client or the master should perform the actual file deletion.)
- Cleaning up the file-system data when the cluster crashes or shuts down. For example, if the master crashes, all of its metadata is lost, yet every file's data remains on disk. We may want a way to purge that leftover data (e.g., inside a signal handler); a minimal sketch follows this list. In the 3FS case, however, we have observed that deleting a huge number of files can take a very long time.
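For the second point, here is a minimal sketch of a clean-shutdown purge, assuming all of the cluster's file data lives under one dedicated directory (the path below is made up). The signal handler only sets a flag; the deletion runs in normal context, since filesystem calls are not async-signal-safe.

```cpp
// Sketch: purge leftover DFS data on SIGTERM/SIGINT before the master exits.
#include <atomic>
#include <chrono>
#include <csignal>
#include <filesystem>
#include <iostream>
#include <thread>

static std::atomic<bool> g_shutdown{false};

extern "C" void HandleSignal(int) { g_shutdown.store(true); }

int main() {
    const std::filesystem::path storage_root = "/mnt/3fs/mooncake";  // assumed
    std::signal(SIGTERM, HandleSignal);
    std::signal(SIGINT, HandleSignal);

    // Stand-in for the master's main loop.
    while (!g_shutdown.load()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }

    // Best-effort purge of leftover file data on clean shutdown.
    std::error_code ec;
    const auto removed = std::filesystem::remove_all(storage_root, ec);
    if (ec) {
        std::cerr << "cleanup failed: " << ec.message() << "\n";
    } else {
        std::cout << "removed " << removed << " entries\n";
    }
    return 0;
}
```

A crash that bypasses the handler (e.g., SIGKILL or power loss) still leaves orphans, so a startup-time sweep of stale directories would likely be needed as well; and as noted above, `remove_all` over a huge 3FS namespace can take a long time.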
As for the file-eviction part itself, the file-storage monitor has two usage patterns:
- Pattern A (stronger constraint: the master must be able to access the DFS mount point) is stricter, but it is easier to implement because we can largely reuse the existing memory-eviction code and asynchronously delete both metadata and file data in one shot (see the sketch after this list).
- Pattern B (almost no topological requirement) lets the client drive the eviction. I noticed that your description of this second pattern was rather brief; perhaps you could elaborate later on how a client-driven file-eviction could actually be implemented?
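A minimal sketch of Pattern A, assuming the master can see the DFS mount point: it polls usage with `std::filesystem::space` and calls into an eviction routine when a high-water mark is crossed. `EvictFiles`, the mount path, and the thresholds are placeholders, not existing Mooncake code.

```cpp
// Sketch: master-side DFS storage monitor driving eviction directly.
#include <chrono>
#include <filesystem>
#include <iostream>
#include <thread>

namespace fs = std::filesystem;

// Placeholder for "delete metadata and file data in one shot".
void EvictFiles(double current_usage_ratio) {
    std::cout << "evicting, usage=" << current_usage_ratio << "\n";
}

void MonitorLoop(const fs::path& dfs_mount, double high_watermark,
                 std::chrono::seconds interval) {
    for (;;) {  // runs until the master shuts down
        std::error_code ec;
        const fs::space_info info = fs::space(dfs_mount, ec);
        if (!ec && info.capacity > 0) {
            const double used_ratio =
                1.0 - static_cast<double>(info.available) / info.capacity;
            if (used_ratio >= high_watermark) {
                EvictFiles(used_ratio);  // async metadata + file deletion
            }
        }
        std::this_thread::sleep_for(interval);  // low polling frequency
    }
}

int main() {
    // Assumed mount point and knobs, for illustration only.
    MonitorLoop("/mnt/3fs", /*high_watermark=*/0.85, std::chrono::seconds(60));
    return 0;
}
```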
My intended meaning is that the master relies on the client to actively report DFS usage, hence describing it as event-driven. I realize my previous description wasn't accurate enough, and I'll update my RFC later. Here's my new thinking:
- For file eviction, simply updating metadata is insufficient; file cleanup is also required, which differs from memory management.
- Data is written to DFS within the `PutToLocalFile` call. During this process, the client checks DFS usage and reports it to the master via RPC so the master can determine whether file eviction should be triggered. If the client performs the eviction, it notifies the master via RPC upon completion, after which the master updates the metadata. The entire process is asynchronous and does not block the in-memory `Put` (see the sketch below).
- The benefit of this client-driven, master-managed approach is that it can later be extended to support a hybrid architecture combining local_store and DFS.
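To make the client-driven, master-managed flow concrete, here is a minimal sketch. `ReportDfsUsage`, `NotifyEvicted`, and the stub types are hypothetical, not part of the current master RPC surface; a real implementation would hang off `PutToLocalFile`.

```cpp
// Sketch: report usage -> master decides -> client evicts -> master updates
// metadata, all off the critical path of the in-memory Put.
#include <cstdint>
#include <future>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical master-side interface, reached via RPC from the client.
struct MasterRpcStub {
    // Client reports DFS usage; master answers with keys it wants evicted
    // (empty when no eviction is needed). Stubbed out for illustration.
    std::vector<std::string> ReportDfsUsage(uint64_t used, uint64_t total) {
        return used * 10 >= total * 9 ? std::vector<std::string>{"old_key"}
                                      : std::vector<std::string>{};
    }
    // Client confirms deletion so the master can drop the file metadata.
    void NotifyEvicted(const std::vector<std::string>& keys) {
        std::cout << "master dropped metadata for " << keys.size() << " keys\n";
    }
};

// Stand-in for the client's view of the DFS mount.
struct DfsClient {
    void DeleteFile(const std::string& key) {
        std::cout << "deleted " << key << " from DFS\n";
    }
    uint64_t UsedBytes() { return 95; }
    uint64_t TotalBytes() { return 100; }
};

// Called after PutToLocalFile has written the data to DFS.
// Everything below runs asynchronously so the in-memory Put is not blocked.
std::future<void> AfterPutToLocalFile(MasterRpcStub& master, DfsClient& dfs) {
    return std::async(std::launch::async, [&master, &dfs] {
        const auto victims =
            master.ReportDfsUsage(dfs.UsedBytes(), dfs.TotalBytes());
        if (victims.empty()) return;
        for (const auto& key : victims) {
            dfs.DeleteFile(key);  // client performs the file cleanup
        }
        master.NotifyEvicted(victims);  // master then updates its metadata
    });
}

int main() {
    MasterRpcStub master;
    DfsClient dfs;
    AfterPutToLocalFile(master, dfs).wait();
    return 0;
}
```

Keeping the report-evict-notify chain on a separate task is what keeps the in-memory `Put` path unblocked, and the same stub interface could later be backed by a local_store instead of a DFS.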