Mooncake
[mooncake-store]: prevent orphaned bucket data files from leaking dis…
Description
Add cleanup mechanism for bucket data files that are left orphaned when the process crashes between writing the data file and its metadata file.
Problem:
- BucketStorageBackend writes data file first, then .meta file
- If crash/failure occurs between these steps, data file is orphaned
- Init() only scans .meta files, never discovers orphans
- Orphaned files accumulate indefinitely, leaking disk space
Solution:
- Add immediate cleanup in WriteBucket() when metadata write fails
- Add orphan detection during Init() to clean up past crashes
- Make cleanup opt-in (disabled by default) via constructor flag
- Identify orphans by: numeric filename, no extension, no .meta pair
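The identification rule in the last bullet can be sketched as follows. This is a minimal illustration, not the actual Mooncake code; the helper name `IsOrphanCandidate` and the use of `std::filesystem` are assumptions:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// A data file is an orphan candidate when its filename is purely numeric
// (a bucket ID), it has no extension, and no sibling "<name>.meta" exists.
// Hypothetical helper name, not the actual Mooncake function.
bool IsOrphanCandidate(const fs::path& file) {
    const std::string name = file.filename().string();
    if (name.empty() || file.has_extension()) return false;
    const bool numeric =
        std::all_of(name.begin(), name.end(),
                    [](unsigned char c) { return std::isdigit(c) != 0; });
    if (!numeric) return false;
    fs::path meta = file;
    meta += ".meta";              // metadata pair: "<bucket_id>.meta"
    return !fs::exists(meta);     // no metadata file => orphan candidate
}
```

Note that all three conditions must hold: a non-numeric name or any extension disqualifies the file, so `.meta` files themselves are never candidates.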
Changes:
- BucketStorageBackend: Add enable_orphan_cleanup parameter (default: false)
- WriteBucket(): Clean up data file if StoreBucketMetadata() fails
- Init(): Scan for and remove orphaned data files when enabled
- Add comprehensive test coverage for orphan cleanup
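The immediate-cleanup path in `WriteBucket()` can be illustrated like this. This is a simplified sketch; the function signatures, the `simulate_failure` hook, and the stand-in helpers are assumptions for illustration, not the actual Mooncake API:

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Simplified stand-ins for the real backend pieces (names are illustrative).
bool WriteDataFile(const fs::path& p, const std::string& data) {
    std::ofstream out(p, std::ios::binary);
    return static_cast<bool>(out << data);
}

bool StoreBucketMetadata(const fs::path& data_path, bool simulate_failure) {
    if (simulate_failure) return false;  // e.g. disk full, I/O error
    std::ofstream out(data_path.string() + ".meta");
    return static_cast<bool>(out << "ok");
}

// WriteBucket: write the data file first, then the metadata; if the
// metadata write fails, remove the just-written data file so it cannot
// be left behind as an orphan.
bool WriteBucket(const fs::path& data_path, const std::string& data,
                 bool simulate_meta_failure) {
    if (!WriteDataFile(data_path, data)) return false;
    if (!StoreBucketMetadata(data_path, simulate_meta_failure)) {
        std::error_code ec;
        fs::remove(data_path, ec);       // best-effort immediate cleanup
        return false;
    }
    return true;
}
```

A crash between the two writes still leaves an orphan behind, which is why the PR also scans for orphans during `Init()`.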
Issue link: https://github.com/kvcache-ai/Mooncake/issues/1139
Type of Change
- Types
- [x] Bug fix
- [ ] New feature
- [ ] Transfer Engine
- [x] Mooncake Store
- [ ] Mooncake EP
- [ ] Integration
- [ ] P2P Store
- [ ] Python Wheel
- [ ] Breaking change
- [ ] CI/CD
- [ ] Documentation update
- [ ] Other
How Has This Been Tested?
- OrphanedDataFileCleanup: Creates valid bucket, manually orphans files, verifies cleanup
- Tests cleanup enabled vs disabled (default behavior)
- Validates valid buckets preserved during orphan cleanup
- Tests file count reduction after cleanup
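The scenario those tests cover follows roughly this shape. This is a plain-assert sketch of the scenario, not the repository's actual test code; the directory layout and names are illustrative:

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <vector>

namespace fs = std::filesystem;

// Reproduces the test scenario: one valid bucket (data + .meta pair) and
// one manually orphaned data file; after the cleanup pass, the valid
// bucket must survive and the orphan must be gone.
bool RunOrphanCleanupScenario() {
    const fs::path dir = fs::temp_directory_path() / "orphan_cleanup_demo";
    fs::create_directories(dir);
    std::ofstream(dir / "100") << "bucket-data";
    std::ofstream(dir / "100.meta") << "bucket-meta";
    std::ofstream(dir / "200") << "orphaned-data";   // no .meta pair

    // Collect paths first, then delete, so the directory is never
    // mutated while being iterated.
    std::vector<fs::path> files;
    for (const auto& entry : fs::directory_iterator(dir))
        files.push_back(entry.path());
    for (const auto& p : files) {
        if (p.has_extension()) continue;       // skip .meta files
        fs::path meta = p;
        meta += ".meta";
        if (!fs::exists(meta)) fs::remove(p);  // orphan: no metadata pair
    }

    const bool ok = fs::exists(dir / "100") && fs::exists(dir / "100.meta") &&
                    !fs::exists(dir / "200");
    fs::remove_all(dir);
    return ok;
}
```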
Checklist
- [x] I have performed a self-review of my own code.
- [ ] I have updated the documentation.
- [x] I have added tests to prove my changes are effective.
Summary of Changes
Hello @maheshrbapatu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a critical fix to prevent disk space leakage in the mooncake-store by addressing the issue of orphaned bucket data files. Previously, if a system crash occurred between writing a data file and its metadata, the data file would remain on disk indefinitely. The changes implement both immediate cleanup during write failures and a proactive scan during initialization to detect and remove these orphaned files, improving the storage backend's robustness and efficiency.
Highlights
- Orphaned File Prevention: Implemented a mechanism to prevent disk space leaks caused by orphaned bucket data files.
- Crash Consistency: Addresses scenarios where a process crashes between writing a data file and its corresponding metadata file.
- Immediate Cleanup: Adds logic to `WriteBucket()` to remove data files if their metadata write fails.
- Proactive Cleanup: Introduces an orphan detection and cleanup process during `Init()` to handle files from past crashes.
- Opt-in Feature: The orphan cleanup functionality is opt-in, controlled by a new `enable_orphan_cleanup` flag in the `BucketStorageBackend` constructor.
- Identification Criteria: Orphans are identified by having a purely numeric filename, no extension, and no associated `.meta` file.
There is a format check in the CI. Try `clang-format -i FILENAME` to format the code.
- The parameter `enable_orphan_cleanup_` doesn't seem to be necessary.
- In the case of a metadata write failure, we can simply clean up the corresponding data file within the `WriteBucket` function.
- Following the above logic, is it still necessary to use the `.bucket` suffix?
According to the discussion on the Slack thread, there are two possible scenarios: either the data write succeeds but `StoreBucketMetadata` fails (e.g., disk full, I/O error), or the process crashes between these two steps. Either way, it seems we would be left with a fully written data file on disk but no corresponding `.meta` file.
The `WriteBucket` function itself cannot handle the second case.
Should the second case be handled in the init function instead of the BatchOffload function?
I believe this PR does handle that case in the init function — unless I’m misunderstanding something. Let me know if I’ve overlooked anything.
Sorry, I made a mistake — there's only one question left: is it necessary to keep the enable_orphan_cleanup_ parameter?
Hello! Yes, we can remove the `enable_orphan_cleanup_` parameter; it isn't necessary at this point. I added it initially because I was unclear about the bucket files, since they didn't have a `.bucket` extension.

> Should the second case be handled in the init function instead of the BatchOffload function?

As @ykwd already pointed out, we clean it up in init.

> In the case of metadata write failure, we can simply clean up the corresponding data file within the WriteBucket function.

Yes, we can do this, and we are already doing it, as @ykwd mentioned. But if the node crashes after writing the data and before writing the metadata, we still need to clean it up during init.
Documenting these here for the community's reference, since most of the questions were already answered above.