[Feature] Decouple the lifecycle of snapshot and changelog
Search before asking
- [X] I searched in the issues and found nothing similar.
Motivation
Currently, the changelog's lifecycle is bound to the snapshot's: when a snapshot expires, its changelog expires with it. If we want to keep the changelog longer so that a streaming job can reset its consumption position, we also have to keep the snapshot's data files. This is unnecessary and wastes a lot of storage.
In traditional databases, the binlog has its own lifecycle. So in Paimon we could also support decoupling the lifecycle of snapshots and changelogs. This gives users flexibility in how long to keep changelog data. For example, a user can keep only the latest snapshot plus one day of changelog for consumption.
How to implement
The lifecycle is handled in the expiration process. To maintain two different lifecycles, we need two groups of hint files (file pointers) to mark the effective changelog and snapshot ranges:
- `EARLIEST` and `LATEST` mark the effective data files
- `EARLIEST_CHANGELOG` and `LATEST` mark the effective changelog
The `LATEST` should be the same for both, and `EARLIEST_CHANGELOG` is only present when the changelog is enabled.
However, the hint files are not always present, so we need a way to work when they are missing. We therefore introduce the marker files `expire-snapshot-x` and `expire-changelog-x`, written when the corresponding object is expired. The marker files are created first; once both exist, the snapshot metadata can be deleted, and after that the marker files are deleted as well. `EARLIEST` and `EARLIEST_CHANGELOG` are always updated after the marker files are created. So if an `EARLIEST_*` file is missing or inaccurate, we can determine the current effective snapshot and changelog from the `expire-snapshot-x` and `expire-changelog-x` files.
New config
- `changelog.time-retained`
- `changelog.num-retained.min`
- `changelog.num-retained.max`
These are the counterparts of the snapshot retention options (`snapshot.time-retained`, `snapshot.num-retained.min`, `snapshot.num-retained.max`).
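To illustrate how the three options could interact, here is a minimal sketch. It assumes they combine the same way as their snapshot counterparts: always keep at least `min`, never keep more than `max`, and otherwise expire entries older than the retention time. The class and method names are hypothetical and not part of Paimon.

```java
import java.util.List;

// Hypothetical helper, not part of Paimon: decides how many of the oldest
// changelogs may be expired under the three proposed options.
final class ChangelogRetention {

    /**
     * @param commitTimesMillis commit timestamps, oldest first
     * @param nowMillis         current time
     * @param timeRetainedMillis assumed meaning of changelog.time-retained
     * @param numRetainedMin     assumed meaning of changelog.num-retained.min
     * @param numRetainedMax     assumed meaning of changelog.num-retained.max
     * @return number of oldest changelogs that can be expired
     */
    static int countExpirable(
            List<Long> commitTimesMillis,
            long nowMillis,
            long timeRetainedMillis,
            int numRetainedMin,
            int numRetainedMax) {
        int total = commitTimesMillis.size();
        // Never drop below the minimum: at most (total - min) may be expired.
        int maxExpirable = Math.max(0, total - numRetainedMin);
        // Never keep more than the maximum: at least (total - max) must be expired.
        int minExpirable = Math.max(0, total - numRetainedMax);

        int expirable = minExpirable;
        // Beyond the forced ones, expire only entries older than the time limit.
        while (expirable < maxExpirable
                && nowMillis - commitTimesMillis.get(expirable) > timeRetainedMillis) {
            expirable++;
        }
        return expirable;
    }
}
```

With five changelogs committed one second apart and a two-second retention time, the oldest three are expirable unless `changelog.num-retained.min` forces more to be kept.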
Subtask
- [X] Introduce ExpireChangelogImpl to handle changelog expiration
- [X] Integrate InnerStreamScan with the changelog metadata
- [X] Handle the changelog metadata in the orphan file cleaner
- [ ] Support decoupling the lifecycle of delta files
- [ ] Support merging small changelog files
Anything else?
No response
Are you willing to submit a PR?
- [X] I'm willing to submit a PR!
CC @JingsongLi
Thanks @Aitozi for driving this.
Overall, I feel that using hint files is a bit forced. A hint is just a hint and should not have a significant impact.
On the other hand, we could consider introducing a separate changelog directory, quite similar to tags, so that we are not limited to snapshots. We can write a completely separate JSON format to describe changelogs.
When a snapshot expires, if the changelog files do not need to be deleted, create a corresponding changelog-${snapshotId} file in the changelog directory to record the lifecycle of the changelog files, and introduce a new changelog expiration to delete expired changelogs.
In this way, we can also consider merging small files in the future, maintaining the files and their corresponding offsets and lengths in this newly added JSON.
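The snapshot-expiration step suggested above could be sketched roughly as follows. This is only an illustration under assumptions: the directory name `changelog`, the JSON payload, and the class and method names are hypothetical, and the sketch uses the local filesystem rather than Paimon's file IO.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not Paimon's implementation.
final class ChangelogOnExpire {

    /**
     * Called while expiring a snapshot. If its changelog should outlive the
     * snapshot, writes <tablePath>/changelog/changelog-<snapshotId> holding the
     * changelog metadata and returns that path; otherwise returns null.
     */
    static Path retainChangelog(
            Path tablePath, long snapshotId, String changelogMetaJson, boolean keepChangelog)
            throws IOException {
        if (!keepChangelog) {
            // Changelog files are deleted together with the snapshot.
            return null;
        }
        // Assumed directory name; the thread only says "a changelog directory".
        Path dir = tablePath.resolve("changelog");
        Files.createDirectories(dir);
        Path file = dir.resolve("changelog-" + snapshotId);
        Files.write(file, changelogMetaJson.getBytes(StandardCharsets.UTF_8));
        return file;
    }
}
```

A separate changelog expiration would then only need to list this directory and delete the entries that fall outside the changelog retention window.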
Thanks @JingsongLi for your comments. The tag-like solution sounds feasible; I will follow this suggestion.