[Feature] Decouple the lifecycle of snapshot and changelog

Open Aitozi opened this issue 1 year ago • 3 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Motivation

Currently, the changelog's lifecycle is bound to the snapshot's: when a snapshot expires, its changelog expires with it. If we want to keep the changelog longer so that a streaming job can reset its consumption position, we also have to keep the snapshot's data files. This is unnecessary and wastes a lot of storage.

In a traditional database, the binlog has its own lifecycle. In Paimon, we could likewise decouple the lifecycles of snapshots and changelogs. Users could then choose how much changelog data to keep; for example, keep only the latest snapshot plus one day of changelog for consumption.

How to implement

The lifecycle is handled in the expiration process. To maintain two separate lifecycles, we need two groups of hint files to mark the effective changelog and snapshot ranges:

  • EARLIEST and LATEST mark the effective data files
  • EARLIEST_CHANGELOG and LATEST mark the effective changelog

LATEST is shared by both groups, and EARLIEST_CHANGELOG is present only when changelog is enabled.

However, the hint files are not always present, so we need a way to work when one is missing. We therefore introduce marker files, expire-snapshot-x and expire-changelog-x, that are created when the corresponding object expires. The marker file is created first; once both marker files exist, the snapshot metadata can be deleted, and after that the marker files are deleted as well. EARLIEST and EARLIEST_CHANGELOG are always updated after the marker file is created. So if the EARLIEST or EARLIEST_CHANGELOG file is missing or inaccurate, we can still determine the currently effective snapshot and changelog from the expire-snapshot-x and expire-changelog-x files.
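
A rough sketch of the recovery step, under one possible reading of the marker scheme: an id that already has an expire marker is treated as no longer effective, so the earliest effective id is the smallest remaining id without a marker. The class and method names are hypothetical, and plain in-memory sets stand in for directory listings.

```java
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

final class ExpireMarkerSketch {

    /**
     * Returns the smallest id in metadataIds that has no expire marker,
     * or -1 if every listed id is marked (or the set is empty).
     * Assumption: a marked id is already expired for that object type.
     */
    static long earliestEffective(SortedSet<Long> metadataIds, Set<Long> markedIds) {
        for (long id : metadataIds) {
            if (!markedIds.contains(id)) {
                return id;
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        // Snapshot metadata files still present on disk.
        SortedSet<Long> snapshots = new TreeSet<>(Set.of(3L, 4L, 5L, 6L));
        // Ids taken from expire-snapshot-3 and expire-snapshot-4.
        Set<Long> expireSnapshotMarks = Set.of(3L, 4L);
        // Ids taken from expire-changelog-3.
        Set<Long> expireChangelogMarks = Set.of(3L);

        System.out.println("earliest snapshot:  " + earliestEffective(snapshots, expireSnapshotMarks));
        System.out.println("earliest changelog: " + earliestEffective(snapshots, expireChangelogMarks));
    }
}
```

In this example snapshot 4 has lost its data files but its changelog is still effective, which is the decoupled state the proposal aims for.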

New config

  • changelog.time-retained
  • changelog.num-retained.min
  • changelog.num-retained.max

These are the changelog counterparts of the snapshot retention configs.
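
A minimal sketch of how the three options could interact, assuming they follow the same rules as the snapshot retention configs: never keep more than num-retained.max entries, never keep fewer than num-retained.min, and between those bounds expire entries older than time-retained. The `Entry` type and `earliestToKeep` method are hypothetical and not part of Paimon.

```java
import java.util.ArrayList;
import java.util.List;

final class ChangelogRetentionSketch {

    /** Hypothetical changelog entry: its id and commit time in epoch milliseconds. */
    static final class Entry {
        final long id;
        final long commitMillis;

        Entry(long id, long commitMillis) {
            this.id = id;
            this.commitMillis = commitMillis;
        }
    }

    /**
     * Returns the id of the earliest entry to keep; entries with smaller ids may expire.
     * The list must be non-empty and sorted by ascending id.
     */
    static long earliestToKeep(List<Entry> entries, int minRetained, int maxRetained,
                               long timeRetainedMillis, long nowMillis) {
        if (entries.isEmpty()) {
            throw new IllegalArgumentException("no changelog entries");
        }
        if (minRetained < 1 || maxRetained < minRetained) {
            throw new IllegalArgumentException("need 1 <= minRetained <= maxRetained");
        }
        int n = entries.size();
        // Upper bound: everything beyond the newest maxRetained entries expires.
        int first = Math.max(0, n - maxRetained);
        // Lower bound: the newest minRetained entries are never expired.
        int stop = Math.max(0, n - minRetained);
        // Between the bounds, expire entries older than the time limit.
        while (first < stop
                && nowMillis - entries.get(first).commitMillis > timeRetainedMillis) {
            first++;
        }
        return entries.get(first).id;
    }

    public static void main(String[] args) {
        long hour = 3_600_000L;
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            entries.add(new Entry(i + 1, i * hour)); // ids 1..5, committed one hour apart
        }
        // min = 1, max = 4, time-retained = 90 minutes, "now" = commit time of id 5.
        System.out.println(earliestToKeep(entries, 1, 4, 90 * 60_000L, 4 * hour));
    }
}
```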

Subtask

  • [X] Introduce ExpireChangelogImpl to handle changelog expiration
  • [X] Integrate InnerStreamScan with the changelog metadata
  • [X] Handle the changelog metadata in the orphan file cleaner
  • [ ] Support decoupling the lifecycle of delta files
  • [ ] Support merging small changelog files

Anything else?

No response

Are you willing to submit a PR?

  • [X] I'm willing to submit a PR!

Aitozi, Feb 23 '24 11:02

CC @JingsongLi

Aitozi, Feb 23 '24 11:02

Thanks @Aitozi for driving this.

Overall, I feel that relying on hint files is a bit forced. A hint is just a hint and should not have a significant impact.

On the other hand, we could introduce a separate changelog directory, quite similar to tags, so that we are not limited to snapshots. We could write a separate JSON format to describe changelogs.

When a snapshot expires, if its changelog files do not need to be deleted, create a corresponding changelog-${snapshotId} file in the changelog directory to record the lifecycle of those changelog files, and introduce a new changelog expiration process to delete expired changelogs.

This way, we could also consider merging small files in the future, recording the merged files and the corresponding offsets and lengths in this new JSON.
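
The changelog-directory idea above could be sketched roughly as follows, using the local file system in place of the table's storage. The directory layout, the JSON body, and all method names are assumptions for illustration; the thread does not specify the actual format.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ChangelogDirSketch {

    /** Demo helper: creates a temporary table directory with empty snapshot-<id> files. */
    static Path newDemoTable(long... snapshotIds) {
        try {
            Path table = Files.createTempDirectory("table");
            Path snapshotDir = Files.createDirectories(table.resolve("snapshot"));
            for (long id : snapshotIds) {
                Files.createFile(snapshotDir.resolve("snapshot-" + id));
            }
            return table;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Expires snapshot-<id>. If its changelog must outlive it, changelog-<id> is written
     * first so the changelog files stay referenced after the snapshot is gone.
     */
    static void expireSnapshot(Path table, long id, boolean keepChangelog) {
        try {
            if (keepChangelog) {
                Path changelogDir = Files.createDirectories(table.resolve("changelog"));
                // Hypothetical JSON body; a real one would list the changelog files.
                String json = "{\"id\": " + id + "}";
                Files.write(changelogDir.resolve("changelog-" + id),
                        json.getBytes(StandardCharsets.UTF_8));
            }
            Files.deleteIfExists(table.resolve("snapshot").resolve("snapshot-" + id));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Separate changelog expiration: removes changelog-<id> once it is no longer needed. */
    static void expireChangelog(Path table, long id) {
        try {
            Files.deleteIfExists(table.resolve("changelog").resolve("changelog-" + id));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path table = newDemoTable(1L, 2L);
        expireSnapshot(table, 1L, true);
        System.out.println(Files.exists(table.resolve("changelog").resolve("changelog-1")));
        expireChangelog(table, 1L);
    }
}
```

Writing the changelog-<id> file before deleting the snapshot means a crash between the two steps leaves both files in place rather than orphaning the changelog data.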

JingsongLi, Feb 27 '24 13:02

Thanks @JingsongLi for your comments. The tag-like solution sounds feasible; I will follow this suggestion.

Aitozi, Mar 01 '24 03:03