Support Tags in GC Retention Policy

Open guy-har opened this issue 11 months ago • 3 comments

We would like to allow tagging specific commits for retention in the garbage collection (GC) policy, alongside the existing functionality that retains commits at the HEAD of branches. This would enable more granular control over data retention, letting users preserve important commits that are not at a branch HEAD, for reasons such as marking releases, compliance, or significant milestones.

guy-har avatar Apr 02 '24 15:04 guy-har

https://lakefs.slack.com/archives/C016726JLJW/p1712066081097729

guy-har avatar Apr 02 '24 15:04 guy-har

It can be useful to mark experiment and model training runs with tags. This would allow full transparency and traceability of the data.

ion-elgreco avatar Apr 02 '24 18:04 ion-elgreco

[I wrote much the same on Slack]

I think it's trickier than that: it depends on the intent of the user of the tag.

Some tags should be kept until deleted. For instance, you might want to keep the tag version_to_reproduce_bug if you run GC to save space. But if you run GC for compliance then probably not. Meanwhile release_weekly_20230314 is a tag that you will probably want lakeFS to GC, if you're releasing every week.

If we unilaterally change all tags to be retained, we break existing users who expect tags to be GC'ed.

I don't have a good suggestion for this, other than "add another configuration flag [to each tag]".
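
To make the "flag per tag" idea concrete, here is a minimal sketch of how a GC run might pick the commits it protects. Everything in it is hypothetical: the `Tag` fields `retain` and `retention_days`, the `compliance_run` switch, and the `retained_commits` helper are illustration only and are not part of lakeFS. It assumes, as described above, that branch HEADs are always retained, and that tags default to today's behaviour (not retained) so existing users are not broken.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, Optional, Set


@dataclass(frozen=True)
class Tag:
    name: str
    commit_id: str
    created_at: datetime
    # Hypothetical opt-in flag; False keeps the current "tags are GC'ed" behaviour.
    retain: bool = False
    # Hypothetical optional expiry for retained tags; None means "until deleted".
    retention_days: Optional[int] = None


def retained_commits(
    branch_heads: Dict[str, str],
    tags: Iterable[Tag],
    now: datetime,
    compliance_run: bool = False,
) -> Set[str]:
    """Return the commit IDs that a GC run should treat as protected."""
    # Branch HEADs are always protected.
    roots = set(branch_heads.values())
    if compliance_run:
        # Assumption: a compliance-driven GC ignores tag retention entirely.
        return roots
    for tag in tags:
        if not tag.retain:
            continue  # default: the tag does not protect its commit
        if tag.retention_days is not None:
            if now - tag.created_at > timedelta(days=tag.retention_days):
                continue  # the tag's own retention window has expired
        roots.add(tag.commit_id)
    return roots


if __name__ == "__main__":
    now = datetime(2024, 4, 3)
    heads = {"main": "c9"}
    tags = [
        Tag("version_to_reproduce_bug", "c3", datetime(2023, 1, 1), retain=True),
        Tag("release_weekly_20230314", "c5", datetime(2023, 3, 14)),
    ]
    print(sorted(retained_commits(heads, tags, now)))  # ['c3', 'c9']
```

With these inputs the bug-reproduction tag keeps its commit during a space-saving run, the weekly release tag does not, and a run with `compliance_run=True` protects only the branch HEAD. Whether a compliance run should really override the per-tag flag is exactly the open design question.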

Why are branches easier than tags?

I think that a major difference between tags and branch heads is that branch heads are essentially mutable while tags are essentially immutable. So it is easier to assume intent behind branches during a GC: if the branch head has moved away from an old version, then it is clearly old; the branch head literally does not point at any objects eligible for collection. In contrast, there is no good clear default for whether or not to collect a tag. And there can be very good reasons for keeping a tag but not its contents.

Related issues

#5058 is essentially the reverse of this issue: it suggests eliminating even dead branches. But:

  • IF we had good logic for removing branches, and
  • IF we could apply that logic to tags, then
  • SOME but not all objections to retaining objects on tagged revisions might go away.

arielshaqed avatar Apr 03 '24 08:04 arielshaqed