Add identifier to branches and tags

Open arielshaqed opened this issue 5 months ago • 1 comments

Branches (and tags) are human-readable names for commit references, and are great for users! But because lakeFS provides CRUD operations for them, a name can be deleted and later re-created, which makes it hard to write applications that track them. Add an additional identifier to branches and to tags that identifies "the" branch/tag consistently across deletes (and, in future, renames): a branch re-created under an old name gets a different identifier. An explicit intent is to allow applications to compare commits on a branch by their generation.

Usage:

  • Return the identifier in a new field from GetBranch / GetTag.
  • When creating a new branch/tag, give it a new identifier value - a nanoID (or other unique identifier) generated on lakeFS.
  • Optionally allow using the new identifier value to dereference a commit (so using a UUID is probably a bad idea: UUIDs look a bit too much like digests). Probably not in the initial release, as it requires a secondary index.
  • A hard reset of a branch does not modify the ID. Applications (such as replication) that hard-reset branches should only move the branch HEAD along a single history; otherwise they break comparisons by generation.
    • An application can always delete and re-create a branch at a new location if it really wants to "create a new branch" rather than "modify an existing branch".
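
The rules in the list above can be sketched as a toy in-memory model. This is only an illustration, not lakeFS code: `BranchStore`, `Branch`, and the `identifier` field name are hypothetical, and `secrets.token_urlsafe` merely stands in for a nanoID generator.

```python
import secrets
from dataclasses import dataclass


@dataclass
class Branch:
    name: str        # human-readable, may be reused after a delete
    identifier: str  # hypothetical stable identifier, assigned at creation
    head: str        # commit the branch currently points at


class BranchStore:
    """Toy in-memory model of the proposed semantics (not lakeFS code)."""

    def __init__(self):
        self._branches = {}

    def create(self, name, head):
        if name in self._branches:
            raise ValueError(f"branch {name!r} already exists")
        # Every creation gets a fresh identifier, even if the name was used before.
        branch = Branch(name, secrets.token_urlsafe(12), head)
        self._branches[name] = branch
        return branch

    def get(self, name):
        return self._branches[name]

    def hard_reset(self, name, head):
        # A hard reset moves HEAD but keeps the identifier.
        branch = self._branches[name]
        branch.head = head
        return branch

    def delete(self, name):
        del self._branches[name]
```

Under this model, `hard_reset` leaves `identifier` untouched, while `delete` followed by `create` with the same name yields a new one, so an application can tell "the same branch moved" apart from "a different branch with the same name".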

@nopcoder (probably) has additional use-cases and context.

arielshaqed avatar Jun 09 '25 06:06 arielshaqed

Sample use-case

User requirement

Replicate some information from the commits graph to an external database. For instance, add to an RDBMS a mapping $\text{branch\_name} \rightarrow (\text{last\_committer\_user}, \text{total\_lines\_modified})$. Do not slow down commits.

Implementation notes

Use hooks.

The RDBMS might fail, or computing the number of lines changed might be slow, so this cannot be a pre-commit hook. Instead use a post-commit hook! But post-commit actions can overlap, so use the commit generation field to order them. The DDL for the table (untested, sorry, but this will be roughly correct for most SQL variants):

CREATE TABLE last_commit (
  branch_identifier STRING PRIMARY KEY,  -- the new stable identifier
  branch_name STRING,   -- nonunique, but probably index this one too
  generation BIGINT,    -- generation of the commit that wrote this row
  username STRING,      -- last committer
  lines_changed INT
);

During post-commit, compute the number of lines changed, then UPDATE by branch_identifier IF the generation field increases.
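
A minimal sketch of that guarded write, using Python's `sqlite3` so it is easy to try locally. Everything here is an assumption layered on the description above: the SQLite column types, the `lines_changed` column name, the `record_commit` helper, and the use of an upsert (`INSERT ... ON CONFLICT ... DO UPDATE ... WHERE`, available in SQLite 3.24+) so that the first commit on a branch also creates the row. Other databases spell this differently.

```python
import sqlite3

# Assumed schema: the table described above, with SQLite type names.
DDL = """
CREATE TABLE IF NOT EXISTS last_commit (
  branch_identifier TEXT PRIMARY KEY,
  branch_name TEXT,
  generation INTEGER,
  username TEXT,
  lines_changed INTEGER
)
"""

# Insert the row, or update it only if the incoming generation is newer.
UPSERT = """
INSERT INTO last_commit
  (branch_identifier, branch_name, generation, username, lines_changed)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (branch_identifier) DO UPDATE SET
  branch_name = excluded.branch_name,
  generation = excluded.generation,
  username = excluded.username,
  lines_changed = excluded.lines_changed
WHERE excluded.generation > last_commit.generation
"""


def record_commit(conn, branch_identifier, branch_name, generation,
                  username, lines_changed):
    """Hypothetical post-commit handler body: stale generations are no-ops."""
    conn.execute(UPSERT, (branch_identifier, branch_name, generation,
                          username, lines_changed))
    conn.commit()
```

Because the comparison happens inside a single statement, two overlapping post-commit actions can run in either order and the row still ends up describing the commit with the highest generation.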

To look up a branch, fetch rows by name. If rows exist for multiple identifiers, use the one that matches the branch's current identifier.
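
The lookup might look roughly like the sketch below. It assumes a `last_commit` table with the columns described above, and that the caller already obtained `current_identifier` from the proposed new GetBranch field; `fetch_last_commit` is a hypothetical helper. As a slightly stricter variant, it always matches on the identifier, so a single leftover row from a deleted branch of the same name is ignored too.

```python
import sqlite3


def fetch_last_commit(conn, branch_name, current_identifier):
    """Return the row for the live branch with this name, or None.

    Rows left behind by deleted branches that reused the name are skipped.
    """
    rows = conn.execute(
        "SELECT branch_identifier, generation, username, lines_changed "
        "FROM last_commit WHERE branch_name = ?",
        (branch_name,),
    ).fetchall()
    for row in rows:
        if row[0] == current_identifier:
            return row
    return None
```

Returning `None` when no row matches covers the window in which a branch has been re-created but its first post-commit action has not yet written anything.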

This solution never blocks, gives eventual consistency, and handles a branch being deleted and then a new branch created with the same name.

arielshaqed avatar Jun 09 '25 07:06 arielshaqed

@arielshaqed Hello, I'm interested in this issue. Can I work on it? Is there a deadline?

VH992098059 avatar Jul 08 '25 12:07 VH992098059

@nopcoder Hello, I'm interested in this issue. Can I work on it? Is there a deadline?

VH992098059 avatar Jul 10 '25 03:07 VH992098059

@VH992098059 I removed the "good first issue" label from this one. This issue will probably require API design first in order to address it.

nopcoder avatar Jul 10 '25 16:07 nopcoder

@nopcoder Is there a deadline? I can give it a try.

VH992098059 avatar Jul 11 '25 04:07 VH992098059