lakeFS
Add identifier to branches and tags
Branches (and tags) provide human-readable names for commit references, which is great for users. But lakeFS provides CRUD operations on them, so a name alone does not reliably identify one branch or tag over time, and that makes it hard to write applications that handle them. Add an identifier to branches and tags that consistently identifies "the" branch/tag across deletes (and, in future, renames). An explicit goal is to allow applications to compare commits on a branch by their generation.
Usage:
- Return the identifier in a new field from GetBranch / GetTag.
- When creating a new branch/tag, give it a new identifier value: a nanoID (or other unique identifier) generated by lakeFS.
- Optionally allow using the new identifier value to dereference a commit. (For that reason a UUID is probably a bad idea: UUIDs look a bit too much like digests.) Probably not in the initial release, as it requires a secondary index.
- A hard reset of a branch does not modify the ID. Applications (such as replication) that hard-reset branches should only move the branch HEAD along a single history; otherwise they break such comparisons.
- An application can always delete and re-create a branch at a new location, if it really wants to "create a new branch" and not to "modify an existing branch".
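The semantics listed above can be sketched with a small in-memory model. This only illustrates the proposed behavior and is not lakeFS code: `BranchStore`, its method names, and the use of `secrets.token_urlsafe` as a stand-in for a nanoID are all assumptions made for the example.

```python
import secrets
from dataclasses import dataclass


@dataclass
class Branch:
    name: str
    identifier: str  # proposed new field; fixed when the branch is created
    head: str        # commit ID the branch currently points at


class BranchStore:
    """Hypothetical in-memory model of the proposed identifier semantics."""

    def __init__(self):
        self._branches = {}

    def create(self, name, commit_id):
        if name in self._branches:
            raise ValueError(f"branch {name!r} already exists")
        # Stand-in for a nanoID generated by the server at creation time.
        branch = Branch(name, secrets.token_urlsafe(12), commit_id)
        self._branches[name] = branch
        return branch

    def get(self, name):
        return self._branches[name]

    def hard_reset(self, name, commit_id):
        # Moves HEAD only; the identifier is left untouched.
        self._branches[name].head = commit_id

    def delete(self, name):
        del self._branches[name]


store = BranchStore()
original_id = store.create("main", "c1").identifier

store.hard_reset("main", "c2")
id_after_reset = store.get("main").identifier

store.delete("main")
id_after_recreate = store.create("main", "c2").identifier

print(id_after_reset == original_id)     # hard reset: still the same branch
print(id_after_recreate == original_id)  # delete + create: a different branch
```

An application that stored `original_id` could therefore tell a reset branch apart from a re-created branch that merely reuses the name.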
@nopcoder (probably) has additional use-cases and context.
Sample use-case
User requirement
Replicate some information from the commits graph to an external database. For instance, add to an RDBMS a mapping $\mbox{branch_name} \rightarrow (\mbox{last_committer_user}, \mbox{total_lines_modified})$ . Do not slow down commits.
Implementation notes
Use hooks.
An RDBMS might fail, or computing the number of lines changed might be slow, so this cannot be a pre-commit hook. Use a post-commit hook instead. But post-commit actions can overlap, so use the commit generation field to order them. The DDL for the table (untested, sorry, but this will be roughly correct for most SQL variants):
CREATE TABLE last_commit (
    branch_identifier STRING PRIMARY KEY,
    branch_name STRING, -- nonunique, but probably index this one too
    generation BIGINT,
    username STRING,
    lines_changes INT
);
During post-commit, compute the number of lines changed, then UPDATE by branch_identifier IF the generation field increases.
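A minimal sketch of that guarded update, using Python's built-in `sqlite3` so it can run standalone. The column types are adapted to SQLite (`TEXT`/`INTEGER`), and the `record_commit` helper, the `br_abc` identifier, and the sample values are hypothetical. The upsert syntax assumes SQLite 3.24 or later; other databases spell this differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE last_commit (
        branch_identifier TEXT PRIMARY KEY,
        branch_name TEXT,
        generation INTEGER,
        username TEXT,
        lines_changes INTEGER
    )
    """
)
conn.execute("CREATE INDEX last_commit_by_name ON last_commit (branch_name)")


def record_commit(conn, branch_identifier, branch_name, generation, username, lines_changes):
    """Insert the row, or overwrite it only if this commit has a higher generation."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """
            INSERT INTO last_commit
                (branch_identifier, branch_name, generation, username, lines_changes)
            VALUES (?, ?, ?, ?, ?)
            ON CONFLICT (branch_identifier) DO UPDATE SET
                branch_name = excluded.branch_name,
                generation = excluded.generation,
                username = excluded.username,
                lines_changes = excluded.lines_changes
            WHERE excluded.generation > last_commit.generation
            """,
            (branch_identifier, branch_name, generation, username, lines_changes),
        )


# Two post-commit hooks finishing out of order: generation 8 lands before 7.
record_commit(conn, "br_abc", "main", 8, "alice", 120)
record_commit(conn, "br_abc", "main", 7, "bob", 15)

row = conn.execute(
    "SELECT generation, username, lines_changes FROM last_commit"
    " WHERE branch_identifier = ?",
    ("br_abc",),
).fetchone()
print(row)  # the generation-8 row survives; the late generation-7 write is ignored
```

Because the comparison happens inside a single statement, overlapping hooks do not need to coordinate with each other; a stale write simply has no effect.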
When looking up a branch, fetch by name. If rows with multiple identifiers exist for that name, use the one that matches the current branch identifier.
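The lookup could look roughly like the sketch below, again using `sqlite3` with SQLite-flavoured column types. The `last_commit_for` helper and the sample identifiers are hypothetical, and `current_identifier` is assumed to come from the proposed new field on GetBranch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE last_commit ("
    " branch_identifier TEXT PRIMARY KEY, branch_name TEXT,"
    " generation INTEGER, username TEXT, lines_changes INTEGER)"
)
# A branch named "dev" was deleted and re-created, leaving two rows.
conn.executemany(
    "INSERT INTO last_commit VALUES (?, ?, ?, ?, ?)",
    [
        ("br_old", "dev", 12, "alice", 40),
        ("br_new", "dev", 3, "bob", 7),
    ],
)


def last_commit_for(conn, branch_name, current_identifier):
    """Fetch rows by name; if several exist, keep the one for the current branch."""
    rows = conn.execute(
        "SELECT branch_identifier, generation, username, lines_changes"
        " FROM last_commit WHERE branch_name = ?",
        (branch_name,),
    ).fetchall()
    if not rows:
        return None
    if len(rows) == 1:
        return rows[0]
    for row in rows:
        if row[0] == current_identifier:
            return row
    return None  # the current branch has no recorded commit yet


print(last_commit_for(conn, "dev", "br_new"))
```

Rows left behind by deleted branches are simply ignored here; whether and when to clean them up is a separate decision.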
This solution never blocks, gives eventual consistency, and handles a branch being deleted and then a new branch created with the same name.
@arielshaqed Hello, I'm interested in this issue. Can I work on it? Is there a time limit?
@nopcoder Hello, I'm interested in this issue. Can I work on it? Is there a time limit?
@VH992098059 I removed the good first issue label from this one. This issue will probably require API design first in order to address it.
@nopcoder Is there a time limit? I can try it