
[Discovery] Commit only a subset of the staging area

Open johnnyaug opened this issue 4 years ago • 9 comments

EDIT: following @arielshaqed's comment, changed the issue to be a user requirement without the implementation details.

As a user, I want to be able to commit only a subset of the staging area.

This can be done by either:

  1. Adding a set of prefixes to the commit API. The commit will only include objects whose paths start with one of the prefixes.
  2. Implementing a git-style staging mechanism, where the user stages objects prior to the commit.
  3. Some other way.
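Option 1 can be sketched as a simple prefix filter over the uncommitted paths. This is a minimal illustration only: `split_by_prefixes` and its parameters are hypothetical names, not part of the lakeFS API, and the sketch assumes uncommitted changes are available as a flat list of object paths.

```python
from typing import Iterable, List, Tuple


def split_by_prefixes(
    uncommitted_paths: Iterable[str], prefixes: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split uncommitted paths into (to_commit, remaining).

    Hypothetical helper: a path is selected for the commit if it starts
    with at least one of the given prefixes.
    """
    prefix_tuple = tuple(prefixes)  # str.startswith accepts a tuple of prefixes
    to_commit: List[str] = []
    remaining: List[str] = []
    for path in uncommitted_paths:
        if path.startswith(prefix_tuple):
            to_commit.append(path)
        else:
            remaining.append(path)
    return to_commit, remaining
```

With prefixes `["analytics/"]`, only paths under `analytics/` would go into the commit, and everything else would stay uncommitted on the branch.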

After completing this task, a UI for it can be developed on top of #2174 - by adding checkboxes to the tree in the Uncommitted Changes view.

johnnyaug avatar Oct 03 '21 15:10 johnnyaug

This is not very git-like. git distinguishes "working tree" from "staging", and so git add ... can do what this issue suggests. But it is a separate operation from commit.

This is important because otherwise you cannot really create a convenient GUI:

  • Only the client can keep track of objects marked for commit, so if it crashes or reloads unexpectedly all work is lost.
  • Multiple users who can see the same branch cannot share their work.
  • Only the GUI can show diffs between HEAD, to-be-committed ("staging") and not-to-be-committed ("working tree").
  • All logic will live in the GUI, and will need to be developed separately and with possibly different behaviour for other clients (such as the CLI).

I submit that we should try to be like Git wherever we can.

Can we consider something like adding a "more" staging area? E.g., as in git, this is a graveler object (possibly just a delta of a graveler metarange object, although TBF it probably shouldn't matter that much) which is linked to the branch. It does not have a parent (it's not a commit, it is just a metarange). And we can add and remove files on it.
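A minimal in-memory sketch of that idea, assuming a branch keeps three separate maps (committed HEAD, working tree, and a staging set linked to the branch). All names here (`Branch`, `write`, `add`, `unstage`, `commit`) are hypothetical and only illustrate the semantics; this is not how graveler stores metaranges.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Branch:
    # path -> object id; None marks a deletion in working/staged
    head: Dict[str, str] = field(default_factory=dict)
    working: Dict[str, Optional[str]] = field(default_factory=dict)
    staged: Dict[str, Optional[str]] = field(default_factory=dict)

    def write(self, path: str, object_id: Optional[str]) -> None:
        """Record an uncommitted change in the working tree."""
        self.working[path] = object_id

    def add(self, path: str) -> None:
        """Move a change from the working tree to staging (like `git add`)."""
        self.staged[path] = self.working.pop(path)  # KeyError if nothing to add

    def unstage(self, path: str) -> None:
        """Move a staged change back, unless the working tree has a newer one."""
        change = self.staged.pop(path)
        self.working.setdefault(path, change)

    def commit(self) -> None:
        """Apply only the staged changes to HEAD; the working tree is untouched."""
        for path, object_id in self.staged.items():
            if object_id is None:
                self.head.pop(path, None)
            else:
                self.head[path] = object_id
        self.staged.clear()
```

Because the staging set lives with the branch rather than in a client, a GUI and a CLI would both see the same to-be-committed changes.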

arielshaqed avatar Oct 03 '21 15:10 arielshaqed

@arielshaqed, I tend to agree. Let me edit the issue accordingly.

johnnyaug avatar Oct 03 '21 15:10 johnnyaug

Adding more context around why this feature would be helpful.

While using lakeFS for a typical data-ingestion use case, I copy the raw data into a staging area (staging-branch/raw/dt=2022-10-11/sampledata.json), do data exploration, run transformations and aggregations, and write the aggregated data to the staging branch itself (staging-branch/analytics/sampledata-by-country-parquet/partition0-xyz.snappy.parquet). Then I run tests on /analytics and need to merge only the /analytics dir into my prod. After the merge, I'd clean up the staging area (i.e., delete the staging-branch).

In the data teams I worked with, the best practice was to never work on the data in the ingress location directly. We always copied the data into a staging area for processing. I tried to simulate the same for a lakeFS demo and realized that "git add"-like functionality would be helpful, so that I can choose a portion of my changes for a commit.

vinodhini-sd avatar Nov 10 '22 19:11 vinodhini-sd

+1 for this. Using lakeFS as a new user and working along the "git for data" mental model that I'd been given, I instinctively reached for git add when I wanted to commit part of my working copy but not all of it.

My use case was to provide a more accurate commit history (Added data from x, Added data from y) instead of a whole lump (Added a bunch of data from x and y). I could serialise the process (add data from x, commit, add data from y, commit) but this breaks the mental model that I have from git.

rmoff avatar Mar 01 '23 10:03 rmoff

+1 for this, from a paying user. I agree with @rmoff and others, without this you're not really "git for data".

gburd avatar Mar 02 '23 14:03 gburd

This issue is now marked as stale after 90 days of inactivity, and will be closed soon. To keep it, mark it with the "no stale" label.

github-actions[bot] avatar Nov 01 '23 14:11 github-actions[bot]

Closing this issue because it has been stale for 7 days with no activity.

github-actions[bot] avatar Nov 12 '23 01:11 github-actions[bot]

One use-case from a user on Slack (thanks!): a Delta (or Iceberg) writer could use a partial commit. First put the catalog file if absent, then commit only the data files that this catalog version uses.

This achieves blockstore-level write concurrency - other content writers just need to write the catalog and can implement full commits.
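A rough sketch of the selection step for the Delta case, assuming the catalog file is a Delta transaction log entry in which each line is a JSON action and `add` actions carry a `path`. The function name and the idea of passing in the set of uncommitted paths are hypothetical; no lakeFS API is used here.

```python
import json
from typing import Iterable, Set


def data_files_to_commit(delta_log_entry: str, uncommitted_paths: Iterable[str]) -> Set[str]:
    """Return the uncommitted paths that the given Delta log entry adds.

    Hypothetical helper: parses newline-delimited JSON actions and keeps
    only `add` actions whose path is currently uncommitted.
    """
    uncommitted = set(uncommitted_paths)
    selected: Set[str] = set()
    for line in delta_log_entry.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        action = json.loads(line)
        add = action.get("add")
        if add is not None and add.get("path") in uncommitted:
            selected.add(add["path"])
    return selected
```

Data files written by other concurrent writers would stay uncommitted until their own catalog version is put and committed.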

arielshaqed avatar Apr 09 '24 15:04 arielshaqed

Use-case: our team develops a low-code platform that consists of design time and run time. Design time is a Web IDE where a user develops BPM schemas, integrations, data models, etc. All these artifacts are just files, and by design the user is able to track every file. The Web IDE will be provided as PaaS, so we decided to use lakeFS as git (a classic VCS). Committing a subset of the staging area is therefore a critical feature for us. Its absence, as mentioned above, indeed breaks the mental model of using git.

lookeme avatar May 13 '24 14:05 lookeme