
Add `apply()` function to `Change` model.

Open easontm opened this issue 4 months ago • 7 comments

One thing I'd like to do in lakeFS is reset some files to a specific point in time: functionally, a data rollback for a subset of my data. In Git I would normally check out a commit with a file name, then add, commit, and push. Since checkout doesn't exist as a concept in lakeFS, an acceptable alternative would be if I could do the equivalent of `git diff $commit1 $commit2 > my.patch` followed by `git apply my.patch`. I'm imagining some Python like this:

ref1 = Reference(...)  # e.g. the commit to roll back to
ref2 = Reference(...)  # e.g. the current branch head
diff = ref1.diff(ref2)
for change in diff:
    change.apply()  # proposed API: apply this change server-side

In my specific use case, I would supply a value to the `prefix` arg of `diff()`, which would let me apply changes only to a specific subset of my data.
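To make the proposal concrete, here is a toy in-memory model of what `Change.apply()` could mean. Everything in it is hypothetical illustration, not the lakeFS SDK: `Change`, `ctype`, and the dict-backed object stores are all made up. Applying every change makes the target match the source ref under the given prefix, which is the subset-rollback behavior described above.

```python
from dataclasses import dataclass

@dataclass
class Change:
    # Hypothetical diff entry, NOT the real lakeFS SDK class.
    ctype: str   # "added", "changed", or "removed" (relative to src)
    path: str
    src: dict    # objects of the ref being rolled back to: path -> bytes
    dst: dict    # objects of the branch being modified: path -> bytes

    def apply(self):
        """Make dst match src at self.path."""
        if self.ctype == "removed":
            self.dst.pop(self.path, None)
        else:  # "added" or "changed": copy the object over
            self.dst[self.path] = self.src[self.path]

def diff(src, dst, prefix=""):
    """Yield changes that, once applied, make dst equal src under prefix."""
    for path in src:
        if path.startswith(prefix):
            if path not in dst:
                yield Change("added", path, src, dst)
            elif src[path] != dst[path]:
                yield Change("changed", path, src, dst)
    for path in list(dst):
        if path.startswith(prefix) and path not in src:
            yield Change("removed", path, src, dst)

# Roll back only the "data/" subset; "logs/" is left untouched.
old = {"data/a": b"1", "data/b": b"2"}                  # state to restore
cur = {"data/a": b"9", "data/c": b"3", "logs/x": b"l"}  # current branch
for change in diff(old, cur, prefix="data/"):
    change.apply()
```

After the loop, `cur` matches `old` for every path under `data/`, while `logs/x` keeps its current value.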

easontm avatar Aug 06 '25 01:08 easontm

How much data are you restoring? Specifically, are you fine with round-tripping the data through your machine, which both of your examples do? I'm asking because avoiding the download and re-upload is tricky when GC is running concurrently.

arielshaqed avatar Aug 06 '25 03:08 arielshaqed

I was trying to avoid round-tripping the data because all of our lakeFS Python operations run on k8s pods, which require volume provisioning ahead of time. I know I can accomplish this by reading the output of `Reference(revert_commit).objects()` and re-uploading to a new branch, then merging.

What do you mean by "both examples [do downloading and re-uploading]"? Obviously the Git example does, but as far as I can tell `diff()` does not, and this fictional `apply()` does not exist. Are you saying its implementation would necessarily require this?

The reason I theorized that this could be implemented without asking the client to download/upload is that the lakeFS instance can already calculate diffs, create and track uncommitted changes, create commits, and merge, all from the UI.

easontm avatar Aug 07 '25 02:08 easontm

From my understanding, the use case here is a repo with a few folders, where you want to roll back only a specific folder. Is there an easier way to do this that doesn't require expanding the feature?

iddoavn avatar Aug 20 '25 00:08 iddoavn

Other than the full object download/upload, the only other method I've theorized is making a "purge" commit (deleting all objects in a folder) before every merge, so that the merge carries the full diff of the files I want and I can cherry-pick. However, that doubles the size of my mainline commit list, and all the additions are essentially garbage.
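For reference, the purge-before-merge idea above can be sketched as follows. The branch object here is a duck-typed stand-in so the sketch runs without a lakeFS server; `objects()`, `delete_object()`, and `commit()` are assumed method names, not necessarily the real lakeFS SDK API.

```python
def purge_prefix(branch, prefix):
    """Delete every object under `prefix` on `branch`, then commit.

    Sketch of the "purge commit" workaround: the following merge then
    carries the complete state of the prefix, including removals.
    """
    for obj in list(branch.objects(prefix=prefix)):
        branch.delete_object(obj.path)
    return branch.commit(message=f"purge {prefix} before merge")

# Minimal in-memory stand-ins for the branch and its object listings.
class Obj:
    def __init__(self, path):
        self.path = path

class StubBranch:
    def __init__(self, paths):
        self.paths = set(paths)
        self.commits = []
    def objects(self, prefix=""):
        return [Obj(p) for p in self.paths if p.startswith(prefix)]
    def delete_object(self, path):
        self.paths.discard(path)
    def commit(self, message):
        self.commits.append(message)
        return message

b = StubBranch(["data/a", "data/b", "logs/x"])
purge_prefix(b, "data/")  # only "logs/x" survives the purge
```

The downside is exactly as described above: every merge is preceded by an extra commit whose contents are throwaway.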

easontm avatar Aug 20 '25 00:08 easontm

I understand the workarounds above might be too costly. We have discussed this internally and are trying to find the path of least resistance. Would it be possible to isolate the path you care about into a separate repository? Combined with storing metadata in the commits to link the two repositories' versions, this approach might solve the problem.

iddoavn avatar Sep 02 '25 18:09 iddoavn

One advantage of doing this is that it creates a useful plumbing API and command. For instance, I can imagine scripting a "rebase" command like this. Or, with some scope creep, it could support more programmable export and import commands. The challenging bit would be designing a maximally useful API without making it too large.

arielshaqed avatar Sep 02 '25 19:09 arielshaqed

@easontm I think this workaround should be fine for the time being: use the Python reader to read one file at a time, keep it in memory, write the file to a new branch, and merge.
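A minimal sketch of that workaround, with duck-typed stand-ins so it runs without a lakeFS server. The `objects()`/`object()`/`reader()`/`upload()` names are modeled loosely on the lakeFS Python SDK and are assumptions here; check the SDK docs for the exact signatures.

```python
import io

def restore_prefix(src_ref, dst_branch, prefix):
    """Copy every object under `prefix` from `src_ref` onto `dst_branch`,
    one object at a time, holding each in memory.

    After this runs, commit on `dst_branch` and merge as usual.
    """
    for info in src_ref.objects(prefix=prefix):
        with src_ref.object(info.path).reader() as r:
            data = r.read()  # round-trips through this process's memory
        dst_branch.object(info.path).upload(data)

# In-memory stand-ins for a ref/branch backed by a path -> bytes dict.
class StubObj:
    def __init__(self, store, path):
        self.store, self.path = store, path
    def reader(self):
        return io.BytesIO(self.store[self.path])
    def upload(self, data):
        self.store[self.path] = data

class StubRef:
    def __init__(self, store):
        self.store = store
    def objects(self, prefix=""):
        return [StubObj(self.store, p) for p in self.store if p.startswith(prefix)]
    def object(self, path):
        return StubObj(self.store, path)

src = StubRef({"data/a": b"old-a", "logs/x": b"keep"})
dst = StubRef({"data/a": b"new-a"})
restore_prefix(src, dst, "data/")  # dst's "data/" now matches src's
```

Only one object is held in memory at a time, which sidesteps the volume-provisioning issue mentioned earlier in the thread, at the cost of the round trip.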

kesarwam avatar Sep 04 '25 16:09 kesarwam