lakeFS
Move or copy datasets inside a repo
Created at: 2023-11-10T17:46:02.000Z
Priority: priority:unknown
PRD: https://github.com/treeverse/lakeFS/issues/6015
I just had to re-organize some data. Specifically, I moved all the assets from one directory to a different directory. On the local file system it was a simple `mv dir_a sub/dir/` and ran essentially instantly. But to do that in lakeFS I had to (1) `lakectl local clone` part of my repo, (2) do the move, and (3) `lakectl local commit`. In all, that took well over an hour, as it was a few hundred GB of data. I believe it should be possible to do this in O(1) via `lakectl fs mv lakefs://.../dir_a lakefs://.../sub/dir/` without needing to clone or re-upload.
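For concreteness, the workaround looks roughly like this (repo name, branch, and local paths here are hypothetical):

```sh
# 1. Clone the relevant part of the repo to a local directory (downloads the data).
lakectl local clone lakefs://example-repo/main/ ./work

# 2. Do the move on the local file system.
cd work
mv dir_a sub/dir/

# 3. Commit, which re-uploads the moved objects.
lakectl local commit . -m "Move dir_a under sub/dir/"
```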
Note: I tried the `aws s3 mv` route as described in the original feature request, and it works but is incredibly slow. I have about 4 TB of data I'd like to move (just changing the directory path), and it appears that's going to take about 5 hours.
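That route goes through lakeFS's S3-compatible gateway; with a hypothetical endpoint and repo/branch names, it looks something like:

```sh
# Each object is copied and then deleted individually, which is why this
# crawls at multi-TB scale.
aws s3 mv --recursive \
    --endpoint-url https://lakefs.example.com \
    s3://example-repo/main/dir_a/ \
    s3://example-repo/main/sub/dir/
```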
Update: it ran for about 12 hours and then started randomly failing on some files. Attempts to retry failed on the same files.
I think doing a rename like this server side is probably a single SQL UPDATE query since the metadata is all stored in Postgres and the actual data blobs don't need to move at all.
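For illustration, the kind of statement I have in mind (against a made-up `entries` table, not lakeFS's actual schema):

```sh
# Hypothetical: prepend the new prefix to every path under dir_a/.
# lakeFS does not actually store staging metadata this way (see the response below).
psql -c "UPDATE entries SET path = 'sub/dir/' || path WHERE path LIKE 'dir_a/%';"
```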
Technical response on why things are the way they are
Details of internals, feel free to skip!
This comment:

> I think doing a rename like this server side is probably a single SQL UPDATE query since the metadata is all stored in Postgres and the actual data blobs don't need to move at all.

is really important for understanding why this is non-trivial, because it is almost entirely true! Initial versions of lakeFS supported moves.
The thing is, an RDBMS such as Postgres won't scale to the desired performance levels. ACID transactions would make this so much simpler! But they make reads about as slow as writes: the database must ensure that no concurrent write introduces a hazard.

Instead we use a key-value store for lakeFS staging. That database model only allows atomic operations on a single key, whereas a move obviously needs consistency across two keys. The place where we actually lose is that lakeFS cannot safely garbage-collect uncommitted objects while such a two-key operation may be in flight.
Actual question about requirements
While we cannot safely accommodate every scenario, perhaps there are some that we could?
As an example, committed branch HEADs are immune to garbage collection, so it may be possible to support a rename operation from a committed object on branch head to a new name on that same branch. And other scenarios may also be possible.
With that in mind, users could help us narrow down the scope! Specifically, I am asking which of these limitations you could accept:
- Can the source always be a committed object? (For a move that obviously implies that that object is on the branch head.)
- If uncommitted, can the source always be new-ish, and thus not yet eligible for garbage collection?
- Is it acceptable for the operation to try to move safely, but fail if a safe move is impossible?
I am looking for a way to limit which sources are allowed, in some way that will allow us to add a safe API. The worry, of course, is that such an API will be unusably complex.
Of course this is not a design, and we might not be able to accommodate your important scenarios. I would like to understand actual requirements in detail.
Thanks! I appreciate having a better understanding of why this is hard.
> Can the source always be a committed object? (For a move that obviously implies that that object is on the branch head.)
Yes, I think that's fine.
> If uncommitted, can the source always be new-ish, and thus not yet eligible for garbage collection?

I think committed-only is fine, but I can see a case for recent uncommitted objects off a branch head (e.g. "oops, that was a typo" for a newly created object).
> Is it acceptable for the operation to try to move safely, but fail if a safe move is impossible?

Yes, especially if the error message is clear, e.g. "Unable to move safely, but committed objects can always be moved safely. Try committing to a branch and then moving".
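E.g. something like this (names are hypothetical, and `lakectl fs mv` is the proposed command, not an existing one):

```sh
# Pin the objects by committing them to the branch HEAD first...
lakectl commit lakefs://example-repo/main -m "Pin dir_a before moving"

# ...then a metadata-only server-side move would be safe to attempt.
lakectl fs mv lakefs://example-repo/main/dir_a/ lakefs://example-repo/main/sub/dir/
```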
> I would like to understand actual requirements in detail.
In my particular case it's a simple matter of data organization. We put a bunch of files in `/some/directory/path` and then realized a better organization would be `/some/directory/with-intermediate/path` or `/other/location/entirely`. On a local file system my use case has (so far) pretty much always boiled down to a single `mv` of a directory, and that directory was always something committed to the HEAD of a branch.
Thanks for considering this!