Add a catch/retry mechanism to lakectl local operations
This enhancement request stems from a bug report about rare failures of lakectl local commits.
@itaiad200 does issue https://github.com/treeverse/lakeFS/issues/7705 close this item?
I just ran into an issue where I was running lakectl local clone on about 4TB of data. It was about half done when one download failed with:
download <path> failed: could not write file '/mnt/data/frames-bbox/<file>': stream error: stream ID 167; INTERNAL_ERROR; received from peer
(I've redacted the actual file path as it contains the name of a confidential partner). The error is likely because the destination volume is an NFS volume and there was some kind of network error.
As a result I then had to cd into that directory and run lakectl local pull, which first required a diff between what it had already pulled and what was in the cloud. Since that was about 2TB of data, and due to this issue, that diff takes a very long time. It's still running now, but I suspect it will end up taking about as long as just deleting everything and rerunning the clone from scratch.
I'm not sure, but I think https://github.com/treeverse/lakeFS/pull/7723 fixed issues with communication with the lakeFS server; I don't think it addressed issues like the above.
In addition to being slow, manual, and annoying, this can be a really significant issue in some cases. We sometimes launch ML training jobs as Argo workflows. The first step in that workflow is to download the relevant training data to faster, local storage. If that step fails, the entire workflow fails. The initial clone can take about 8 hours, so it's a significant problem if it fails after 6 or 7 hours and we have to restart the job.
Just had another one of these after hours of data download. This time it wasn't an NFS volume.
download <path>/camera_b_raw/frames.h5 failed: could not write file '/mnt/data/frames_bbox/<frames>/camera_b_raw/frames.h5': stream ID 407; INTERNAL_ERROR; received from peer
│ Error executing command.
As above, I've redacted the full filename a bit.
As @oliverdain mentioned, #7723 handles retries for lakeFS requests, but it won't address issues like the above, where the response fails to be written to the destination file (here).
As the title suggests, solving this would require retrying the entire apply flow in situations where the failure is not a lakeFS server request (to avoid multiplying the retries).
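For illustration only, here is a rough sketch in Go (lakectl is written in Go) of what a per-object catch/retry wrapper could look like. The function and type names are assumptions, not lakeFS's actual internals: it retries only when the failure looks like a local/stream write error and gives up immediately on lakeFS API request errors, so those retries aren't multiplied.

```go
// Hypothetical sketch, not lakectl's actual code: retry a single object
// download when the failure happens while streaming to the local file,
// as opposed to a lakeFS API request error (assumed to be retried already
// by the HTTP client after #7723).
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// downloadFunc stands in for the per-object "GET object and write it to
// the destination path" step of lakectl local clone/checkout.
type downloadFunc func(ctx context.Context, path string) error

// errAPIRequest marks failures of the lakeFS API request itself; in this
// sketch we assume those are surfaced distinctly so we don't retry them twice.
var errAPIRequest = errors.New("lakeFS API request failed")

// downloadWithRetry re-runs the download up to maxAttempts times when the
// error is a local/stream write failure (e.g. "could not write file ...:
// stream error: ... INTERNAL_ERROR"), backing off between attempts.
func downloadWithRetry(ctx context.Context, dl downloadFunc, path string, maxAttempts int) error {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		lastErr = dl(ctx, path)
		if lastErr == nil {
			return nil
		}
		// Don't multiply retries for API request errors.
		if errors.Is(lastErr, errAPIRequest) {
			return lastErr
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Duration(attempt) * time.Second):
		}
	}
	return fmt.Errorf("download %s failed after %d attempts: %w", path, maxAttempts, lastErr)
}

func main() {
	// Toy usage: a download that fails twice with a stream error, then succeeds.
	failures := 2
	flaky := func(ctx context.Context, path string) error {
		if failures > 0 {
			failures--
			return fmt.Errorf("could not write file %q: stream error: INTERNAL_ERROR", path)
		}
		return nil
	}
	if err := downloadWithRetry(context.Background(), flaky, "/mnt/data/example/frames.h5", 5); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("download succeeded")
}
```

In a real implementation the retry would presumably wrap whatever unit of work the apply flow uses per object, and the backoff/attempt limits would be configurable; the point here is only to show where the catch/retry boundary would sit relative to the existing request-level retries.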
I just had another data transfer fail with:
download <redacted>/camera_b_raw/frames.h5 failed: could not write file '/<redacted>/camera_b_raw/frames.h5': stream error: stream ID 325; INTERNAL_ERROR; received from peer
This time I'm using a regular GKE persistent disk, not an NFS volume, so I think that error is coming from the lakeFS server. This is now the third time in a row that my data transfer has failed after about 10 hours.