
[Bug]: lakectl fs upload Causes OOM Error During Large File Uploads

Open · andrijdavid opened this issue 1 year ago

What happened?

The lakectl fs upload command causes an Out-Of-Memory (OOM) error during the upload of large files, resulting in the process being killed by the kernel or the OS freezing.

Environment:

  • Memory: 16GB
  • Storage: Local filesystem with multiple folders, each ~200GB.
  • Repository: Google Cloud Storage (GCS)

Steps to Reproduce:

  • Attempt to upload a large directory with multiple small files using lakectl fs upload:
    lakectl fs upload --source . --recursive "lakefs://${LAKEFS_REPO_NAME}/${DEFAULT_BRANCH}/" --pre-sign -p 8
    Reducing -p does not solve the issue.
  • Observe that the system either encounters an OOM error or freezes.

Expected behavior

File uploaded successfully

lakeFS version

1.31.1

How lakeFS is installed

GCP

Affected clients

All

Relevant log output

22352 Killed lakectl fs upload --source . --recursive "lakefs://${LAKEFS_REPO_NAME}/${DEFAULT_BRANCH}/" --pre-sign -p 8

Contact details

No response

andrijdavid avatar Aug 21 '24 09:08 andrijdavid

The upload mechanism, for both pre-signed URLs and direct uploads, buffers the data in memory on the client side, which is not ideal for large files and triggers OOM when uploading them.

https://github.com/treeverse/lakeFS/blob/08fbdf21794ce61f4615a4e8f53248b1014d51fe/cmd/lakectl/cmd/fs_upload.go#L98

https://github.com/treeverse/lakeFS/blob/08fbdf21794ce61f4615a4e8f53248b1014d51fe/pkg/api/helpers/upload.go#L40
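For illustration only, here is a minimal Go sketch (not the actual lakectl code; presignedURL and path are placeholder names) of an upload that streams the file from disk instead of reading it fully into memory, so memory use stays roughly constant regardless of file size:

package uploadsketch

import (
	"fmt"
	"net/http"
	"os"
)

// uploadStreaming PUTs a local file to a pre-signed URL, handing the
// *os.File directly to net/http so the body is read from disk as it is
// sent, instead of being buffered in memory first.
func uploadStreaming(presignedURL, path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	stat, err := f.Stat()
	if err != nil {
		return err
	}

	req, err := http.NewRequest(http.MethodPut, presignedURL, f)
	if err != nil {
		return err
	}
	// Setting ContentLength lets net/http advertise the file size up
	// front instead of falling back to chunked transfer encoding; the
	// body is still read from disk on the fly.
	req.ContentLength = stat.Size()

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("upload failed: %s", resp.Status)
	}
	return nil
}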

andrijdavid avatar Aug 24 '24 19:08 andrijdavid

We also got "lakectl" killed by the local host's kernel because it was trying to use more memory than was available (without using the "-p" option).

On a computer with 32GB of RAM (with 15GB already taken by other processes), we were finally able to commit 7.45GB binary files with "-p 1".

We think we will not be able to ingest larger binary files.

It seems that for binary files of about 7GB, lakectl needs a little more than 2x the size of the large binary file per concurrent process requested (if "-p" is not specified, the default seems to be 25).

e.g.: if p=8 and the folder contains only 10GB binary files, we should expect "lakectl" to require 8 x 10 x 2 = 160GB of RAM to avoid being killed when trying to upload (commit) the folder.
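As a rough sketch of that rule of thumb in Go (this is our own estimate from these observations, not a documented lakectl property):

package estimate

// EstimatePeakRAMGB applies the rule of thumb observed above:
// peak RAM ≈ concurrency × largest file size × 2.
// With -p 8 and 10GB files this gives 8 * 10 * 2 = 160 GB.
func EstimatePeakRAMGB(concurrency int, fileSizeGB float64) float64 {
	return float64(concurrency) * fileSizeGB * 2
}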

Is that right? Are there options, or plans, to allow ingestion of large binary files (files larger than the computer's RAM)?

dvnicolasdh avatar Sep 14 '24 23:09 dvnicolasdh

Hi @andrijdavid,

A couple of questions:

  1. What's the max size of each object in the directories you are trying to upload?
  2. Do you get the same error when uploading a single file?
  3. Do you get the same error when running with --pre-sign=false?
  4. What OS do you use?

Sorry to bother you, but I want to understand the exact issue you faced, as there are many options.

idanovo avatar Oct 28 '24 15:10 idanovo

@andrijdavid @dvnicolasdh Thanks for reporting this issue. I think we found the cause; it's related to a bug in the go-retryablehttp package we use: it reads files into memory instead of streaming them.
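To illustrate the general failure mode (a sketch, not the exact lakectl code path): a retrying client has to be able to replay the request body, and when the body arrives as a plain io.Reader the simplest way to make it replayable is to buffer it completely, so each in-flight upload can end up holding a full copy of the file in memory:

package retrysketch

import (
	"io"
	"net/http"

	"github.com/hashicorp/go-retryablehttp"
)

// uploadWithRetries sends the body through the retryable client.
// If body is only an io.Reader (not seekable), the client must make it
// replayable for retries, which can mean buffering the whole body in
// memory before the first attempt -- roughly one full copy of the file
// per concurrent upload.
func uploadWithRetries(url string, body io.Reader) (*http.Response, error) {
	req, err := retryablehttp.NewRequest(http.MethodPut, url, body)
	if err != nil {
		return nil, err
	}
	client := retryablehttp.NewClient()
	return client.Do(req)
}

Disabling the retryable client avoids that wrapping, which is why the workaround below helps.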

As a temporary workaround, until we release a new version with a fix, you can set lakectl not to use the retryable client by:

  1. Adding this to your lakectl.yaml file:

     server:
       retries:
         enabled: false

  2. Or running lakectl with this env var: LAKECTL_SERVER_RETRIES_ENABLED=false

Can you please try this and let me know if it solved your issue?

idanovo avatar Oct 29 '24 17:10 idanovo

@andrijdavid @dvnicolasdh The issue has been resolved and the fix will be included in the next lakeFS release. Thanks again for reporting this one.

idanovo avatar Nov 09 '24 18:11 idanovo