
[storage] storage upload failing due to "Argument list too long"

Open SeungjinYang opened this issue 8 months ago • 13 comments

Investigate the following error message, seen when uploading files to Cloudflare R2 storage:

    /bin/bash: /opt/homebrew/bin/aws: Argument list too long

SeungjinYang avatar Apr 18 '25 20:04 SeungjinYang

Tried reproducing this with:

file_mounts:
  /cloudflare:
    name: <bucket-name>
    source: ~/yamls
    store: r2
    mode: MOUNT

This command actually errors out for me with:

    upload failed: ../../yamls/cloudflare.yaml to s3://<bucket-name>/cloudflare.yaml An error occurred (InternalError) when calling the PutObject operation (reached max retries: 2): We encountered an internal error. Please try again.

But I am not getting the error described in the issue.

SeungjinYang avatar Apr 18 '25 20:04 SeungjinYang

The error message above seems to be due to an unrelated issue which should be fixed with https://github.com/skypilot-org/skypilot/pull/5282. Running the command with the PR branch does not show any issues.

SeungjinYang avatar Apr 18 '25 20:04 SeungjinYang

Found an interesting article from the web: https://unix.stackexchange.com/questions/353111/argument-list-is-too-long-for-bin-aws

So it seems like the argument list is literally too long - maybe the directory has too many files or something. I'll see if I can reproduce this.

SeungjinYang avatar Apr 18 '25 21:04 SeungjinYang

https://unix.stackexchange.com/questions/45583/argument-list-too-long-how-do-i-deal-with-it-without-changing-my-command
https://github.com/aws/aws-cli/issues/4486

Long story short, a shell command can fail with "Argument list too long" if the total length of its arguments (measured in characters, not in the number of arguments) exceeds the OS limit.
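
For intuition, here is a minimal, self-contained sketch (not from the skypilot codebase) showing that the limit is on total bytes, using many short arguments:

    import os
    import subprocess

    # ARG_MAX bounds the combined byte length of argv plus the environment.
    arg_max = os.sysconf('SC_ARG_MAX')
    print(arg_max)  # e.g. 1048576 on macOS

    # Many short arguments whose *combined* length exceeds the limit.
    args = ['x' * 1000] * (arg_max // 1000 + 100)
    try:
        subprocess.run(['/bin/echo', *args], check=True)
    except OSError as exc:
        print(exc)  # [Errno 7] Argument list too long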

            sync_command = (
                'AWS_SHARED_CREDENTIALS_FILE='
                f'{cloudflare.R2_CREDENTIALS_PATH} '
                f'aws s3 sync --no-follow-symlinks {excludes} '
                f'{src_dir_path} '
                f's3://{self.name}{sub_path}/{dest_dir_name} '
                f'--endpoint {endpoint_url} '
                # R2 does not support CRC64-NVME
                # which is the default for aws s3 sync
                # https://community.cloudflare.com/t/an-error-occurred-internalerror-when-calling-the-putobject-operation/764905/13
                f'--checksum-algorithm CRC32 '
                f'--profile={cloudflare.R2_PROFILE_NAME}')

For reference, the snippet above is the code that generates the directory sync command.

Not saying this is how it happened for this user specifically, but I could see, for example, the excludes list being long enough to push the whole command over the limit (rough sketch below). I will try to confirm this behavior.
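
To make that concrete, a rough back-of-the-envelope sketch (the entry names and counts here are made up, not from the user's setup):

    import os

    # Hypothetical: each .skyignore entry becomes one '--exclude <pattern>'
    # flag appended to the aws s3 sync command.
    entries = [f'node_modules/package-{i}/**' for i in range(20000)]
    excludes = ' '.join(f"--exclude '{e}'" for e in entries)

    command = ('aws s3 sync --no-follow-symlinks '
               f'{excludes} ~/yamls s3://<bucket-name>/dir')
    print(len(command))              # roughly 800,000 characters
    print(os.sysconf('SC_ARG_MAX'))  # 1048576 on macOS: one busy repo away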

cc. @cg505 for opinions on how this might potentially be solved if it indeed is the excludes being too long - I don't actually have a good idea for a solution if that turns out to be what is happening.

SeungjinYang avatar Apr 18 '25 22:04 SeungjinYang

I don't have any genius insight, since it sounds like the fundamental issue can't really be worked around. A few quick thoughts:

  1. We should double check we aren't doing something stupid like separately excluding every single file if we ignore a directory with a lot of files.
  2. We should give a better error message. Presumably we can check the argument list length against the OS limit before running the command (see the sketch after this list). We could also suggest ulimit -s as a fix if that works, and otherwise ask the user to create a .skyignore with fewer entries.
  3. We could add some logic that splits the command into multiple invocations (e.g. for subdirs) and only include the excludes that are relevant to the specific invocation. I'm pretty skeptical of this idea though. It sounds extremely complicated and is not actually guaranteed to fix every possible case.
  4. Don't use aws s3 for syncing files. E.g. rsync has a --exclude-from=FILE param that would solve this elegantly, but aws s3 does not have it.
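
For thought 2, a minimal sketch of the pre-check, assuming a hypothetical helper (this is not skypilot's actual API):

    import os

    def check_command_length(command: str) -> None:
        """Fail early with a readable error if the command cannot be exec'd."""
        # ARG_MAX covers argv plus the environment, so account for both.
        env_size = sum(len(k) + len(v) + 2 for k, v in os.environ.items())
        arg_max = os.sysconf('SC_ARG_MAX')
        if len(command) + env_size >= arg_max:
            raise ValueError(
                f'Generated sync command is {len(command)} characters, over '
                f'the OS limit (ARG_MAX={arg_max}). Try trimming .skyignore '
                'to fewer entries, or raising the limit with ulimit -s on '
                'Linux, where ARG_MAX is derived from the stack size.')

    # check_command_length(sync_command)  # call before subprocess.run(...)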

cg505 avatar Apr 18 '25 23:04 cg505

Checked and, yep: it was just a long .skyignore blowing up the length of the aws CLI command. This is probably worth logging to the user, or at least mentioning in the docs, since my first instinct was to dump my (very long) .gitignore into .skyignore and make a few tweaks.

turtlebasket avatar Apr 19 '25 05:04 turtlebasket

It seems we're somewhat in luck: we already have a pattern elsewhere for handling exactly this kind of problem:

https://github.com/skypilot-org/skypilot/blob/c975eab356a7b927a63320cf45452c5d3562db87/sky/backends/cloud_vm_ray_backend.py#L3470-L3482

i.e. just dump the command into a file and run that file as a bash script

So presumably we can use the same trick to deal with the "Argument list too long" error here as well.
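
Concretely, the trick would look something like this sketch (helper name is illustrative; whether bash actually dodges the limit is what we'd need to verify):

    import os
    import subprocess
    import tempfile

    def run_via_script(command: str) -> None:
        # Write the oversized command to a temp script so it reaches bash
        # through a file on disk rather than through argv.
        with tempfile.NamedTemporaryFile('w', suffix='.sh',
                                         delete=False) as script:
            script.write('#!/bin/bash\n')
            script.write(command + '\n')
        try:
            subprocess.run(['bash', script.name], check=True)
        finally:
            os.unlink(script.name)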

@turtlebasket let me know if you'd like to give it a go here; otherwise I'll merge a fix within the next two weeks or so.

SeungjinYang avatar Apr 21 '25 17:04 SeungjinYang

> just dump the command into a file and run that file as a bash script

We can try this, but I don't think it will work. bash is subject to the same underlying restriction on command length, since it's basically an OS-level limit on argv. At least that's my understanding. I would love to be wrong!

Edit: maybe I am misunderstanding something given the code you linked, but my guess is that it works around a command-length restriction in SSH rather than the limit we're hitting here.

cg505 avatar Apr 21 '25 18:04 cg505

I tried reproducing it in a quick-and-dirty way. I made a file argmax.sh containing:

    echo "really really really... long command"

    $ ls -l argmax.sh
    -rw-r--r--@ 1 <user>  <group>  3868280 Apr 21 11:19 argmax.sh
    $ getconf ARG_MAX
    1048576    # my command is longer than ARG_MAX
    $ bash argmax.sh    # works

Then modified argmax.sh to run the command through /bin/bash explicitly:

    /bin/bash echo "really really really... long command"

    $ bash argmax.sh
    argmax.sh: line 1: /bin/bash: Argument list too long

So the trick should work.

SeungjinYang avatar Apr 21 '25 18:04 SeungjinYang

Wow! I'm very surprised that works, but great. We can try this for the s3 command then and see if it works.

cg505 avatar Apr 21 '25 21:04 cg505

Update: yes, this indeed does not work. The test above only passed because echo is a shell builtin: bash runs it without exec'ing a new process, so the argv limit never applies. An external command like aws still goes through execve and hits the limit.

SeungjinYang avatar Apr 24 '25 21:04 SeungjinYang

I got the same issue. It seems like the ignore function is totally broken, no? Does the AWS CLI really not support excluding directories without expanding them to every single file?

jbohnslav avatar Apr 30 '25 04:04 jbohnslav

FYI #6115

brodyh avatar Jun 30 '25 21:06 brodyh

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Oct 29 '25 02:10 github-actions[bot]

This issue was closed because it has been stalled for 10 days with no activity.

github-actions[bot] avatar Nov 09 '25 02:11 github-actions[bot]