[storage] storage upload failing due to "Argument list too long"
Investigate the following error message when uploading files to Cloudflare R2 storage:

```console
/bin/bash: /opt/homebrew/bin/aws: Argument list too long
```
Tried reproducing this with:

```yaml
file_mounts:
  /cloudflare:
    name: <bucket-name>
    source: ~/yamls
    store: r2
    mode: MOUNT
```
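(For context, a mount block like this lives in a task YAML and the upload is triggered on launch; the filename below is just a placeholder:)

```console
$ sky launch repro.yaml
```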
This command actually errors out for me with

```text
upload failed: ../../yamls/cloudflare.yaml to s3://<bucket-name>/cloudflare.yaml An error occurred (InternalError) when calling the PutObject operation (reached max retries: 2): We encountered an internal error. Please try again.
```
but I am not getting the error described in the issue.
The error message above seems to be caused by an unrelated issue, which should be fixed by https://github.com/skypilot-org/skypilot/pull/5282; running the command on that PR branch does not show any issues.
Found an interesting article from the web: https://unix.stackexchange.com/questions/353111/argument-list-is-too-long-for-bin-aws
So it seems like the argument list is literally too long - maybe the directory has too many files or something. I'll see if I can reproduce this.
More references on the same failure mode:
- https://unix.stackexchange.com/questions/45583/argument-list-too-long-how-do-i-deal-with-it-without-changing-my-command
- https://github.com/aws/aws-cli/issues/4486
Long story short, a shell command can fail with "Argument list too long" if the combined length of its arguments (measured in bytes, not the number of arguments) exceeds the OS limit, `ARG_MAX`.
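A quick way to see the limit in action (a generic demo, unrelated to SkyPilot; `/bin/echo` is just a stand-in for any external binary):

```console
$ getconf ARG_MAX
1048576
$ /bin/echo $(python3 -c "print('x' * 2000000)")
bash: /bin/echo: Argument list too long
```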
For reference, this is the code snippet that generates the directory sync command:

```python
sync_command = (
    'AWS_SHARED_CREDENTIALS_FILE='
    f'{cloudflare.R2_CREDENTIALS_PATH} '
    f'aws s3 sync --no-follow-symlinks {excludes} '
    f'{src_dir_path} '
    f's3://{self.name}{sub_path}/{dest_dir_name} '
    f'--endpoint {endpoint_url} '
    # R2 does not support CRC64-NVME,
    # which is the default for `aws s3 sync`:
    # https://community.cloudflare.com/t/an-error-occurred-internalerror-when-calling-the-putobject-operation/764905/13
    f'--checksum-algorithm CRC32 '
    f'--profile={cloudflare.R2_PROFILE_NAME}')
```
Not saying this is how it happened for this user specifically, but I could see, for example, the excludes list being extremely long and pushing the whole command over the limit. I will try to confirm this behavior.
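To make that concrete, here is a rough sketch of how the command length scales with `.skyignore` (`build_excludes` below is my own hypothetical helper, not the actual SkyPilot code):

```python
# Hypothetical reconstruction: each ignore pattern becomes its own
# --exclude flag, so the flattened command grows linearly with the
# number of .skyignore entries.
def build_excludes(patterns: list) -> str:
    return ' '.join(f"--exclude '{p}'" for p in patterns)

# A gitignore-sized ignore file with thousands of patterns easily pushes
# the assembled `aws s3 sync` string past ARG_MAX (~1 MB on macOS).
excludes = build_excludes([f'build/output-{i}/*' for i in range(20000)])
print(len(excludes))  # several hundred KB of argv from excludes alone
```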
cc @cg505 for opinions on how this might be solved if it is indeed the excludes being too long; I don't actually have a good idea for a solution if that turns out to be what is happening.
I don't have any genius insight, since it sounds like the fundamental issue can't really be worked around. A few quick thoughts:
- We should double-check we aren't doing something stupid like separately excluding every single file when we ignore a directory with a lot of files.
- We should give a better error message. Presumably we can check the argument list against the limit before running the command (see the sketch after this list). We can also maybe suggest a `ulimit -s` command as a fix if it works, and otherwise ask the user to create a `.skyignore` with fewer entries.
- We could add some logic that splits the command into multiple invocations (e.g. per subdirectory) and only include the excludes relevant to the specific invocation. I'm pretty skeptical of this idea, though; it sounds extremely complicated and is not actually guaranteed to fix every possible case.
- Don't use `aws s3` for syncing files. E.g. `rsync` has an `--exclude-from=FILE` param that would solve this elegantly, but `aws s3` does not have it.
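For the second bullet, a pre-flight check could look roughly like this (a minimal sketch; `check_command_length` and the 4 KB headroom are my own invention, not existing SkyPilot code):

```python
import os

def check_command_length(command: str) -> None:
    # ARG_MAX bounds argv plus the environment, so leave some headroom.
    arg_max = os.sysconf('SC_ARG_MAX')  # e.g. 1048576 on macOS
    if len(command.encode()) > arg_max - 4096:
        raise ValueError(
            f'Sync command is {len(command)} bytes, exceeding the OS '
            f'argument limit ({arg_max}). Consider trimming .skyignore '
            'to fewer, broader patterns.')
```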
Checked, and yep: it was just a long `.skyignore` blowing up the length of the aws CLI command. This is probably an important thing to log to the user, or at least mention in the docs, since my first instinct was just to dump my (very long) `.gitignore` into `.skyignore` and make a few tweaks.
It seems we're somewhat in luck: we already have a pattern elsewhere for handling exactly this kind of problem
https://github.com/skypilot-org/skypilot/blob/c975eab356a7b927a63320cf45452c5d3562db87/sky/backends/cloud_vm_ray_backend.py#L3470-L3482
i.e. just dump the command into a file and run that file as a bash script.
So presumably we can use the same trick to deal with the "Argument list too long" error here as well.
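Concretely, the trick would look something like this (a sketch only; as the discussion below shows, it turns out not to help here):

```python
import subprocess
import tempfile

# Sketch of the linked pattern: write the long command to a script file
# and invoke bash on the file, so the command text travels through the
# filesystem instead of this process's argv.
sync_command = 'echo placeholder for the very long aws s3 sync command'
with tempfile.NamedTemporaryFile('w', suffix='.sh', delete=False) as script:
    script.write(sync_command)
subprocess.run(['bash', script.name], check=True)
```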
@turtlebasket let me know if you'd like to give it a go here, else I'll merge the fix in within the next two weeks or so.
> just dump the command into a file and run that file as a bash script
We can try this, but I don't think it will work. bash is subject to the same underlying restriction on command length, since it's basically an OS-level limit on argv. At least that's my understanding; I would love to be wrong!
Edit: Maybe I am misunderstanding something given the code you linked, but my guess is that it's working around a restriction in SSH rather than the restriction we're hitting.
I tried reproducing it in a quick and dirty way. I made this file `argmax.sh`:

```bash
echo "really really really... long command"
```

```console
$ ls -l argmax.sh
-rw-r--r--@ 1 <user> <group> 3868280 Apr 21 11:19 argmax.sh
$ getconf ARG_MAX
1048576  # my command is longer than ARG_MAX
$ bash argmax.sh  # works
```

Then modified `argmax.sh` to:

```bash
/bin/bash echo "really really really... long command"
```

```console
$ bash argmax.sh
argmax.sh: line 1: /bin/bash: Argument list too long
```
So the file trick should work: running the long command directly from a script succeeds, while routing it through an external binary's argv reproduces the exact error from the issue.
wow! I'm very surprised that works, but great. We can try to do this for the s3 command then and see if it works.
Update: yes, this indeed does not work. The reason the experiment above worked is that `echo` is a shell builtin command and behaves differently.
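The builtin-vs-external distinction is easy to check (quick demo):

```console
$ type echo
echo is a shell builtin
$ type /bin/echo
/bin/echo is /bin/echo
```

A builtin runs inside the already-running shell process, so there is no `execve()` and the kernel's argv size check (`E2BIG`) never applies; anything that execs an external binary, like `aws`, still hits it.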
I got the same issue. It seems like the ignore function is totally broken, no? Does the AWS CLI really not support passing directories without expanding to every single file?
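For what it's worth, `aws s3 sync` takes glob patterns rather than expanded file lists, so directories aren't the problem per se; the blow-up comes from the number of patterns, presumably one `--exclude` per `.skyignore` line:

```console
$ aws s3 sync ~/yamls s3://<bucket-name> --exclude 'node_modules/*' --exclude '*.pyc'
```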
FYI #6115
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This issue was closed because it has been stalled for 10 days with no activity.