s5cmd
Using GCS multipart upload is not supported.
Uploading files larger than the part size will cause s5cmd to use multipart uploads, which GCS does not support. The current workaround is to set:
-c=1 -p=1000000
when copying, to prevent the file from being split into parts.
The error is also misleading:
InvalidArgument: Invalid argument. status code: 400, request id: , host id:
I'm encountering this issue as well, specifically when uploading to a GCS bucket. The following fails with the misleading InvalidArgument error:
s5cmd --endpoint-url https://storage.googleapis.com cp test_30GB_file s3://test-gcs-bucket/
while the following succeeds:
s5cmd --endpoint-url https://storage.googleapis.com cp -c=1 -p=1000000 test_30GB_file s3://test-gcs-bucket/
Thanks for the report.
s5cmd treats S3 as a first-class citizen because it's been designed to communicate with S3. Naturally, it uses the official AWS SDK to do so. s5cmd can access GCS through its S3-compatible (well, mostly) API gateway.
If you try to upload a file to an object store, s5cmd splits the file and uploads the parts in parallel to achieve maximum throughput, following the S3 Multipart Upload API contract. The problem is that GCS doesn't support multipart upload in its S3-compatible API, and the misleading error you're encountering is the result of that missing support.
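For illustration, here's a minimal sketch (not s5cmd's actual code) of that contract, using the official AWS SDK for Go that the tool builds on. The bucket and key names are placeholders, and GCS HMAC interoperability credentials are assumed to be available in the environment:

```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{
		Region:   aws.String("us-east-1"),
		Endpoint: aws.String("https://storage.googleapis.com"),
	}))
	svc := s3.New(sess)

	// Placeholder names, mirroring the report above.
	bucket, key := aws.String("test-gcs-bucket"), aws.String("test_30GB_file")

	// Step 1: initiate the multipart upload. This is the call GCS's
	// S3-compatible API rejected, surfacing as "InvalidArgument ... 400".
	mpu, err := svc.CreateMultipartUpload(&s3.CreateMultipartUploadInput{
		Bucket: bucket, Key: key,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Step 2: upload each part (in parallel, in the real implementation).
	part, err := svc.UploadPart(&s3.UploadPartInput{
		Bucket: bucket, Key: key,
		UploadId:   mpu.UploadId,
		PartNumber: aws.Int64(1),
		Body:       strings.NewReader("part data..."),
	})
	if err != nil {
		log.Fatal(err)
	}

	// Step 3: stitch the parts together into the final object.
	_, err = svc.CompleteMultipartUpload(&s3.CompleteMultipartUploadInput{
		Bucket: bucket, Key: key,
		UploadId: mpu.UploadId,
		MultipartUpload: &s3.CompletedMultipartUpload{
			Parts: []*s3.CompletedPart{{ETag: part.ETag, PartNumber: aws.Int64(1)}},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("multipart upload completed")
}
```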
The official GCS docs confirm it: multipart upload is not supported out of the box. If you use -c=1 -p=<number-higher-than-filesize-in-mb>, the SDK will upload the file in one pass using the PutObject API call, which GCS does support. For larger files, though, the operation will not be as performant as a multipart upload.
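At the SDK level, the workaround roughly maps to the following configuration. This is a sketch assuming aws-sdk-go's s3manager uploader, with placeholder bucket and file names; setting the part size above the file size means the whole body fits in one part, so the uploader issues a single PutObject:

```go
package main

import (
	"log"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{
		Region:   aws.String("us-east-1"),
		Endpoint: aws.String("https://storage.googleapis.com"),
	}))

	f, err := os.Open("test_30GB_file") // placeholder file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	uploader := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
		u.Concurrency = 1                  // the effect of -c=1
		u.PartSize = 1000000 * 1024 * 1024 // -p=1000000 (MB): larger than the
		// file, so no split into parts and no multipart calls are made
	})

	if _, err := uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("test-gcs-bucket"), // placeholder bucket
		Key:    aws.String("test_30GB_file"),
		Body:   f,
	}); err != nil {
		log.Fatal(err)
	}
}
```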
Native s5cmd support for GCP could be massively beneficial for that cloud, as GCP's native gsutil tooling offers very poor download performance compared to s5cmd. gsutil upload performance is not great either, but it can't be improved by using s5cmd until multipart upload support is added. See this benchmarking deep dive I wrote on the subject: https://blog.doit-intl.com/optimize-data-transfer-between-compute-engine-and-cloud-storage-9a1ecd030e30
Is this still an existing issue?
GCS supports it now: https://stackoverflow.com/questions/27830432/google-cloud-storage-support-of-s3-multipart-upload
It seems that GCS now supports both the ListObjectsV2 and S3 multipart upload protocols, according to its changelog.
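For whoever tests this, a quick probe along these lines should answer it. Again just a sketch, not an official test: the bucket name is a placeholder and GCS HMAC interoperability credentials are assumed in the environment. If CreateMultipartUpload returns an UploadId instead of InvalidArgument, multipart is supported:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{
		Region:   aws.String("us-east-1"),
		Endpoint: aws.String("https://storage.googleapis.com"),
	}))
	svc := s3.New(sess)

	bucket, key := aws.String("test-gcs-bucket"), aws.String("mpu-probe")

	out, err := svc.CreateMultipartUpload(&s3.CreateMultipartUploadInput{
		Bucket: bucket, Key: key,
	})
	if err != nil {
		log.Fatalf("multipart still unsupported? %v", err)
	}
	fmt.Println("CreateMultipartUpload accepted, UploadId:", aws.StringValue(out.UploadId))

	// Clean up the dangling upload.
	_, _ = svc.AbortMultipartUpload(&s3.AbortMultipartUploadInput{
		Bucket: bucket, Key: key, UploadId: out.UploadId,
	})
}
```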
I can't test it now, but if someone could run something like the probe above and report back, that'd be very helpful. I'm closing the issue; please feel free to re-open if you see any problem with GCS multipart uploads.