
GCS multipart upload is not supported

s4mur4i opened this issue • 4 comments

Uploading files larger than the part size causes s5cmd to use multipart uploads, which GCS does not support. The current workaround is to set:

-c=1 -p=1000000

when copying, to prevent the upload from being split into parts.

The error message is also misleading:

InvalidArgument: Invalid argument. status code: 400, request id: , host id:

s4mur4i commented Aug 21 '20

Encountering this issue as well, specifically when uploading to a GCS bucket. The following fails with the misleading InvalidArgument error:

s5cmd --endpoint-url https://storage.googleapis.com cp test_30GB_file s3://test-gcs-bucket/

while the following succeeds:

s5cmd --endpoint-url https://storage.googleapis.com cp -c=1 -p=1000000 test_30GB_file s3://test-gcs-bucket/

doit-mattporter commented Aug 31 '20

Thanks for the report.

s5cmd treats S3 as a first-class citizen because it was designed to communicate with S3. Naturally, it uses the official AWS SDK to do so. s5cmd can access GCS through its S3-compatible (well, mostly compatible) API gateway.

If you upload a file to an object store, s5cmd splits the file and uploads the parts in parallel to achieve maximum throughput, following the S3 Multipart Upload API contract. The problem is that GCS doesn't support multipart upload in its S3-compatible API. The misleading error you're encountering is the result of that missing support.
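The arithmetic behind this behavior can be sketched as follows. This is an illustrative helper, not s5cmd's actual code; the part sizes are assumptions for the 30 GB file from the example above:

```go
package main

import "fmt"

// numParts computes how many parts a multipart upload would split a file
// into for a given part size (hypothetical helper, for illustration only).
func numParts(fileSize, partSize int64) int64 {
	return (fileSize + partSize - 1) / partSize // ceiling division
}

func main() {
	const mb int64 = 1024 * 1024
	fileSize := 30 * 1024 * mb // the 30 GB file from the example above

	// A modest part size yields many parts, triggering multipart
	// upload, which fails against GCS's S3-compatible API.
	fmt.Println(numParts(fileSize, 50*mb)) // 615 parts

	// -p=1000000 makes the part size exceed the file size, so there is
	// exactly one part and the SDK issues a single PutObject instead.
	fmt.Println(numParts(fileSize, 1000000*mb)) // 1 part
}
```

Any part size larger than the file collapses the transfer to one part, which is why the `-p=1000000` workaround avoids the error.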

An excerpt from the official GCS docs: [screenshot omitted]

So, multipart upload is not supported out of the box. If you use -c=1 -p=<number-higher-than-filesize-in-mb>, the SDK will upload the file in one pass using the PutObject API call, which GCS does support. For large files, though, the operation won't be as performant as a multipart upload.
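Choosing that -p value can be sketched as below; singlePutPartSizeMB is a hypothetical helper (not part of s5cmd), and the file name and bucket are the placeholders from the commands above:

```go
package main

import "fmt"

// singlePutPartSizeMB returns a -p value (in MB) strictly larger than the
// file, so the SDK falls back to a single PutObject call instead of a
// multipart upload. Hypothetical helper, for illustration only.
func singlePutPartSizeMB(fileSizeBytes int64) int64 {
	const mb = 1024 * 1024
	return fileSizeBytes/mb + 1
}

func main() {
	var thirtyGB int64 = 30 * 1024 * 1024 * 1024
	// For a 30 GB file, -p=30721 (or anything larger) avoids multipart upload.
	fmt.Printf("s5cmd --endpoint-url https://storage.googleapis.com cp -c=1 -p=%d test_30GB_file s3://test-gcs-bucket/\n",
		singlePutPartSizeMB(thirtyGB))
}
```

Any value at or above the file size in MB plus one works; -p=1000000 is simply a value large enough for any realistic file.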

igungor commented Sep 01 '20

Native s5cmd support for GCP could be massively beneficial for that cloud, as GCP's native gsutil tooling offers very poor download performance compared to s5cmd. gsutil upload performance is not great either; however, that can't be improved by using s5cmd until multipart upload support is added. See this benchmarking deep-dive I wrote on the subject: https://blog.doit-intl.com/optimize-data-transfer-between-compute-engine-and-cloud-storage-9a1ecd030e30

doit-mattporter commented Sep 13 '21

Is this still an existing issue?

rosibaj commented May 09 '22

GCS supports it now: https://stackoverflow.com/questions/27830432/google-cloud-storage-support-of-s3-multipart-upload

jordan-jack-schneider commented Mar 08 '23

It seems that GCS now supports both the ListObjectsV2 and S3 multipart upload protocols, according to its changelog.

I can't test it now but if someone could test and report, that'd be very helpful. I'm closing the issue. Please feel free to re-open if you see any problem with GCS multipart uploads.

igungor commented Jul 25 '23