
Incorrect MultiPartUpload Chunksize

Open alex-kharlamov opened this issue 1 year ago • 13 comments

Description:

I have encountered an issue with the MultiPartUpload functionality in the s3fs library where the chunk size used during upload appears incorrect.

Environment Information:

  • s3fs Version: 2023.9.1
  • Python Version: 3.9
  • Operating System: Ubuntu 22.04

Issue Details:

When using s3fs to upload large files to a bucket on Cloudflare R2, I noticed that R2 requires all non-trailing MultiPartUpload parts to have the same size, but s3fs does not always produce parts of a consistent size.

Expected Behavior:

I expect s3fs to use the same chunk size for a file with body (b"1" * 5 * 2**30) + b"kek" as for one with body b"1" * 5 * 2**30.

Steps to Reproduce:

Working example:

import io

import fsspec

with fsspec.open('s3://path', mode='wb') as f:
    buffer = io.BytesIO(b"1" * 5 * 2**30)  # exactly 5 GiB
    f.write(buffer.getvalue())

This code works perfectly.

But if we change the buffer size to:

with fsspec.open('s3://path', mode='wb') as f:
    buffer = io.BytesIO((b"1" * 5 * 2**30) + b"kek")  # 5 GiB plus 3 extra bytes
    f.write(buffer.getvalue())

This code gives the error: ClientError: An error occurred (InvalidPart) when calling the CompleteMultipartUpload operation: All non-trailing parts must have the same length.
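
For reference, the arithmetic behind the two payloads (how the parts actually get laid out depends on s3fs's flushing logic, but this shows that the second body no longer divides evenly into the default block size):

blocksize = 5 * 2**20        # 5,242,880 bytes, the s3fs/fsspec default
clean = 5 * 2**30            # divides into exactly 1024 equal parts
dirty = 5 * 2**30 + 3        # the extra b"kek" leaves a 3-byte remainder

print(clean // blocksize, clean % blocksize)  # 1024 0
print(dirty // blocksize, dirty % blocksize)  # 1024 3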

Many thanks for considering my request.

alex-kharlamov avatar Sep 18 '23 11:09 alex-kharlamov

That is most unfortunate - AWS S3 doesn't have this limitation so long as each chunk is big enough. An S3File could in theory be configured to always send a specific size (in _upload_chunk), and retain the remainder in the buffer, but it would be annoying to code and only get used by niche backends that need it (perhaps only R2).

martindurant avatar Sep 18 '23 13:09 martindurant

FWIW: Arrow made a fix for this: https://github.com/apache/arrow/issues/34363
FWIW2: according to this comment, it sounds unlikely that Cloudflare will change their implementation.

plaflamme avatar Sep 26 '23 23:09 plaflamme

it sounds unlikely that Cloudflare will change their implementation

It would be nice, but the comment doesn't promise anything in the near term.

martindurant avatar Sep 27 '23 13:09 martindurant

I had the same issue. When I added a debug line after https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L2269 to print out the size of data1, it printed a value smaller than self.blocksize. The sizes of the chunks also differ from one another, not just for the final one, e.g. 1486 and 1371, even though blocksize is the default 5242880. So the following if logic changes data1, which produces parts of different sizes.

Any idea why the read into data1 doesn't return blocksize bytes when it's not the last part?

chongzhang avatar Dec 06 '23 19:12 chongzhang

Any idea why the read to data1 doesn't return the blocksize

In general, read() is not required to return all the bytes you request, but I don't see why an io.BytesIO would ever return less.
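
For example, a quick check shows BytesIO only comes up short at end of data:

import io

buf = io.BytesIO(b"x" * 10)
print(len(buf.read(4)))    # 4 -- returns exactly what was asked for
print(len(buf.read(100)))  # 6 -- only short because the data ran out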

martindurant avatar Dec 06 '23 19:12 martindurant

Could it be related to the cache? I tried different cache options but still got the same error.

chongzhang avatar Dec 07 '23 18:12 chongzhang

Could it be related to the cache?

Which cache, what do you mean?

martindurant avatar Dec 07 '23 19:12 martindurant

https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L2020
https://github.com/fsspec/filesystem_spec/blob/master/fsspec/spec.py#L1568
TBH I am new to s3fs and not sure whether this is related to the differing chunk sizes.

chongzhang avatar Dec 07 '23 20:12 chongzhang

You have linked to two class definitions? Both have a default block size of 5 * 2**20, but _upload_chunk refers consistently to self.blocksize (i.e., a single value).
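
For what it's worth, the block size is a single per-file value that can be set explicitly when opening the file (bucket and key below are just placeholders):

import s3fs

fs = s3fs.S3FileSystem()  # credentials/endpoint come from the environment
with fs.open("my-bucket/big-object", "wb", block_size=16 * 2**20) as f:
    f.write(b"0" * (40 * 2**20))  # multipart upload driven by the 16 MiB block size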

martindurant avatar Dec 07 '23 20:12 martindurant

I'd be interested in helping to get this fixed as we're running into this in production with llama_index.

matthiaskern avatar Dec 14 '23 09:12 matthiaskern

@matthiaskern , all help is welcome

martindurant avatar Dec 14 '23 14:12 martindurant

Ran into this today as well fwiw.

hantusk avatar Jul 05 '24 08:07 hantusk

OK, so to summarise:

Currently, s3fs's file will flush its whole buffer, whatever the size, whenever a write() puts it over the block size. This is fine with AWS S3 (and minio and others), but R2 requires all non-trailing parts to have the same size.

The solution is to allow/require S3File to always push exactly one block size at a time, potentially needing multiple remote writes at flush time and leaving some buffer data over.

Since the remote call happens in only one place, this shouldn't be hard to code up. Does someone want to take it on? It should not split writes by default, i.e. for backends where variable part sizes are allowed.
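
Here is a rough, untested sketch of the idea (the names are illustrative, not the actual s3fs code):

import io

def take_full_blocks(buffer: io.BytesIO, blocksize: int, final: bool):
    """Yield exactly blocksize-sized parts; keep any remainder buffered.

    Only on the final flush is the leftover (< blocksize) emitted, as the
    trailing part, which is allowed to have a different length.
    """
    data = buffer.getvalue()
    offset = 0
    while len(data) - offset >= blocksize:
        yield data[offset:offset + blocksize]
        offset += blocksize
    remainder = b"" if final else data[offset:]
    if final and data[offset:]:
        yield data[offset:]
    # put the leftover back so the next write()/flush() continues from it
    buffer.seek(0)
    buffer.truncate()
    buffer.write(remainder)

_upload_chunk would then issue one upload_part call per yielded block, and only send the short trailing part when final=True.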

martindurant avatar Jul 05 '24 14:07 martindurant