gcsfs

Uploading files fails when using transactions

Open vringar opened this issue 3 years ago • 1 comments

What happened: When trying to upload a file with the following code and gcsfs version 2021.4.0, I get the following error:

gcsfs.utils.HttpError: Invalid request.  The number of bytes uploaded is required to be equal or greater than 262144, except for the final request (it's recommended to be the exact multiple of 262144).  The received request contained 21 bytes, which does not meet this requirement., 400

What you expected to happen:

I would expect the file to be written to the bucket and no exception to be thrown.

Minimal Complete Verifiable Example:

from gcsfs import GCSFileSystem


def main():
    file_system = GCSFileSystem(
        project="<our_project_name>", access="read_write"
    )
    path = "<our_test_bucket>/v15-test/test_file"

    file_system.start_transaction()

    with file_system.open(path, mode="wb") as f:
        f.write(b"This is a test string")

    file_system.end_transaction()


if __name__ == "__main__":
    main()

Anything else we need to know?:

The upload succeeds when not using transactions. However, we can't just remove the transaction in our production code, as we run preemptible instances and want to be really sure that we have uploaded the files.

Environment:

  • Dask version: None
  • Python version: Python 3.9.2
  • Operating System: Linux
  • Install method (conda, pip, source): conda

vringar avatar May 07 '21 11:05 vringar

Hm, I see that this is a design flaw - the "commit"/"abort" functionality assumes multi-part uploads, which have this size requirement. For smaller files, it would amount to keeping the written data in memory, which would be fine but require some work to fit into the current code.
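As a rough illustration of the "keep the written data in memory" idea for small files, here is a minimal sketch. It is not gcsfs code: `DeferredSmallFile`, its `upload` callable, and the `commit`/`discard` method names are hypothetical, chosen only to show how a write could be held back until the transaction ends.

```python
import io
from typing import Callable, Dict


class DeferredSmallFile:
    """Hypothetical writer that buffers bytes in memory until commit().

    Not part of gcsfs; `upload` stands in for whatever single-request
    upload a backend would use for small objects.
    """

    def __init__(self, path: str, upload: Callable[[str, bytes], None]):
        self.path = path
        self._upload = upload
        self._buffer = io.BytesIO()
        self._finished = False

    def write(self, data: bytes) -> int:
        if self._finished:
            raise ValueError("write after commit/discard")
        return self._buffer.write(data)

    def commit(self) -> None:
        # Send the whole buffer in one call; nothing leaves memory before this.
        if self._finished:
            raise ValueError("already committed or discarded")
        self._upload(self.path, self._buffer.getvalue())
        self._finished = True

    def discard(self) -> None:
        # Nothing was sent yet, so aborting only drops the local buffer.
        self._buffer = io.BytesIO()
        self._finished = True


# Usage, with a plain dict standing in for the remote bucket.
fake_bucket: Dict[str, bytes] = {}
f = DeferredSmallFile("bucket/test_file", fake_bucket.__setitem__)
f.write(b"This is a test string")
assert "bucket/test_file" not in fake_bucket  # nothing uploaded before commit
f.commit()
```

The trade-off is the one mentioned above: the data has to fit in memory, which is reasonable for files this small.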

Note that, since this is a small amount of data, the write is effectively atomic. You could write many files' data in one go using pipe, e.g.,

file_system.pipe(
    {path: b"This is a test string",
     path2: b"Another bytestring"}
)

This is also as atomic as is possible with a remote store.

martindurant avatar May 07 '21 17:05 martindurant

This is still an issue with the latest fsspec==2023.9.2 and gcsfs==2023.9.2. Is there a workaround, or are there plans to address this? Our script uses url_to_fs to keep everything behind the FileSystem interface, and so far gcsfs is the only backend that breaks with transactions.

A minimal repro for my use case:

from fsspec.core import url_to_fs

fs, url = url_to_fs('gs://<my_bucket>/<path>')
with fs.transaction:
  with fs.open(f'{url}/my_file', 'wb') as f:
    f.write(b'This is a test string')

I would prefer not to abandon transactions entirely, but GCS support is critical for the project. Please let me know if there are any updates on this.

jonb377 avatar Oct 05 '23 20:10 jonb377

Do you still have the same problem if you use

fs = fsspec.filesystem("gcs")

?

martindurant avatar Oct 05 '23 21:10 martindurant

Hey Martin, thanks for the quick response! Unfortunately yes, here is the script I ran:

import fsspec

fs = fsspec.filesystem('gcs')
with fs.transaction:
    with fs.open('<my_bucket>/foo', 'wb') as f:
        f.write(b'This is a test string')

I see the same error:

gcsfs.retry.HttpError: Invalid request. The number of bytes uploaded is required to be equal or greater than 262144, except for the final request (it's recommended to be the exact multiple of 262144). The received request contained 21 bytes, which does not meet this requirement., 400

jonb377 avatar Oct 05 '23 21:10 jonb377

OK, so it seems like this was never working... The way it is implemented is to start a multi-part upload, which can then be confirmed or cancelled, and it's this method that has the chunk size limit. I'm not sure why it would fail with just the one small chunk, though - will look.
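For reference, the size rule quoted in the error message can be written as a small check. This is only a sketch of the rule as the message words it, not of how GCS or gcsfs enforce it; `MIN_CHUNK` and `chunk_is_acceptable` are names made up for illustration.

```python
MIN_CHUNK = 262144  # 256 KiB, the figure quoted in the error message


def chunk_is_acceptable(num_bytes: int, is_final: bool) -> bool:
    """Apply the rule from the error message to one upload request.

    Non-final requests need at least MIN_CHUNK bytes; the final request
    is exempt. The message only recommends exact multiples of MIN_CHUNK,
    so that part is not checked here.
    """
    if is_final:
        return True
    return num_bytes >= MIN_CHUNK


# The 21-byte write from the report passes only when marked as final.
print(chunk_is_acceptable(21, is_final=False))  # False
print(chunk_is_acceptable(21, is_final=True))   # True
```

Read this way, a single small chunk should be accepted as long as the request is treated as the final one, which fits the puzzlement about why the one-chunk case fails.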

martindurant avatar Oct 06 '23 13:10 martindurant

Actually, it does work on the GCS emulator, so maybe the API changed.

martindurant avatar Oct 06 '23 13:10 martindurant

Please check with #586

martindurant avatar Oct 06 '23 17:10 martindurant

Wow thank you @martindurant, this works perfectly! Really appreciate the quick turnaround on this 😄

jonb377 avatar Oct 06 '23 19:10 jonb377

@martindurant one last question - what does the release timeline look like for the fix?

jonb377 avatar Oct 06 '23 20:10 jonb377

Within two weeks?

martindurant avatar Oct 06 '23 20:10 martindurant