gcsfs
Uploading files fails when using transactions
What happened: When trying to upload a file with the following code and gcsfs version 2021.4.0, I get the following error:
gcsfs.utils.HttpError: Invalid request. The number of bytes uploaded is required to be equal or greater than 262144, except for the final request (it's recommended to be the exact multiple of 262144). The received request contained 21 bytes, which does not meet this requirement., 400
What you expected to happen:
I would expect the file to be written to the bucket and no exception to be thrown.
Minimal Complete Verifiable Example:
```python
from gcsfs import GCSFileSystem


def main():
    file_system = GCSFileSystem(
        project="<our_project_name>", access="read_write"
    )
    path = "<our_test_bucket>/v15-test/test_file"
    file_system.start_transaction()
    with file_system.open(path, mode="wb") as f:
        f.write(b"This is a test string")
    file_system.end_transaction()


if __name__ == "__main__":
    main()
```
Anything else we need to know?:
The upload succeeds when not using transactions. However, we can't simply remove the transaction from our production code: we run on preemptible instances and want to be certain that the files have actually been uploaded.
Environment:
- Dask version: None
- Python version: Python 3.9.2
- Operating System: Linux
- Install method (conda, pip, source): conda
Hm, I see that this is a design flaw - the "commit"/"abort" functionality assumes multi-part uploads, which have this size requirement. For smaller files, it would amount to keeping the written data in memory, which would be fine but require some work to fit into the current code.
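For illustration, a minimal sketch of what buffering in memory could look like (this wrapper class is hypothetical, not gcsfs's implementation; `pipe_file` is the standard fsspec one-shot upload method):

```python
import io


class BufferedCommitFile:
    """Hypothetical sketch: keep writes in memory, upload once at commit."""

    def __init__(self, fs, path):
        self.fs = fs
        self.path = path
        self._buffer = io.BytesIO()

    def write(self, data):
        return self._buffer.write(data)

    def commit(self):
        # Single-request upload: no multi-part chunks, so the
        # 262144-byte minimum chunk size never applies.
        self.fs.pipe_file(self.path, self._buffer.getvalue())

    def discard(self):
        # Abort: just drop the buffered bytes; nothing was sent.
        self._buffer = io.BytesIO()
```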
Note that, since this is a small amount of data, the write is effectively atomic. You could write many files' data in one go using `pipe`, e.g.,

```python
file_system.pipe(
    {path: b"This is a test string",
     path2: b"Another bytestring"}
)
```

This is also as atomic as is possible with a remote store.
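Applied to the example from this issue, the same idea for a single object would look roughly like this (a sketch reusing the placeholders above; `pipe` also accepts a single path plus a bytes value):

```python
from gcsfs import GCSFileSystem

file_system = GCSFileSystem(
    project="<our_project_name>", access="read_write"
)

# One request per object: the write either lands completely or raises.
file_system.pipe(
    "<our_test_bucket>/v15-test/test_file", b"This is a test string"
)
```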
This is still an issue with the latest `fsspec==2023.9.2` and `gcsfs==2023.9.2`. Is there a workaround, or are there plans to address this? Our script uses `url_to_fs` to keep everything behind the FileSystem interface, and so far gcsfs is the only backend that breaks with transactions.
A minimal repro for my use case:
```python
from fsspec.core import url_to_fs

fs, url = url_to_fs('gs://<my_bucket>/<path>')
with fs.transaction:
    with fs.open(f'{url}/my_file', 'wb') as f:
        f.write(b'This is a test string')
```
I would prefer not to abandon transactions entirely, but GCS support is critical for the project. Please let me know if there are any updates on this.
Do you still have the same problem if you use `fs = fsspec.filesystem("gcs")`?
Hey Martin, thanks for the quick response! Unfortunately yes, here is the script I ran:
```python
import fsspec

fs = fsspec.filesystem('gcs')
with fs.transaction:
    with fs.open('<my_bucket>/foo', 'wb') as f:
        f.write(b'This is a test string')
```
I see the same error: gcsfs.retry.HttpError: Invalid request. The number of bytes uploaded is required to be equal or greater than 262144, except for the final request (it's recommended to be the exact multiple of 262144). The received request contained 21 bytes, which does not meet this requirement., 400
OK, so it seems like this was never working... The way it is implemented is to start a multi-part upload, which can then be confirmed or cancelled, and it's this method that has the chunk size limit. I'm not sure why it would fail with just the one small chunk, though - will look.
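For background, here is a rough sketch of the raw resumable-upload protocol (using requests directly, not gcsfs internals; the bucket, object name, and token are placeholders): every non-final chunk sent to the upload session must be a multiple of 262144 bytes, which is exactly the 400 error quoted above.

```python
import requests

CHUNK = 262144  # 256 KiB: required granularity for non-final chunks


def resumable_upload(bucket, name, data, token):
    # Assumes data is non-empty.
    # 1. Start a resumable upload session; GCS returns a session URI
    #    in the Location header.
    init = requests.post(
        f"https://storage.googleapis.com/upload/storage/v1/b/{bucket}/o",
        params={"uploadType": "resumable", "name": name},
        headers={"Authorization": f"Bearer {token}"},
    )
    session_uri = init.headers["Location"]

    # 2. Send the data. Every chunk except the last must be a multiple of
    #    262144 bytes; the final chunk declares the total size and may have
    #    any length. A short *non-final* chunk triggers the 400 error
    #    quoted in this issue.
    total = len(data)
    offset = 0
    while offset < total:
        chunk = data[offset:offset + CHUNK]
        end = offset + len(chunk) - 1
        is_final = end + 1 == total
        content_range = f"bytes {offset}-{end}/{total if is_final else '*'}"
        resp = requests.put(
            session_uri, data=chunk, headers={"Content-Range": content_range}
        )
        # Intermediate chunks return HTTP 308; the final one returns 200/201.
        offset = end + 1
    return resp
```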
Actually, it does work on the GCS emulator, so maybe the API changed.
Please check with #586
Wow thank you @martindurant, this works perfectly! Really appreciate the quick turnaround on this 😄
@martindurant one last question - what does the release timeline look like for the fix?
Within two weeks?