S3Map always attempts to create parent directories
Problem
I am using Zarr (https://zarr.readthedocs.io) with an S3 backend. Zarr has the concept of a store, which is essentially an S3Map:
s3 = s3fs.S3FileSystem(...)
store = s3fs.S3Map(root="s3://foo/bar/baz", s3=s3)
When I now attempt to set a key with store['.zmetadata'] = "value", I get the following traceback:
Traceback (most recent call last):
File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/s3fs/core.py", line 113, in _error_wrapper
return await func(*args, **kwargs)
File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/aiobotocore/client.py", line 378, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the CreateBucket operation: Access Denied
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/zarr/storage.py", line 1421, in __setitem__
self.map[key] = value
File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/fsspec/mapping.py", line 162, in __setitem__
self.fs.mkdirs(self.fs._parent(key), exist_ok=True)
File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/fsspec/spec.py", line 1444, in mkdirs
return self.makedirs(path, exist_ok=exist_ok)
File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/fsspec/asyn.py", line 121, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/fsspec/asyn.py", line 106, in sync
raise return_result
File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/fsspec/asyn.py", line 61, in _runner
result[0] = await coro
File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/s3fs/core.py", line 892, in _makedirs
await self._mkdir(path, create_parents=True)
File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/s3fs/core.py", line 877, in _mkdir
await self._call_s3("create_bucket", **params)
File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/s3fs/core.py", line 348, in _call_s3
return await _error_wrapper(
File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/s3fs/core.py", line 140, in _error_wrapper
raise err
PermissionError: Access Denied
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/zarr/storage.py", line 1424, in __setitem__
raise KeyError(key) from e
KeyError: '.zmetadata'
Disassembling the call stack
When looking at the calls, the relevant parts are:
fsspec/mapping.py

    def __setitem__(self, key, value):
        """Store value in key"""
        key = self._key_to_str(key)
        self.fs.mkdirs(self.fs._parent(key), exist_ok=True)
        self.fs.pipe_file(key, maybe_convert(value))
Here mkdirs is called on the filesystem, which eventually leads to
s3fs/core.py

    async def _makedirs(self, path, exist_ok=False):
        try:
            await self._mkdir(path, create_parents=True)
        # ....
The problem is that create_parents=True is hard-coded, with no way to pass a different value. As a result, s3fs always attempts to create the parent directory/bucket, which requires create_bucket permissions in S3 even when the bucket already exists.
I am not sure where to fix this; ideally there would be a way to check for existence before attempting to create the parent directories/objects. Perhaps it could be made conditional on fsspec's auto_mkdir parameter (https://filesystem-spec.readthedocs.io/en/latest/api.html), although this could create other problems. E.g. when setting "s3://foo/bar/baz", it might be that /foo exists but /foo/bar/baz needs to be created.
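To illustrate the call chain described above without an actual S3 connection, here is a minimal stand-in in plain Python. FakeS3FS and FakeMap are hypothetical mocks of the real s3fs/fsspec classes, kept just faithful enough to show that the mapping's __setitem__ unconditionally triggers a bucket-creation attempt before the write:

```python
class FakeS3FS:
    """Hypothetical stand-in mimicking s3fs's makedirs call chain."""

    def __init__(self):
        self.calls = []  # record of S3 API calls that would be issued

    def _parent(self, key):
        return key.rsplit("/", 1)[0]

    def mkdirs(self, path, exist_ok=False):
        # In real s3fs, _makedirs passes create_parents=True, which
        # bottoms out in a create_bucket call on the top-level bucket.
        self.calls.append(("create_bucket", path.split("/")[0]))

    def pipe_file(self, key, value):
        self.calls.append(("put_object", key))


class FakeMap:
    """Stand-in for fsspec.FSMap.__setitem__ (simplified)."""

    def __init__(self, root, fs):
        self.root, self.fs = root, fs

    def __setitem__(self, key, value):
        full = f"{self.root}/{key}"
        # This mkdirs call happens on every write, even when the
        # bucket already exists.
        self.fs.mkdirs(self.fs._parent(full), exist_ok=True)
        self.fs.pipe_file(full, value)


fs = FakeS3FS()
store = FakeMap("foo/bar/baz", fs)
store[".zmetadata"] = b"value"
print(fs.calls[0])  # → ('create_bucket', 'foo'), issued before the PUT
```

Without create_bucket permission, that first call is exactly where the AccessDenied above is raised.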
Used versions
zarr==2.15.0
s3fs==2023.6.0
fsspec==2023.6.0
The right thing to do is probably to try to create the bucket and ignore any error (for any reason), which may still lead to an error later when trying to write. Unfortunately, the permission model in AWS is complex: there are separate permissions for listing buckets, getting a bucket's info, and listing files within a bucket, and you don't necessarily have any of these just because you can write.
FWIW a similar issue was discussed over at https://github.com/rclone/rclone/issues/4449 and their fix was to add a flag which just skips the bucket existence check/creation attempt entirely, skipping straight to the PUT call.
> Perhaps it can be conditional on fsspec's auto_mkdir parameter (https://filesystem-spec.readthedocs.io/en/latest/api.html), although this could create other problems. E.g. when setting "s3://foo/bar/baz", it might be that /foo exists, but /foo/bar/baz needs to be created.
Folders don't actually exist in S3 (files are stored as bucket/<object key>, folders are a fuzzy abstraction for objects with / in their key) so there's no such thing as an empty directory below bucket level.
If the bucket doesn't exist, a later GET/PUT call will fail with a "bucket doesn't exist" response anyway; arguably you'd be better off just no-opping mkdirs(exist_ok=True) entirely and catching nonexistent bucket errors later, but that might be a bit confusing for users.
@neggles, I think you are exactly right.
I think it would be very reasonable to make the mkdirs call optional, especially given
- the "create" flag in the map's init
- the option the local filesystem (and maybe others where it matters) has to always create needed directories on write.
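The no-op behaviour suggested above could be sketched roughly as follows. This is a hypothetical illustration, not the actual s3fs code; the makedirs function and its bucket_exists hook are invented stand-ins:

```python
def makedirs(path, exist_ok=False, bucket_exists=lambda b: True):
    """Hypothetical sketch of makedirs semantics for an S3-like store.

    Returns the S3 call that would be made: "noop" or "create_bucket".
    """
    bucket, _, key = path.partition("/")
    if key:
        # Below bucket level: "directories" in S3 are only a naming
        # convention over object keys, so there is nothing to create.
        return "noop"
    if bucket_exists(bucket):
        if exist_ok:
            return "noop"
        raise FileExistsError(bucket)
    # Only a genuinely missing bucket would need CreateBucket,
    # and a later GET/PUT would fail anyway if it doesn't exist.
    return "create_bucket"


print(makedirs("foo/bar/baz", exist_ok=True))   # → noop, no CreateBucket
print(makedirs("newbucket",
               bucket_exists=lambda b: False))  # → create_bucket
```

Under this scheme the Zarr write path (mkdirs with exist_ok=True on an existing bucket) never touches the CreateBucket API, while creating a truly new bucket still works.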