S3Map always attempts to create parent directories

Open • Langdi opened this issue 2 years ago • 3 comments

Problem

I am using Zarr (https://zarr.readthedocs.io) with an S3 backend. Zarr has the concept of a store, which is basically an S3Map:

s3 = s3fs.S3FileSystem(...)
store = s3fs.S3Map(root="s3://foo/bar/baz", s3=s3)

When I then attempt to set a key with store['.zmetadata'] = "value", I get the following traceback:

Traceback (most recent call last):
  File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/s3fs/core.py", line 113, in _error_wrapper
    return await func(*args, **kwargs)
  File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/aiobotocore/client.py", line 378, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the CreateBucket operation: Access Denied

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/zarr/storage.py", line 1421, in __setitem__
    self.map[key] = value
  File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/fsspec/mapping.py", line 162, in __setitem__
    self.fs.mkdirs(self.fs._parent(key), exist_ok=True)
  File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/fsspec/spec.py", line 1444, in mkdirs
    return self.makedirs(path, exist_ok=exist_ok)
  File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/fsspec/asyn.py", line 121, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/fsspec/asyn.py", line 106, in sync
    raise return_result
  File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/fsspec/asyn.py", line 61, in _runner
    result[0] = await coro
  File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/s3fs/core.py", line 892, in _makedirs
    await self._mkdir(path, create_parents=True)
  File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/s3fs/core.py", line 877, in _mkdir
    await self._call_s3("create_bucket", **params)
  File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/s3fs/core.py", line 348, in _call_s3
    return await _error_wrapper(
  File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/s3fs/core.py", line 140, in _error_wrapper
    raise err
PermissionError: Access Denied

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/langdi/s3-upload-test/venv/lib/python3.10/site-packages/zarr/storage.py", line 1424, in __setitem__
    raise KeyError(key) from e
KeyError: '.zmetadata'

Disassembling the call stack

When looking at the calls, the relevant parts are:

fsspec/mapping.py

    def __setitem__(self, key, value):
        """Store value in key"""
        key = self._key_to_str(key)
        self.fs.mkdirs(self.fs._parent(key), exist_ok=True)
        self.fs.pipe_file(key, maybe_convert(value))

Here mkdirs is called on the filesystem, which eventually leads to

s3fs/core.py

    async def _makedirs(self, path, exist_ok=False):
        try:
            await self._mkdir(path, create_parents=True)
        # ....

The problem is that create_parents=True is passed with no way to set the parameter to anything else. This makes s3fs always attempt to create the parent directory/bucket, requiring CreateBucket permission in S3, even if the bucket already exists.

I am not sure where to fix this; ideally there would be a way to check for existence before attempting to create the parent directories/objects. Perhaps it can be conditional on fsspec's auto_mkdir parameter (https://filesystem-spec.readthedocs.io/en/latest/api.html), although this could create other problems. E.g. when setting `s3://foo/bar/baz`, it might be that `/foo` exists, but `/foo/bar/baz` needs to be created.
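
For illustration, the existence check suggested here could look roughly like the sketch below inside s3fs's _makedirs. This is not the library's actual code, and note that the check itself requires HeadBucket/ListBucket-style rights, so it would not help callers who can only write objects:

    # Sketch only, not s3fs's actual code: skip bucket creation when the
    # bucket is already visible to the caller.
    async def _makedirs(self, path, exist_ok=False):
        bucket, key, _ = self.split_path(path)
        if await self._exists(bucket):
            if not key and not exist_ok:
                raise FileExistsError(bucket)
            return
        # Bucket not visible: fall back to current behaviour and try to create it.
        await self._mkdir(path, create_parents=True)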

Used versions

zarr==2.15.0
s3fs==2023.6.0
fsspec==2023.6.0

Langdi • Aug 04 '23 09:08

The right thing to do is probably to try to create the bucket and ignore it if that errors (for any reason), which may just lead to an error later when trying to write. Unfortunately, the permission model in AWS is complex: there are separate permissions for listing buckets, getting a bucket's info, and listing files within a bucket; you don't necessarily have any of these just because you can write.
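
As a sketch of that approach (illustrative only, not the shipped implementation), _makedirs could swallow the creation failure and let the subsequent write surface any real problem:

    async def _makedirs(self, path, exist_ok=False):
        try:
            await self._mkdir(path, create_parents=True)
        except FileExistsError:
            if not exist_ok:
                raise
        except Exception:
            # Ignore any other failure (e.g. AccessDenied on CreateBucket); if the
            # bucket really is missing, the subsequent write will fail instead.
            pass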

martindurant • Aug 04 '23 12:08

FWIW a similar issue was discussed over at https://github.com/rclone/rclone/issues/4449, and their fix was to add a flag which skips the bucket existence check/creation attempt entirely and goes straight to the PUT call.

> Perhaps it can be conditional on fsspec's auto_mkdir parameter (https://filesystem-spec.readthedocs.io/en/latest/api.html), although this could create other problems. E.g. when setting `s3://foo/bar/baz`, it might be that `/foo` exists, but `/foo/bar/baz` needs to be created.

Folders don't actually exist in S3 (files are stored as bucket/<object key>, folders are a fuzzy abstraction for objects with / in their key) so there's no such thing as an empty directory below bucket level.
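
For example, writing an object with a nested key needs no mkdir at all (sketch only; the bucket name is a placeholder and is assumed to already exist with PutObject rights):

    import s3fs

    fs = s3fs.S3FileSystem()
    # "foo/bar/baz/" never has to be created; it is implied by the object key.
    fs.pipe_file("existing-bucket/foo/bar/baz/.zmetadata", b"{}")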

If the bucket doesn't exist, a later GET/PUT call will fail with a "bucket doesn't exist" response anyway; arguably you'd be better off just no-opping mkdirs(exist_ok=True) entirely and catching nonexistent bucket errors later, but that might be a bit confusing for users.
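
In that spirit, a possible user-side workaround for now is a subclass whose _makedirs is a no-op (a hypothetical sketch, assuming the bucket already exists and the credentials can put objects into it):

    import s3fs

    class NoMkdirS3FileSystem(s3fs.S3FileSystem):
        async def _makedirs(self, path, exist_ok=False):
            # Skip the bucket check/creation entirely; the PUT creates the key directly.
            pass

    fs = NoMkdirS3FileSystem()
    store = fs.get_mapper("s3://foo/bar/baz")  # bucket/path are placeholders
    store[".zmetadata"] = b"{}"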

neggles • Aug 14 '23 03:08

@neggles, I think you are exactly right.

I think it would be very reasonable to have the mkdirs call be optional, especially given

  • the "create" flag in the map's init
  • the option the local filesystem (and maybe others where it matters) has to always create needed directories on write (see the sketch below).
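
For reference, the two knobs mentioned above look like this in use (illustrative only, with placeholder paths):

    import fsspec
    from fsspec.implementations.local import LocalFileSystem

    # 1. The mapper's "create" flag: whether to create the root (bucket) up front.
    m = fsspec.get_mapper("s3://foo/bar/baz", check=False, create=False)

    # 2. LocalFileSystem's auto_mkdir: create missing parent directories on write.
    lfs = LocalFileSystem(auto_mkdir=True)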

martindurant • Aug 14 '23 14:08