filesystem_spec icon indicating copy to clipboard operation
filesystem_spec copied to clipboard

fsspec fails trying to create a bucket when writing to S3 when folder/prefix doesn't exists

Open juarezr opened this issue 4 years ago • 9 comments

fsspec fails trying to create a bucket when writing to S3 when folder/prefix doesn't exists

Problem

By default when writing with fsspec to remote filesystems fsspec sets the flag auto_mkdir=True for creating the path hirerachy.

I found that #212 deprecates auto_mkdir=True for LocalFileSystem.

However this behaviour causes unexpected throubles and should be disabled, IMHO.

For instance, when writing to S3 using fsspec and s3fs if the folder doesn't exist yet, it tries to create the bucket and fails. If passing auto_mkdir=False it works as expected.

Notice that S3 automatically creates folders when writing files if they are missing.

Test Case

The following code reproduces the problem:

import fsspec

s3_path = 's3://my-bucket/path/to/folder/that/not/exists/yet/test.txt'

with fsspec.open(s3_path, mode='wb', compression='infer', **self.kwargs) as fs:
    f.write('Some text.\n')
    f.write('More text.\n') # data is flushed and file closed

This raises a exception like the following:

Traceback (most recent call last):
  File "/venv/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 979, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/venv/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
    return_value = self.execute_callable()
  File "/venv/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/dags/examples/s3sfs_debug.py", line 146, in _test_fsjuarz_auto_mkdir_true
    with fsspec.open(s3_path, mode='w', compression='infer', auto_mkdir=True) as fs:
  File "/venv/lib/python3.7/site-packages/fsspec/core.py", line 399, in open
    **kwargs
  File "/venv/lib/python3.7/site-packages/fsspec/core.py", line 254, in open_files
    [fs.makedirs(parent, exist_ok=True) for parent in parents]
  File "/venv/lib/python3.7/site-packages/fsspec/core.py", line 254, in <listcomp>
    [fs.makedirs(parent, exist_ok=True) for parent in parents]
  File "/venv/lib/python3.7/site-packages/s3fs/core.py", line 460, in makedirs
    self.mkdir(path, create_parents=True)
  File "/venv/lib/python3.7/site-packages/fsspec/asyn.py", line 100, in wrapper
    return maybe_sync(func, self, *args, **kwargs)
  File "/venv/lib/python3.7/site-packages/fsspec/asyn.py", line 80, in maybe_sync
    return sync(loop, func, *args, **kwargs)
  File "/venv/lib/python3.7/site-packages/fsspec/asyn.py", line 51, in sync
    raise exc.with_traceback(tb)
  File "/venv/lib/python3.7/site-packages/fsspec/asyn.py", line 35, in f
    result[0] = await future
  File "/venv/lib/python3.7/site-packages/s3fs/core.py", line 450, in _mkdir
    raise translate_boto_error(e) from e
FileExistsError: The requested bucket name is not available. The bucket namespace is shared by all users of the system. Please select a different name and try again.

This happens because of the parameter auto_mkdir=True on this code:


def open_files(
    urlpath,
    mode="rb",
    compression=None,
    encoding="utf8",
    errors=None,
    name_function=None,
    num=1,
    protocol=None,
    newline=None,
    auto_mkdir=True,
    **kwargs
):

juarezr avatar Sep 03 '20 22:09 juarezr

Sorry, what should the correct behaviour be here?

martindurant avatar Sep 04 '20 13:09 martindurant

(and is this issue specific to s3fs?)

martindurant avatar Sep 04 '20 13:09 martindurant

Hi,

Sorry, what should the correct behaviour be here?

The safer behavior IMHO may be auto_mkdir=False in:

def open_files(
    urlpath,
    mode="rb",
    compression=None,
    encoding="utf8",
    errors=None,
    name_function=None,
    num=1,
    protocol=None,
    newline=None,
    auto_mkdir=False, # <-- DEFAULT VALUE CHANGED HERE!
    **kwargs
):

Some reasons for changing the default behavior from True to False/None:

  1. It's not expected by the programmer that the default behavior is automatically creating folders.
    1. fsspec could be doing things that the programmer didn't intended.
    2. It's not clear neither expected that a call to fsspec.open(my_path, mode='wb') will create folders.
    3. In other filesystem implementations auto creating folders is opt-in. E.g: mkdir -p /path/to/new/folder
  2. It can trigger unexpected behavior that can be different according to the package and remote filesystem implementation.
    1. In this case the behavior on S3 can be different that the behavior on smb, for instance.
    2. It will be hard to test and assure all packages/filesystems will have a consistent behavior.
  3. It can trigger nonexistent problems compared to not creating folders by default.
    1. s3fs tries to create the bucket as triggered in this case.
    2. System user executing the may not have permissions for creating folders in some part of the path.
    3. A remote filesystem may not have a concept of folders but can have a concept of file/bytes.
  4. It's hard disable the behavior by passing auto_mkdir=False in every call to fsspec.
    1. While the change is simple, remembering and managing every call can be a burden.
    2. It will be harder if your code is calling other third party package that calls fsspec but you can change the call in this package.

Also some considerations about the change:

  • Some people could argue for not changing the behavior:
    • This can trigger some incompatibility in existing code.
    • It's neat and easier in some cases to have this behavior enabled.
  • Maybe fsspeccould have a global flag accessible to the program/thread for controlling this behavior in all fsspec.
    • This can be used when changing the behavior as an easier way or keeping compatibility.
    • It will be easier setting this flag than passing auto_mkdir=False in every call to fsspec.
    • In some cases the call to fsspec can be buried inside a third party package.

(and is this issue specific to s3fs?)

I have found the problem while writing into S3a file opened with fsspec and s3fs.

But it's possible that other combinations of fsspec + remote package could trigger other failures and unexpected filesystem changes.

In this case, S3 has a peculiarity that triggered the problem:

  • S3 has a different concept/implementation/consistency when compared with common hierarchical filesystems.
  • A file in S3 is represented by bucket/prefix like in a key/value database.
  • In this case prefix is the complete path for the file including folder.
  • For creating a file you can just supply the full path to file.
  • It's not required that you create the folders before writing a file.

juarezr avatar Sep 04 '20 15:09 juarezr

Happy to hear more opinions here. Note that it's difficult to come up with the right default behaviour that matches most users' expectations.

Maybe fsspeccould have a global flag accessible to the program/thread for controlling this behavior in all fsspec.

This would be easy to implement from a andvironment variable or blag in the spec module. It could also be a per-implementation class attribute.

It's not required that you create the folders before writing a file.

True for the prefix, but you must create the bucket - or the bucket must already exist (and there is no way to test whether the bucket exists except trying to do something with it)

martindurant avatar Sep 08 '20 13:09 martindurant

Happy to hear more opinions here. Note that it's difficult to come up with the right default behaviour that matches most users' expectations.

For sure. Always there is somebody that will be unsatisfied no matter what is accomplished.

Maybe fsspeccould have a global flag accessible to the program/thread for controlling this behavior in all fsspec.

This would be easy to implement from a andvironment variable or blag in the spec module.

It's a possible workaround.

But for this to be effective it also should be accessible for people that are using fsspec indirectly through other packages that the user can't change easily.

It could also be a per-implementation class attribute.

I agree. Certainly auto_mkdir=False makes more sense for some filesystem implementation that others.

Also it's possible to evaluate other restrictions for narrowing when this defaults applies like:

  • Currently the default auto_mkdir=True applies only when the open mode is writing.
  • There must be other criteria that can be analyzed for automatically creating folder like:
    • local/remote filesystem
    • possibility of failures and developer/user headaches...
    • how to deal with permissions in the whole path to file.
    • other filesystem characteristics.
    • etc...
  • Maybe narrowing the conditions for auto_mkdir=True will make the user experience seamless.

It's not required that you create the folders before writing a file.

True for the prefix, but you must create the bucket - or the bucket must already exist (and there is no way to test whether the bucket exists except trying to do something with it)

Yes. It' true.

But in my experience, I noticed the following:

  • Buckets are used more like a remote drive that a remote folder.
  • People tend to create buckets manually before rather than create by an application or script.
  • Creating the bucket like in the case of using a open call to fsspec is pretty much a exception that a common behavior.

Maybe this could be one more criteria for narrowing auto_mkdir=True:

  • The open call should not create a remote resource like: bucket, share, drive etc...
  • But folders are allowed to be automatically be created...
  • I don't know if is easy to distinguish between folder and a "root" resource.

juarezr avatar Sep 10 '20 14:09 juarezr

Sorry for letting this hang following your detailed write-up - but I'm still not sure what the next step ought to be, particularly following your final points, in which we don't necessarily know whether an entity is local or remote, and whether a resource is folder-like or higher.

The best tangential solution I can think of is a configuration system for setting global and per-instance behaviour and/or a mechanism to define the default instance for a given protocol.

martindurant avatar Sep 30 '20 15:09 martindurant

The best tangential solution I can think of is a configuration system for setting global and per-instance behaviour and/or a mechanism to define the default instance for a given protocol.

Looks good to me.

Maybe with a mechanism like this will be possible:

  • When directory creation is known to be safe, like in local filesystem, it can default to auto_mkdir=True.
  • Avoid creating root resources, like share names in SMB on URIs like smb://myserver.com/share/folder/file.ext. Maybe using adittional info like auto_mkdir=1 or auto_mkdir_depth=1;
  • When the folder is automatically created it can default to auto_mkdir=False to disable directory creation on filesystems like S3.
  • Also disallow directory creation when it does not make sense for the filesystem.

Also we also would need a transition strategy like:

  • For Built-in Implementations it's possible to analyse and select the best default value to auto_mkdir.
  • For External implementations it would be wise to:
    1. Start with auto_mkdir=True for avoiding breakage.
    2. Tell developers to override the default in their code.
    3. Wait some time for developers to catch up.
    4. Change the default to auto_mkdir=False for new implementations.

juarezr avatar Oct 06 '20 15:10 juarezr

https://github.com/intake/filesystem_spec/pull/453 allows you to pass auto_mkdir as a config option to any file-system implementation. This doesn't yet solve your issue here, as most implementations don't have this as an explicit kwarg option, and the value is coming instead from the default on fsspec.open_files. It's a start, though.

martindurant avatar Oct 15 '20 20:10 martindurant

Hi folks,

I'm arriving here after "accidentally" creating a whole new s3 bucket using pandas' to_csv() method and having a typo in the bucket name of my s3 uri. (see my issue https://github.com/pandas-dev/pandas/issues/50135)

Frankly, it seems pretty wild to me that auto_mkdir defaults to True. I'm sure I may be blind to some use cases of fsspec where this is reasonable, but I'd say it definitely defies all my expectations -- especially given the standard libraries' behavior for writing to local disk (here and here), and than everywhere else I can think of this auto-creation behavior is always opt-in.

Points 1 through 4 brought up by @juarezr in https://github.com/fsspec/filesystem_spec/issues/397#issuecomment-687232278 seem like a robust rationale for changing the behavior. As far breaking backward compatibility, maybe this warrants a warning release first? @martindurant any thoughts?

spolloni avatar Dec 09 '22 15:12 spolloni

it seems pretty wild to me that auto_mkdir defaults to True

This is largely down to the how the library is used in dask (and other libraries like zarr) that don't want to concern themselves with creating directories, but do expect that the location specified is already writable. Consider the case that a worker process wants to write, but it has no knowledge of how/if the output has been prepared by the client beforehand.

martindurant avatar Jan 02 '23 20:01 martindurant