filesystem_spec
filesystem_spec copied to clipboard
fsspec fails trying to create a bucket when writing to S3 when folder/prefix doesn't exists
fsspec fails trying to create a bucket when writing to S3 when folder/prefix doesn't exists
Problem
By default when writing with fsspec to remote filesystems fsspec
sets the flag auto_mkdir=True
for creating the path hirerachy.
I found that #212 deprecates auto_mkdir=True
for LocalFileSystem
.
However this behaviour causes unexpected throubles and should be disabled, IMHO.
For instance, when writing to S3 using fsspec
and s3fs
if the folder doesn't exist yet, it tries to create the bucket and fails. If passing auto_mkdir=False
it works as expected.
Notice that S3 automatically creates folders when writing files if they are missing.
Test Case
The following code reproduces the problem:
import fsspec
s3_path = 's3://my-bucket/path/to/folder/that/not/exists/yet/test.txt'
with fsspec.open(s3_path, mode='wb', compression='infer', **self.kwargs) as fs:
f.write('Some text.\n')
f.write('More text.\n') # data is flushed and file closed
This raises a exception like the following:
Traceback (most recent call last):
File "/venv/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 979, in _run_raw_task
result = task_copy.execute(context=context)
File "/venv/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
return_value = self.execute_callable()
File "/venv/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/usr/local/airflow/dags/examples/s3sfs_debug.py", line 146, in _test_fsjuarz_auto_mkdir_true
with fsspec.open(s3_path, mode='w', compression='infer', auto_mkdir=True) as fs:
File "/venv/lib/python3.7/site-packages/fsspec/core.py", line 399, in open
**kwargs
File "/venv/lib/python3.7/site-packages/fsspec/core.py", line 254, in open_files
[fs.makedirs(parent, exist_ok=True) for parent in parents]
File "/venv/lib/python3.7/site-packages/fsspec/core.py", line 254, in <listcomp>
[fs.makedirs(parent, exist_ok=True) for parent in parents]
File "/venv/lib/python3.7/site-packages/s3fs/core.py", line 460, in makedirs
self.mkdir(path, create_parents=True)
File "/venv/lib/python3.7/site-packages/fsspec/asyn.py", line 100, in wrapper
return maybe_sync(func, self, *args, **kwargs)
File "/venv/lib/python3.7/site-packages/fsspec/asyn.py", line 80, in maybe_sync
return sync(loop, func, *args, **kwargs)
File "/venv/lib/python3.7/site-packages/fsspec/asyn.py", line 51, in sync
raise exc.with_traceback(tb)
File "/venv/lib/python3.7/site-packages/fsspec/asyn.py", line 35, in f
result[0] = await future
File "/venv/lib/python3.7/site-packages/s3fs/core.py", line 450, in _mkdir
raise translate_boto_error(e) from e
FileExistsError: The requested bucket name is not available. The bucket namespace is shared by all users of the system. Please select a different name and try again.
This happens because of the parameter auto_mkdir=True
on this code:
def open_files(
urlpath,
mode="rb",
compression=None,
encoding="utf8",
errors=None,
name_function=None,
num=1,
protocol=None,
newline=None,
auto_mkdir=True,
**kwargs
):
Sorry, what should the correct behaviour be here?
(and is this issue specific to s3fs?)
Hi,
Sorry, what should the correct behaviour be here?
The safer behavior IMHO may be auto_mkdir=False
in:
def open_files(
urlpath,
mode="rb",
compression=None,
encoding="utf8",
errors=None,
name_function=None,
num=1,
protocol=None,
newline=None,
auto_mkdir=False, # <-- DEFAULT VALUE CHANGED HERE!
**kwargs
):
Some reasons for changing the default behavior from True
to False
/None
:
- It's not expected by the programmer that the default behavior is automatically creating folders.
-
fsspec
could be doing things that the programmer didn't intended. - It's not clear neither expected that a call to
fsspec.open(my_path, mode='wb')
will create folders. - In other filesystem implementations auto creating folders is opt-in. E.g:
mkdir -p /path/to/new/folder
-
- It can trigger unexpected behavior that can be different according to the
package
and remote filesystem implementation.- In this case the behavior on
S3
can be different that the behavior onsmb
, for instance. - It will be hard to test and assure all packages/filesystems will have a consistent behavior.
- In this case the behavior on
- It can trigger nonexistent problems compared to not creating folders by default.
-
s3fs
tries to create the bucket as triggered in this case. - System user executing the may not have permissions for creating folders in some part of the path.
- A remote filesystem may not have a concept of folders but can have a concept of file/bytes.
-
- It's hard disable the behavior by passing
auto_mkdir=False
in every call tofsspec
.- While the change is simple, remembering and managing every call can be a burden.
- It will be harder if your code is calling other third party
package
that callsfsspec
but you can change the call in this package.
Also some considerations about the change:
- Some people could argue for not changing the behavior:
- This can trigger some incompatibility in existing code.
- It's neat and easier in some cases to have this behavior enabled.
- Maybe
fsspec
could have a global flag accessible to the program/thread for controlling this behavior in allfsspec
.- This can be used when changing the behavior as an easier way or keeping compatibility.
- It will be easier setting this flag than passing
auto_mkdir=False
in every call tofsspec
. - In some cases the call to fsspec can be buried inside a third party package.
(and is this issue specific to s3fs?)
I have found the problem while writing into S3
a file opened with fsspec
and s3fs
.
But it's possible that other combinations of fsspec
+ remote package could trigger other failures and unexpected filesystem changes.
In this case, S3 has a peculiarity that triggered the problem:
- S3 has a different concept/implementation/consistency when compared with common hierarchical filesystems.
- A file in S3 is represented by bucket/prefix like in a key/value database.
- In this case prefix is the complete path for the file including folder.
- For creating a file you can just supply the full path to file.
- It's not required that you create the folders before writing a file.
Happy to hear more opinions here. Note that it's difficult to come up with the right default behaviour that matches most users' expectations.
Maybe fsspeccould have a global flag accessible to the program/thread for controlling this behavior in all fsspec.
This would be easy to implement from a andvironment variable or blag in the spec module. It could also be a per-implementation class attribute.
It's not required that you create the folders before writing a file.
True for the prefix, but you must create the bucket - or the bucket must already exist (and there is no way to test whether the bucket exists except trying to do something with it)
Happy to hear more opinions here. Note that it's difficult to come up with the right default behaviour that matches most users' expectations.
For sure. Always there is somebody that will be unsatisfied no matter what is accomplished.
Maybe fsspeccould have a global flag accessible to the program/thread for controlling this behavior in all fsspec.
This would be easy to implement from a andvironment variable or blag in the spec module.
It's a possible workaround.
But for this to be effective it also should be accessible for people that are using fsspec
indirectly through other packages that the user can't change easily.
It could also be a per-implementation class attribute.
I agree. Certainly auto_mkdir=False
makes more sense for some filesystem implementation that others.
Also it's possible to evaluate other restrictions for narrowing when this defaults applies like:
- Currently the default
auto_mkdir=True
applies only when theopen mode
is writing. - There must be other criteria that can be analyzed for automatically creating folder like:
- local/remote filesystem
- possibility of failures and developer/user headaches...
- how to deal with permissions in the whole path to file.
- other filesystem characteristics.
- etc...
- Maybe narrowing the conditions for
auto_mkdir=True
will make the user experience seamless.
It's not required that you create the folders before writing a file.
True for the prefix, but you must create the bucket - or the bucket must already exist (and there is no way to test whether the bucket exists except trying to do something with it)
Yes. It' true.
But in my experience, I noticed the following:
- Buckets are used more like a remote drive that a remote folder.
- People tend to create buckets manually before rather than create by an application or script.
- Creating the bucket like in the case of using a
open
call tofsspec
is pretty much a exception that a common behavior.
Maybe this could be one more criteria for narrowing auto_mkdir=True
:
- The
open
call should not create a remote resource like: bucket, share, drive etc... - But folders are allowed to be automatically be created...
- I don't know if is easy to distinguish between folder and a "root" resource.
Sorry for letting this hang following your detailed write-up - but I'm still not sure what the next step ought to be, particularly following your final points, in which we don't necessarily know whether an entity is local or remote, and whether a resource is folder-like or higher.
The best tangential solution I can think of is a configuration system for setting global and per-instance behaviour and/or a mechanism to define the default instance for a given protocol.
The best tangential solution I can think of is a configuration system for setting global and per-instance behaviour and/or a mechanism to define the default instance for a given protocol.
Looks good to me.
Maybe with a mechanism like this will be possible:
- When directory creation is known to be safe, like in local filesystem, it can default to
auto_mkdir=True
. - Avoid creating root resources, like share names in
SMB
on URIs likesmb://myserver.com/share/folder/file.ext
. Maybe using adittional info likeauto_mkdir=1
orauto_mkdir_depth=1
; - When the folder is automatically created it can default to
auto_mkdir=False
to disable directory creation on filesystems likeS3
. - Also disallow directory creation when it does not make sense for the filesystem.
Also we also would need a transition strategy like:
- For Built-in Implementations it's possible to analyse and select the best default value to
auto_mkdir
. - For External implementations it would be wise to:
- Start with
auto_mkdir=True
for avoiding breakage. - Tell developers to override the default in their code.
- Wait some time for developers to catch up.
- Change the default to
auto_mkdir=False
for new implementations.
- Start with
https://github.com/intake/filesystem_spec/pull/453 allows you to pass auto_mkdir as a config option to any file-system implementation. This doesn't yet solve your issue here, as most implementations don't have this as an explicit kwarg option, and the value is coming instead from the default on fsspec.open_files
. It's a start, though.
Hi folks,
I'm arriving here after "accidentally" creating a whole new s3 bucket using pandas' to_csv()
method and having a typo in the bucket name of my s3 uri. (see my issue https://github.com/pandas-dev/pandas/issues/50135)
Frankly, it seems pretty wild to me that auto_mkdir
defaults to True
. I'm sure I may be blind to some use cases of fsspec
where this is reasonable, but I'd say it definitely defies all my expectations -- especially given the standard libraries' behavior for writing to local disk (here and here), and than everywhere else I can think of this auto-creation behavior is always opt-in.
Points 1 through 4 brought up by @juarezr in https://github.com/fsspec/filesystem_spec/issues/397#issuecomment-687232278 seem like a robust rationale for changing the behavior. As far breaking backward compatibility, maybe this warrants a warning release first? @martindurant any thoughts?
it seems pretty wild to me that auto_mkdir defaults to True
This is largely down to the how the library is used in dask (and other libraries like zarr) that don't want to concern themselves with creating directories, but do expect that the location specified is already writable. Consider the case that a worker process wants to write, but it has no knowledge of how/if the output has been prepared by the client beforehand.