s3fs
`mkdir` silently fails to create parent dir after version 0.2.2
After version 0.2.2, the logic of `mkdir` changed, and it now only has an effect when creating a top-level dir (a bucket, e.g. `mkdir('bucket')`).
Creating anything other than a bucket leads to silent failure (the command returns without an exception, but nothing has been done).
Here is an illustration of the issue:
$ python3
Python 3.6.8 (default, Oct 7 2019, 12:59:55)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import s3fs
>>> s3fs.__version__
'0.3.5'
>>> fs = s3fs.S3FileSystem()
>>> # Creating a top level dir is fine
... fs.mkdir('my-random-test-bucket-dkaf')
>>> fs.exists('my-random-test-bucket-dkaf')
True
>>> fs.rmdir('my-random-test-bucket-dkaf')
>>> # Creating a 2nd level folder does nothing
... fs.mkdir('my-random-test-bucket-dkaf/abc/')
>>> fs.exists('my-random-test-bucket-dkaf/abc')
False
>>> fs.exists('my-random-test-bucket-dkaf')
False
In 0.2.2, by contrast, an exception is thrown if the bucket doesn't exist yet:
$ python3
Python 3.6.8 (default, Oct 7 2019, 12:59:55)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import s3fs
>>> s3fs.__version__
'0.2.2'
>>> fs = s3fs.S3FileSystem()
>>> fs.mkdir('my-random-test-bucket-dkaf/abc/')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 865, in mkdir
self.touch(path, acl=acl, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 1044, in touch
Bucket=bucket, Key=key, ACL=acl)
File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 195, in _call_s3
return method(**additional_kwargs)
File "/home/jackwindows/.local/lib/python3.6/site-packages/botocore/client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/jackwindows/.local/lib/python3.6/site-packages/botocore/client.py", line 661, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.errorfactory.NoSuchBucket: An error occurred (NoSuchBucket) when calling the PutObject operation: The specified bucket does not exist
>>> fs.mkdir('my-random-test-bucket-dkaf')
>>> fs.mkdir('my-random-test-bucket-dkaf/abc/')
>>> fs.exists('my-random-test-bucket-dkaf')
True
>>> fs.exists('my-random-test-bucket-dkaf/abc/')
True
I think we should be able to use a `create_parents` param (according to fsspec) to automatically create the parent dir if necessary.
It may be reasonable to check if the bucket exists or is writable (although the latter is not easy), but the no-op is exactly the right thing to do, since S3 does not support folders at all. Folders only exist when you create keys with the appropriate prefix. In the console you can simulate folders, but it actually creates placeholder files to do this.
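To illustrate the point that folders only appear once keys share a prefix, here is a minimal sketch of a flat key store. It is a toy model written for this thread, not s3fs or S3 code; `keys`, `put` and `list_prefix` are hypothetical names.

```python
# Toy model of a flat object store: keys are plain strings, nothing else exists.
keys = set()


def put(key):
    """Store an object under its full key; no parent 'folder' is needed."""
    keys.add(key)


def list_prefix(prefix):
    """Return the immediate children below a prefix, derived from the keys alone."""
    prefix = prefix.rstrip("/") + "/"
    children = set()
    for key in keys:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            # Keep only the first path segment below the prefix
            children.add(prefix + rest.split("/", 1)[0])
    return sorted(children)


put("bucket/path1/path2/file")
print(list_prefix("bucket"))        # ['bucket/path1']
print(list_prefix("bucket/path1"))  # ['bucket/path1/path2']
```

The "directories" `path1` and `path2` show up in listings only because a key happens to contain them; nothing was created for them.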
@martindurant I believe version 0.2.2 does create a placeholder when you `mkdir` a folder.
Also, the s3fs documentation for `mkdir` (link) says "Make new bucket or empty key".
So I definitely wouldn't expect a no-op; the `fs.exists` call also returns a confusing result if a no-op is used here.
I would say that the doc should be updated rather than the other way around. This issue has gone around a few cycles of exactly what to do here, and in general I would say that you should simply not attempt to manipulate directories on S3, since you can write "bucket/path1/path2/file" without any need for the intermediate directories to exist.
That would be a really weird choice for the `s3fs` project.
There is clearly a gap between the `s3fs` project's version vs. the general concept of an fs interface.
E.g. I would expect to be able to create folders, and I would expect that after I create the folders `fs.exists` returns `True`.
A good example is that aws s3cli already simulates this behavior, and older versions of `s3fs` also did a similar thing.
It is very strange that `s3fs` chooses the other way now. If so, I would say you might as well just remove the `mkdir` API to avoid confusion and advertise the design idea of `s3fs`.
> There is clearly a gap between the s3fs project's version vs. the general concept of an fs interface.
Or perhaps a gap between what s3 is and what people might think of as a file system?
I am happy to follow AWS's lead on this, but am I wrong in thinking that the CLI doesn't concern itself with directories at all? https://docs.aws.amazon.com/cli/latest/reference/s3/index.html#available-commands
> just remove mkdir

It exists in the superclass.
> I am happy to follow AWS's lead on this, but am I wrong in thinking that the CLI doesn't concern itself with directories at all? https://docs.aws.amazon.com/cli/latest/reference/s3/index.html#available-commands
Oh yeah, you are right, I was wrong on this. The AWS CLI doesn't really have the concept of a folder either. But in general AWS S3 does have the concept of a folder. For example, in the AWS web console, there is actually a button to create a folder in a bucket.
> When you create a folder, S3 console creates an object with the above name appended by suffix "/" and that object is displayed as a folder in the S3 console. Choose the encryption setting for the object:
I think it would be nicer for the `s3fs` project to follow a similar design pattern so we can fit better into the `fsspec` framework.
We did indeed do that before, but it causes problems: when doing `ls("bucket/path")` (or perhaps including a trailing "/"), do we return the information for the key with that name too? It also has details and metadata. If we also have files with that prefix, then we will see both a file and a "common prefix" with the same name, whereas if we just put the files, it will not create the dir-like empty file. I prefer to follow the better-defined behaviour of the CLI tools...
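The listing ambiguity described above can be sketched with a toy prefix/delimiter listing. This is a simplified model of how such listings roll keys up into common prefixes, assumed here for illustration; it is not s3fs's implementation, the keys are relative to a bucket, and `list_objects` is a hypothetical helper.

```python
def list_objects(keys, prefix, delimiter="/"):
    """Split matching keys into plain objects and rolled-up common prefixes."""
    contents, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Anything deeper than one level is reported as a common prefix
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            contents.append(key)
    return contents, sorted(common_prefixes)


with_placeholder = {"path/", "path/file"}   # dir-like empty key plus a file
without_placeholder = {"path/file"}         # only the file was written

print(list_objects(with_placeholder, "path/"))     # (['path/', 'path/file'], [])
print(list_objects(without_placeholder, "path/"))  # (['path/file'], [])
```

With the placeholder present, listing the "directory" returns an entry for the directory itself, and once the trailing "/" is stripped that entry has the same name as the directory being listed.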
I'd like to echo @JackWindows's points. I get that S3 is an object store and folders don't really exist as first-class concepts. However, if someone wants to deal with the exact semantics of S3, they should probably just use the boto3/botocore libraries directly.
I work on a project where fsspec/s3fs are specifically being used for the abstraction (as implied by AbstractFileSystem base class). We'd like to get to common semantics where reasonable, otherwise the abstraction is leaky at best, littering the client code with conditionals dependent on the protocol in play. For my use-case, I'm not worried about maximizing performance by being as close as possible to the storage backend. Instead I want flexible code that "just works" when a backend is swapped out.
What about a parameter for S3FS (or any implementation of AbstractFileSystem) that allows the user to decide if they want "emulated" behavior or "native" behavior? Where "emulated" means that there is a well-defined semantics for every operation of AbstractFileSystem that all derived classes implement, and "native" means that semantics vary by implementation according to what makes sense by the underlying file store.
(As an aside, the discussion on this thread is maybe getting at the point that implementers want the semantics documented so they know whether they are compliant or not.)
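The "emulated" versus "native" switch proposed above could look roughly like the following. This is purely a hypothetical sketch on an in-memory stand-in: `ToyObjectFS` and its `dir_semantics` parameter are invented for illustration and exist in neither fsspec nor s3fs.

```python
class ToyObjectFS:
    """Hypothetical in-memory stand-in for an object-store filesystem."""

    def __init__(self, dir_semantics="native"):
        if dir_semantics not in ("native", "emulated"):
            raise ValueError("dir_semantics must be 'native' or 'emulated'")
        self.dir_semantics = dir_semantics
        self.keys = set()

    def mkdir(self, path):
        if self.dir_semantics == "emulated":
            # Emulated: record a placeholder key so the directory is visible
            self.keys.add(path.rstrip("/") + "/")
        # Native: no-op, because the store itself has no directories

    def exists(self, path):
        path = path.rstrip("/")
        return any(k == path or k.startswith(path + "/") for k in self.keys)


native = ToyObjectFS("native")
native.mkdir("bucket/abc")
emulated = ToyObjectFS("emulated")
emulated.mkdir("bucket/abc")
print(native.exists("bucket/abc"), emulated.exists("bucket/abc"))  # False True
```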
I agree that the littering of conditionals should be in this library and not yours.
I would argue that s3fs is already in "emulated" mode (but see the long discussion in #300). This is why, for instance, we strip trailing "/" from paths before handling, since path names ought not to look like that in a filesystem context, even though S3 allows it.
Probably the best thing to do would be to come up with test cases that can be added to s3fs, demonstrating the behaviour you think should be right, and then go about implementing them.
So... as for creating parent directories using `mkdirs`, I would suggest that this should only attempt to make the bucket in question if it doesn't exist (which can cause an error, as it should). So a test for existence of the directory afterwards would still fail unless some content has been put in the directory.
> So a test for existence of the directory afterwards would still fail unless some content has been put in the directory.
Or perhaps the test for existence of a folder should always be true? What often happens is that client code checks whether a dir exists before writing to it. In many filesystems this is required; in S3, it is not. Semantically, as long as the bucket exists, any directory "exists" already on S3, in the sense that you don't need to create it first.
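The client-side pattern mentioned above (ensure the parent exists, then write) can be sketched as a backend-agnostic helper. It assumes the filesystem object offers fsspec-style `makedirs(path, exist_ok=True)` and `open(path, mode)`; `write_bytes` and `LocalStandIn` are hypothetical names, and the local stand-in is only there so the sketch runs without S3.

```python
import os
import posixpath
import tempfile


def write_bytes(fs, path, data):
    """Create the parent directory if the backend needs it, then write."""
    parent = posixpath.dirname(path)
    if parent:
        # Required on local disks; a backend without real directories
        # can treat this as (nearly) a no-op
        fs.makedirs(parent, exist_ok=True)
    with fs.open(path, "wb") as f:
        f.write(data)


class LocalStandIn:
    """Hypothetical minimal backend exposing only the two methods used above."""

    def makedirs(self, path, exist_ok=False):
        os.makedirs(path, exist_ok=exist_ok)

    def open(self, path, mode="rb"):
        return open(path, mode)


with tempfile.TemporaryDirectory() as root:
    target = posixpath.join(root, "path1", "path2", "file")
    write_bytes(LocalStandIn(), target, b"hello")
    print(os.path.exists(target))  # True
```

Written this way, the caller never has to ask whether a directory "exists" first, which sidesteps the question of what `exists` should return for an empty prefix.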
> Or perhaps the test for existence of a folder should always be true?
I can see your reasoning, but
- how do you know that `exists` is intending to test for a directory?
- should you be able to access the bucket at least?
Currently, `ls('existingbucket/notadirectory')` returns `[]`, and `ls('notabucket/notadirectory')` gives `FileNotFoundError`.
Another case where I ran into the same problem: trying to write generic code based on the fsspec AbstractFileSystem implementation for an fsspec wrapper in `pyarrow`.
The specification has a `mkdir(.., create_parents=True)`, but this doesn't work for `s3fs`.
`create_parents` should indeed be added in s3fs, and should create the bucket only, if it doesn't exist - following the commentary above.
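The semantics suggested here (create the bucket only, and only when it is missing) can be sketched on a toy model. This is an illustration of the proposal as worded in this thread, not the s3fs source; `ToyS3` is a hypothetical class that tracks bucket names in memory.

```python
class ToyS3:
    """Hypothetical in-memory model that only tracks which buckets exist."""

    def __init__(self):
        self.buckets = set()

    def mkdir(self, path, create_parents=True):
        bucket, _, key = path.strip("/").partition("/")
        if bucket in self.buckets:
            if not key:
                # Asked to create a bucket that is already there
                raise FileExistsError(bucket)
            return  # nested path in an existing bucket: nothing to do
        if not key or create_parents:
            self.buckets.add(bucket)  # create the bucket only, never a placeholder key
        else:
            raise FileNotFoundError(bucket)


s3 = ToyS3()
s3.mkdir("new-bucket/abc", create_parents=True)
print(sorted(s3.buckets))  # ['new-bucket']
```

Under this proposal the nested "directory" still has no object behind it, so an existence check on `new-bucket/abc` would keep failing until some key is written under that prefix.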
Hi. In my project I'm using S3FileSystem as a drop-in for regular local file system functions, since it supposedly supports fsspec. I really don't care whether S3 actually supports folders or not, but if S3FileSystem implements fsspec, then S3FileSystem must emulate folder support; otherwise please don't inherit from fsspec and don't call it a FileSystem, because currently it is not one. For example, other folder-specific operations like listdir() return everything in the bucket instead of a single folder. Currently the situation is a hybrid: claiming support for folders but not implementing anything. So I have to write my own implementation of a real FileSystem that uses S3FileSystem underneath and tries to fix the issues with folders. Thank you.
@joaoe, please start a new issue specifying the code you are executing, your expected behaviour and what actually happens.
@ianthomas23 , my last comment above would be a quick PR, if this has not been implemented since.
> create_parents should indeed be added in s3fs, and should create the bucket only, if it doesn't exist - following the commentary above.
EDIT: I have checked and s3fs does indeed support create_parents in mkdir.