cloudpathlib
Handle `mkdir` for cloud providers that support creating directories
S3 has an interesting situation with folders.

Like other object stores such as Azure, it has a flat structure: when you upload a file to `a/b/c.txt`, for example, it creates an object literally named `a/b/c.txt`. The directories `a` and `b` aren't real and don't exist. The web console has special behavior to fake those as folders in the UI. When you delete `c.txt`, `a` and `b` will automatically be gone.

However, S3 does have another mechanism that lets you have folders. There is a "Create folder" button in the web console, which lets you make folders that exist even while empty. These turn out to actually be dummy files with a trailing slash. So if you create `a/`, there is actually an object in your bucket named `a/`, which is not actually a folder, but the S3 console will treat it like a folder in the UI. You can equivalently upload a file to `a/` and it will do the same thing.
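As a rough illustration (not cloudpathlib code, and the bucket name and keys here are hypothetical), here is a boto3 sketch of both kinds of objects; the zero-byte object whose key ends in `/` is what the console's "Create folder" button produces:

```python
import boto3

s3 = boto3.client("s3")

# A normal upload: creates one object literally named "a/b/c.txt".
# The "directories" a and b are only implied by the key name.
s3.put_object(Bucket="my-bucket", Key="a/b/c.txt", Body=b"hello")

# Roughly what the console's "Create folder" button does: a zero-byte
# object whose key ends in a trailing slash.
s3.put_object(Bucket="my-bucket", Key="a/", Body=b"")

# Both show up as plain objects when listing the bucket.
for obj in s3.list_objects_v2(Bucket="my-bucket", Prefix="a")["Contents"]:
    print(obj["Key"])  # -> "a/", "a/b/c.txt"
```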
We need to think through the implications of this and what cloudpathlib should support.
- pathlib's `PurePosixPath` strips trailing `/` on instantiation, so this is something that doesn't map cleanly to a representation through `PurePosixPath` (see the snippet after this list).
- ~~As a result, it's not currently possible to create an `S3Path` object that points to an S3 folder.~~ EDIT: This was incorrect. See discussion in #190. This is possible because the string representation of the input URI is the main basis for a `CloudPath` object, not a `PurePosixPath`.
- Additionally, the `S3Path.mkdir` method is not implemented.
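For reference, the trailing-slash stripping mentioned above is easy to see in the REPL:

```python
>>> from pathlib import PurePosixPath
>>> PurePosixPath("a/")
PurePosixPath('a')
>>> str(PurePosixPath("a/")) == str(PurePosixPath("a"))
True
```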
This already leads to some tricky, unintuitive behavior in our existing implementation.

In a filesystem, you can't have a directory and a file that share a name:
```python
>>> from pathlib import Path
>>> p = Path()
>>> (p / "foo").touch()
>>> (p / "foo").mkdir()
Traceback (most recent call last):
...
FileExistsError: [Errno 17] File exists: 'foo'
```
S3 has no problem doing this. Not only that, but there is path dependence in how cloudpathlib handles it.

Making `"foo"` first and then `"foo/bar.txt"` works, but the `foo` file now masks the directory.
```python
>>> s = CloudPath("s3://my-bucket/")
>>> (s / "foo").touch()
>>> (s / "foo" / "bar.txt").touch()
>>> (s / "foo").is_dir()
False
```

(`s` is an `S3Path`)
Making "foo/bar.txt"
first and then making "foo"
throws an error, because pathlib has some notion of representing the virtual folders, but then gets confused when you try to manipulate it.
>>> (s / "foo" / "bar.txt").touch()
>>> (s / "foo").exists()
True
>>> (s / "foo").is_dir()
True
>>> (s / "foo").touch()
Traceback (most recent call last):
...
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Overall, I think this is just going to be a problem with all flat object stores. I can also create `foo`, `foo/`, and `foo/bar.txt` files in Azure Blob Storage and run into problems with the disambiguation. I'm not sure there are any great solutions around this. We will have to live with limitations because we're both modeling the object store with a filesystem (via the `CloudPath` interface and mechanics) and actually using a filesystem to cache the object store.
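For example, a rough azure-storage-blob sketch (the connection string and container name are hypothetical) of creating that same ambiguous trio of blobs:

```python
from azure.storage.blob import BlobServiceClient

# Hypothetical connection string and container; any flat blob container will do.
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("my-container")

# All three of these coexist happily in flat blob storage,
# even though no filesystem would allow it.
container.upload_blob("foo", b"I am a file named foo")
container.upload_blob("foo/", b"")  # "directory marker" style blob
container.upload_blob("foo/bar.txt", b"I am nested under foo")
```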
Doing some testing with the AWS CLI:

- `foo/` and `foo/bar.txt` both exist:
  - `aws s3 ls` shows `foo/` as a directory
  - `aws s3 sync` will download `foo/bar.txt` and ignore `foo/`
- only `foo/` exists:
  - `aws s3 ls` shows `foo/` as a directory
  - `aws s3 sync` will ignore `foo/` and not even create a directory locally
- `foo` and `foo/bar.txt` both exist:
  - `aws s3 ls` shows `foo/` as a directory and `foo` as a file
  - `aws s3 sync` will download `foo/bar.txt` first, and then error on `foo` because the directory `foo/` already exists locally. This isn't fatal, and the process will continue to download any other files that are in the current directory.
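To reproduce these scenarios before running the `aws s3 ls` / `aws s3 sync` commands above, a minimal boto3 setup sketch (bucket name is hypothetical) might look like this:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-test-bucket"  # hypothetical test bucket

# Scenario 1: foo/ and foo/bar.txt both exist
s3.put_object(Bucket=bucket, Key="foo/", Body=b"")
s3.put_object(Bucket=bucket, Key="foo/bar.txt", Body=b"contents")

# Scenario 2: only foo/ exists
# s3.delete_object(Bucket=bucket, Key="foo/bar.txt")

# Scenario 3: foo and foo/bar.txt both exist
# s3.delete_object(Bucket=bucket, Key="foo/")
# s3.put_object(Bucket=bucket, Key="foo", Body=b"plain file")
# s3.put_object(Bucket=bucket, Key="foo/bar.txt", Body=b"contents")
```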
In that case, I think we will need to come up with a specification of what happens, and document how it works. For example, we may decide:

- cloudpathlib cannot directly interact with files with trailing slashes (explicitly will not support S3 "folders")
- if `foo` and `foo/bar.txt` both exist, the directory will take precedence in all cloudpathlib methods. `foo` will be masked and users will not be able to directly interact with it so long as `foo/bar.txt` exists. If `foo` was in the local cache, it gets deleted from the local cache.

This seems like it would best match what the AWS CLI does.
When we add #17, we can write tests for these scenarios once we define what the behavior should be.
This is also broken for blob storage using Data Lake Storage Gen2, which DOES have a concept of hierarchical directory namespaces.
Thanks @analog-cbarber. Can you provide a code snippet that repros the problem and some more details on the configuration?
If you create a Data Lake Storage Gen2 container and create a directory in the container using `mkdir()`, which works, the current implementation returns False for `is_dir()` and subsequently does not support `rmtree()`. I am not sure if there is a way to remove such a directory using your current API (I tried `unlink` but got an error). I can iterate over the files using `glob` and remove those, but not the directory entries themselves.
```python
from cloudpathlib import AzureBlobClient, AzureBlobPath

# AZ_TEST_ACCOUNT, AZ_SAS, and AZ_TEST_CONTAINER are the test account configuration
az_test_client = AzureBlobClient(AZ_TEST_ACCOUNT, AZ_SAS)
az_test_path = AzureBlobPath(f'az://{AZ_TEST_CONTAINER}', client=az_test_client)
subdir = az_test_path.joinpath('subdir')
subdir.mkdir()
assert subdir.is_dir()  # fails
```
If the trick to create an S3 folder is to touch a key whose path ends with `/` (so it doesn't get mixed up with a regular file), why don't we just do the following?

```python
class S3Path(CloudPath):
    ...

    # simplified version
    def mkdir(self, parents=False, exist_ok=False):
        # ensure the key ends with a trailing slash so S3 treats it as a folder
        path = self if str(self).endswith("/") else S3Path(f"{str(self)}/")
        path.touch(exist_ok=exist_ok)
```
Additional useful discussion in #295, which has been consolidated into this issue.
Treating fake directories on S3 as directories should work correctly now with #302. Changing this issue to just be about implementing the `mkdir` method for the cloud providers that support it.
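For providers that do have real directories, such as Azure Data Lake Storage Gen2 with hierarchical namespace enabled, the underlying SDK call is straightforward. A rough sketch using the azure-storage-file-datalake package directly, not the cloudpathlib API (account URL, credential, and names here are hypothetical):

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical account and credential; requires a storage account with
# hierarchical namespace (Data Lake Storage Gen2) enabled.
service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",
    credential="<credential>",
)
filesystem = service.get_file_system_client("my-container")

# Unlike flat blob storage, this creates a real, empty directory that
# exists(), is_dir(), and rmtree()-style operations could be built on.
directory = filesystem.get_directory_client("subdir")
directory.create_directory()
```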