cloudpathlib
Handle `mkdir` for cloud providers that support creating directories
S3 has an interesting situation with folders.

Like other object stores such as Azure, it has a flat structure: when you upload a file to `a/b/c.txt`, for example, it creates an object literally named `a/b/c.txt`. The directories `a` and `b` aren't real and don't exist. The web console has special behavior to fake those as folders in the UI. When you delete `c.txt`, `a` and `b` will automatically be gone.

However, S3 does have another mechanism that lets you have folders. There is a "Create folder" button in the web console, which lets you make folders that exist even while empty. These turn out to actually be dummy files with a trailing slash. So if you create `a/`, there is actually an object in your bucket named `a/`, which is not actually a folder, but the S3 console will treat it like a folder in the UI. You can equivalently upload a file to `a/` and it will do the same thing.
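As a rough illustration (not cloudpathlib code, and the bucket name and keys here are hypothetical), here is a boto3 sketch of both kinds of objects; the zero-byte object whose key ends in `/` is what the console's "Create folder" button produces:

```python
import boto3

s3 = boto3.client("s3")

# A normal upload: creates one object literally named "a/b/c.txt".
# The "directories" a and b are only implied by the key name.
s3.put_object(Bucket="my-bucket", Key="a/b/c.txt", Body=b"hello")

# Roughly what the console's "Create folder" button does: a zero-byte
# object whose key ends in a trailing slash.
s3.put_object(Bucket="my-bucket", Key="a/", Body=b"")

# Both show up as plain objects when listing the bucket.
for obj in s3.list_objects_v2(Bucket="my-bucket", Prefix="a")["Contents"]:
    print(obj["Key"])  # -> "a/", "a/b/c.txt"
```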
We need to think through the implications of this and what cloudpathlib should support.
- pathlib's `PurePosixPath` strips trailing `/` on instantiation, so this is something that doesn't map cleanly to a representation through `PurePosixPath` (see the snippet after this list).
- ~~As a result, it's not currently possible to create an `S3Path` object that points to an S3 folder.~~ EDIT: This was incorrect. See discussion in #190. This is possible because the string representation of the input URI is the main basis for a `CloudPath` object, not a `PurePosixPath`.
- Additionally, the `S3Path.mkdir` method is not implemented.
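For reference, the trailing-slash stripping mentioned above is easy to see in the REPL:

```python
>>> from pathlib import PurePosixPath
>>> PurePosixPath("a/")
PurePosixPath('a')
>>> str(PurePosixPath("a/")) == str(PurePosixPath("a"))
True
```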
This already leads to some tricky, unintuitive behavior in our existing implementation.

In a filesystem, you can't have a directory and a file that share a name:
```python
>>> from pathlib import Path
>>> p = Path()
>>> (p / "foo").touch()
>>> (p / "foo").mkdir()
Traceback (most recent call last):
...
FileExistsError: [Errno 17] File exists: 'foo'
```
S3 has no problem doing this. Not only that, but there is path dependence in how cloudpathlib handles it.

Making `"foo"` first and then `"foo/bar.txt"` works, but the `foo` file now masks the directory.
```python
>>> s = CloudPath("s3://my-bucket/")
>>> (s / "foo").touch()
>>> (s / "foo" / "bar.txt").touch()
>>> (s / "foo").is_dir()
False
```

(`s` is an `S3Path`)
Making "foo/bar.txt"
first and then making "foo"
throws an error, because pathlib has some notion of representing the virtual folders, but then gets confused when you try to manipulate it.
>>> (s / "foo" / "bar.txt").touch()
>>> (s / "foo").exists()
True
>>> (s / "foo").is_dir()
True
>>> (s / "foo").touch()
Traceback (most recent call last):
...
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Overall, I think this is just going to be a problem with all flat object stores. I can also create `foo`, `foo/`, and `foo/bar.txt` files in Azure Blob Storage and run into problems with the disambiguation. I'm not sure there are any great solutions around this. We will have to live with limitations because we're both modeling the object store with a filesystem (via the `CloudPath` interface and mechanics) and actually using a filesystem to cache the object store.
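For example, a rough azure-storage-blob sketch (the connection string and container name are hypothetical) of creating that same ambiguous trio of blobs:

```python
from azure.storage.blob import BlobServiceClient

# Hypothetical connection string and container; any flat blob container will do.
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("my-container")

# All three of these coexist happily in flat blob storage,
# even though no filesystem would allow it.
container.upload_blob("foo", b"I am a file named foo")
container.upload_blob("foo/", b"")  # "directory marker" style blob
container.upload_blob("foo/bar.txt", b"I am nested under foo")
```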
Doing some testing with the AWS CLI:

- `foo/` and `foo/bar.txt` both exist:
  - `aws s3 ls` shows `foo/` as a directory
  - `aws s3 sync` will download `foo/bar.txt` and ignore `foo/`
- only `foo/` exists:
  - `aws s3 ls` shows `foo/` as a directory
  - `aws s3 sync` will ignore `foo/` and not even create a directory locally
- `foo` and `foo/bar.txt` both exist:
  - `aws s3 ls` shows `foo/` as a directory and `foo` as a file
  - `aws s3 sync` will download `foo/bar.txt` first, and then error on `foo` because the directory `foo/` already exists locally. This isn't fatal, and the process will continue to download any other files that are in the current directory.
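To reproduce these scenarios before running the `aws s3 ls` / `aws s3 sync` commands above, a minimal boto3 setup sketch (bucket name is hypothetical) might look like this:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-test-bucket"  # hypothetical test bucket

# Scenario 1: foo/ and foo/bar.txt both exist
s3.put_object(Bucket=bucket, Key="foo/", Body=b"")
s3.put_object(Bucket=bucket, Key="foo/bar.txt", Body=b"contents")

# Scenario 2: only foo/ exists
# s3.delete_object(Bucket=bucket, Key="foo/bar.txt")

# Scenario 3: foo and foo/bar.txt both exist
# s3.delete_object(Bucket=bucket, Key="foo/")
# s3.put_object(Bucket=bucket, Key="foo", Body=b"plain file")
# s3.put_object(Bucket=bucket, Key="foo/bar.txt", Body=b"contents")
```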
In that case, I think we will need to come up with a specification of what happens, and document how it works. For example, we may decide:

- cloudpathlib cannot directly interact with files with trailing slashes (explicitly will not support S3 "folders")
- if `foo` and `foo/bar.txt` both exist, the directory will take precedence in all cloudpathlib methods. `foo` will be masked and users will not be able to directly interact with it so long as `foo/bar.txt` exists. If `foo` was in the local cache, it gets deleted from the local cache.

This seems like it would best match what the AWS CLI does.
When we add #17, we can write tests for these scenarios once we define what the behavior should be.
This is also broken for blob storage using Data Lake Storage Gen2, which DOES have a concept of hierarchical directory namespaces.
Thanks @analog-cbarber. Can you provide a code snippet that repros the problem and some more details on the configuration?
If you create a Data Lake Storage Gen2 container and create a directory in the container using `mkdir()`, which works, the current implementation returns False for `is_dir()` and subsequently does not support `rmtree()`. I am not sure if there is a way to remove such a directory using your current API (I tried `unlink` but got an error). I can iterate over the files using `glob` and remove those, but not the directory entries themselves.
```python
from cloudpathlib import AzureBlobClient, AzureBlobPath

# AZ_TEST_ACCOUNT, AZ_SAS, and AZ_TEST_CONTAINER are the test account configuration
az_test_client = AzureBlobClient(AZ_TEST_ACCOUNT, AZ_SAS)
az_test_path = AzureBlobPath(f'az://{AZ_TEST_CONTAINER}', client=az_test_client)
subdir = az_test_path.joinpath('subdir')
subdir.mkdir()
assert subdir.is_dir()  # fails
```
If the trick to create an S3 folder is to touch a key whose path ends with `/` (so it doesn't get mixed up with a regular file), why don't we just do the following?

```python
class S3Path(CloudPath):
    ...

    # simplified version
    def mkdir(self, parents=False, exist_ok=False):
        # ensure the key ends with a trailing slash so S3 treats it as a folder
        path = self if str(self).endswith("/") else S3Path(f"{str(self)}/")
        path.touch(exist_ok=exist_ok)
```
Additional useful discussion in #295, which has been consolidated into this issue.
Treating fake directories on S3 as directories should work correctly now with #302. Changing this issue to just be about implementing the `mkdir` method for the cloud providers that support it.
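For providers that do have real directories, such as Azure Data Lake Storage Gen2 with hierarchical namespace enabled, the underlying SDK call is straightforward. A rough sketch using the azure-storage-file-datalake package directly, not the cloudpathlib API (account URL, credential, and names here are hypothetical):

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical account and credential; requires a storage account with
# hierarchical namespace (Data Lake Storage Gen2) enabled.
service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",
    credential="<credential>",
)
filesystem = service.get_file_system_client("my-container")

# Unlike flat blob storage, this creates a real, empty directory that
# exists(), is_dir(), and rmtree()-style operations could be built on.
directory = filesystem.get_directory_client("subdir")
directory.create_directory()
```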