is_dir() and is_file() are not working properly for gs
Hey,
I tried to use the is_dir and is_file functions from both s3 and gs, but discovered that:
from cloudpathlib import AnyPath
p1 = AnyPath("gs://my-bucket/test/test_dir")
p2 = AnyPath("gs://my-bucket/test/test_dir/")
print(p1.is_dir()) # True
print(p2.is_dir()) # False
p1 = AnyPath("s3://my-bucket/test/test_dir")
p2 = AnyPath("s3://my-bucket/test/test_dir/")
print(p1.is_dir()) # True
print(p2.is_dir()) # True
It looks like in gs everything is classified as a file unless I strip the trailing "/".
Any idea why this is happening?
I'm using cloudpathlib==0.20.0.
Thanks in advance
This doesn't repro generically, so it has something to do with the configuration of your bucket/storage and the objects that actually exist in your storage. For example, I see:
In [1]: from cloudpathlib import CloudPath
In [2]: CloudPath('gs://cloudpathlib-test-bucket/performance_tests/').is_dir()
Out[2]: True
In [2]: CloudPath('gs://cloudpathlib-test-bucket/performance_tests').is_dir()
Out[2]: True
Do you have more information about your use case?
A few helpful questions:
- Are there actual blob objects that exist and end with a /? If so, do you know how these got created?
- Is your bucket configured with a different folder type than the standard Simulated Folders?
- Could you provide all of the object metadata for the folders that we say are files?
Thanks for the response! I'm still getting this for some reason. I just created a folder using the gs UI:
>>> from cloudpathlib import AnyPath
>>> p1 = AnyPath("gs://airis-packages-tests/folder1")
>>> p1.exists()
True
>>> p1.is_dir()
True
>>> p2 = AnyPath("gs://airis-packages-tests/folder1/")
>>> p2.is_dir()
False
For some reason the blob metadata function returns data only for "folder1/"; for "folder1" I got None:
Blob: folder1/
Bucket: airis-packages-tests
Storage class: STANDARD
ID: airis-packages-tests/folder1//1735484279960870
Size: 0 bytes
Updated: 2024-12-29 14:58:00.020000+00:00
Generation: 1735484279960870
Metageneration: 1
Etag: CKbqlOCezYoDEAE=
Owner: None
Component count: None
Crc32c: AAAAAA==
md5_hash: 1B2M2Y8AsgTpgAmY7PhCfg==
Cache-control: None
Content-type: text/plain
Content-disposition: None
Content-encoding: None
Content-language: None
Metadata: None
Medialink: https://storage.googleapis.com/download/storage/v1/b/airis-packages-tests/o/folder1%2F?generation=1735484279960870&alt=media
Custom Time: None
Temporary hold: disabled
Event based hold: disabled
Retention mode: None
Retention retain until time: None
From this, it looks like the blob object's name ends with "/", and I don't know why. Maybe that's the issue? The bucket itself was created with GCP's default configuration.
Quick update: I tried creating a path with gsutil: gsutil cp file1.txt gs://airis-packages-tests/folder8/123.txt and is_dir worked perfectly fine on folder8. It seems the issue is with folders created through the gs UI; maybe it creates an empty blob or something like that. Is it possible to support that as well? Thanks in advance!
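In the meantime, one possible workaround follows from the behaviour observed above, where the path without the trailing slash reports True: normalize the URI string before building the path. This is only a sketch, and the helper name `strip_trailing_slash` is hypothetical (it is not part of cloudpathlib):

```python
def strip_trailing_slash(uri: str) -> str:
    """Remove trailing slashes from a cloud URI, leaving the scheme intact."""
    scheme, sep, rest = uri.partition("://")
    if not sep:
        # No scheme found: strip plain trailing slashes, but keep a bare "/".
        return uri.rstrip("/") or uri
    return f"{scheme}{sep}{rest.rstrip('/')}"


print(strip_trailing_slash("gs://airis-packages-tests/folder1/"))
# gs://airis-packages-tests/folder1
```

The result can then be passed to AnyPath(...) as usual. This only sidesteps the symptom; the placeholder object named "folder1/" still exists in the bucket.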
Hey @pjbull, any updates on this? 😅
I similarly created a folder via the gui on a simulated fs bucket (default) and the following snippet returned something interesting:
bucket.get_blob("aaa/")._properties
returns
{"kind": "storage#object", "id": "{bucket}/aaa//{some_id} "...}
this seems to me like GUI generated folders are considered as file objects (but not really? since this object can have children)
I also performed:
from cloudpathlib import GSPath
p = GSPath("gs://bucket/aaa/file.txt")
p.touch()
p.unlink()
and even after that operation the folder seems to persist in gcs for some weird reason, whereas non gui created folders automatically gets deleted if all children blobs are removed.
On why .is_dir() fails, my guess is because
https://github.com/drivendataorg/cloudpathlib/blob/master/cloudpathlib/gs/gsclient.py#L155
is_file_or_dir considers the path a file object so long as bucket.get_blob yields a non null value.
Note: I tested .is_file() and it returned True.
so potentially a fix is just to embed a check inside the if block on L155 which makes an exception for values ending in /?
i.e.
if blob is not None:
    if blob.name.endswith("/"):
        return "dir"
    return "file"
something like that could potentially work? happy to submit something if this solution is acceptable
Hi @shanirosen-airis, @fafnirZ,
Unfortunately, this is a tricky gotcha that we haven't totally figured out the right UX for.
It is indeed the case that when you create a folder in the web console GUI, there is a fake file that gets created with the folder's name.
This is because object stores generally have a "flat" address space. Unlike the file system on your computer, they don't have the concept of a folder at all. This is true not just of GCS, but also Amazon S3 and Azure Blob Storage. When you have an object gs://somebucket/somedir/myfile.txt, the object's identifier is the entire string somedir/myfile.txt where the / character is not special and no different than the r or the m.
The natural question is then: what is the console GUI doing? The web consoles for these services parse the paths and split on the / to give you the illusion of directories. That's why you see the behavior "non gui created folders automatically gets deleted if all children blobs are removed". It's because those folders weren't real and never existed, not because they get deleted when you delete the file.
Then, why the behavior when you use the "Create folder" action? The only way for the console GUI to know to show you a folder at somedir/ before you create any objects with a name somedir/{whatever} is if it creates a dummy placeholder object named somedir/. This also means that @fafnirZ's use of "children objects" is incorrect. Two objects somedir/ and somedir/myfile.txt are two sibling objects in a flat object store.
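The flat address space and the console's "illusion of directories" can be sketched with a plain dictionary. This is a toy model, not how any cloud console is actually implemented; the function name `simulated_folders` and the object names are made up for illustration:

```python
def simulated_folders(names, prefix=""):
    """Return the 'folders' a console might show directly under prefix,
    derived purely by splitting flat object names on "/"."""
    folders = set()
    for name in names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if "/" in rest:
            folders.add(prefix + rest.split("/", 1)[0] + "/")
    return sorted(folders)


# A bucket modelled as a flat mapping from full object name to contents.
bucket = {"somedir/myfile.txt": b"hello", "top.txt": b"!"}
print(simulated_folders(bucket))  # ['somedir/']

# Deleting the only object with that prefix makes the "folder" vanish,
# because it never existed as an object in the first place.
del bucket["somedir/myfile.txt"]
print(simulated_folders(bucket))  # []

# A "Create folder" action modelled as writing an empty placeholder object.
bucket["emptydir/"] = b""
bucket["emptydir/file.txt"] = b"x"
del bucket["emptydir/file.txt"]
print(simulated_folders(bucket))  # ['emptydir/'] (the placeholder remains)
```

In this model, "emptydir/" and "emptydir/file.txt" are simply two keys in the same dictionary, which is the sibling relationship described above.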
So, at a basic level, cloudpathlib is doing the correct naive thing because it detects there is a file in your bucket named somedir/. This is understandably confusing, since cloudpathlib doesn't match the behavior of the console GUI, which has some logic by which it pretends that these are folders. We probably need to figure out what heuristic each cloud's console GUI is using to treat the dummy objects as folders and copy it. A quick fix like "check if there's a trailing /" may not be safe because a user can legitimately create real objects named something/ that they want to treat as a file.
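To make the trade-off concrete, here is a simplified model of the classification logic. It is not cloudpathlib's actual implementation; `classify` and its `placeholder_heuristic` flag are hypothetical, and the "zero-byte object whose name ends in /" rule is only one conceivable heuristic, not something confirmed to match any console:

```python
from typing import Dict, Optional


def classify(objects: Dict[str, int], key: str,
             placeholder_heuristic: bool = False) -> Optional[str]:
    """Classify key against a flat store mapping object name -> size in bytes.

    Returns "file", "dir", or None if nothing matches.
    """
    if key in objects:
        # Assumed heuristic: an empty object ending in "/" is a folder marker.
        if placeholder_heuristic and key.endswith("/") and objects[key] == 0:
            return "dir"
        return "file"
    # No exact object: treat the key as a dir if any object name starts with it.
    prefix = key if key.endswith("/") else key + "/"
    if any(name.startswith(prefix) for name in objects):
        return "dir"
    return None


objects = {"folder1/": 0, "folder8/123.txt": 5, "report/": 12}

print(classify(objects, "folder1"))   # dir
print(classify(objects, "folder1/"))  # file
print(classify(objects, "folder1/", placeholder_heuristic=True))  # dir
print(classify(objects, "report/", placeholder_heuristic=True))   # file
```

The naive mode mirrors the results reported in this thread: the exact-name match on "folder1/" wins and is reported as a file. The heuristic mode still misclassifies a zero-byte object named something/ that a user really does want treated as a file, which is why this is not straightforward.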
See also #51 for discussion about this and why this is not straightforward.