cloudpathlib
cloudpathlib copied to clipboard
Speed up `is_dir` on entries returned by `iterdir` and `glob`
The is_dir check is fairly expensive, but at least for S3 and Azure when the entries were created as a result of the client's _list_dir method, you can tell for each entry whether it is a directory or a file and immediately set the result on the created CloudPath instance.
For example for the S3Client._list_dir, you could write something like:
paginator = self.client.get_paginator("list_objects_v2")
for result in paginator.paginate(
Bucket=cloud_path.bucket, Prefix=prefix, Delimiter="/", MaxKeys=1000
):
# sub directory names
for result_prefix in result.get("CommonPrefixes", []):
path = S3Path(f"s3://{cloud_path.bucket}/{result_prefix.get('Prefix')}")
path._is_dir = True
yield path
# files in the directory
for result_key in result.get("Contents", []):
path = S3Path(f"s3://{cloud_path.bucket}/{result_key.get('Key')}")
path._is_dir = False
yield path
and modify S3Path.is_dir:
def is_dir(self) -> bool:
if self._is_dir is None:
self._is_dir = self.client._is_file_or_dir(self) == "dir"
return self._is_dir
This makes a HUGE performance difference if you need to call is_dir on the entries returned from iterdir or glob (in my case, when implementing a file dialog that works for cloud paths).
Not sure if this particular implementation is the best way to do this, but something like this is needed.
It's a good idea to cache the _is_file_or_dir result for the life of the CloudPath object—I do think it's unlikely to change from one to the other in normal scenarios.
As you point out, there are probably points in the code where we can populate that cache explicitly when we create the CloudPath object since we know at that point what it is.
I think we'd definitely take a PR that had this improvement with tests and a set of benchmarks showing improved timing and also reduced network calls.
I think we'll perhaps want to write down some principles of how we expect this caching to work. Cache invalidation is classically tricky, so we're going to want to be clear what categories of things get cached, how long it gets cached (lifetime of CloudPath object introduces a new concept not dealt with before in either cloudpathlib or regular pathlib), how to invalidate the cache, etc.
The simplest thing to do would be to provide ways in the public API to clear or ignore the cached value. I would start out with that. E.g.:
- Add a
synckeyword on theis_dir,is_fileandexiststhat can be used to force the call to sync. - Add a file_type property on CloudPath that can be 'file', 'dir', 'missing', or None (or perhaps use an enum) and allow users to set it to None.
- Add a
clear_cachedmethod or the like to CloudPath.
If you want to get fancy, you could just add an expiration time along with suitable methods for controlling the duration. I think I would probably go with the easy way first.
This really is an important optimization because doing individual file/directory checks on thousands of entries in a large data directory kills performance.
BTW, it probably would also be a good idea to cache other information returned in the directory entries that could be used to populate the stat() call, e.g. the file size and last modified time and perhaps the etag as well.
Note that methods that create an entry should make sure to fix the cached state appropriately.
You might want to instead cache this information in the client object in a weak dictionary.