cloudpathlib Speed up `is_dir` on entries returned by `iterdir` and `glob`

The is_dir check is fairly expensive, but at least for S3 and Azure when the entries were created as a result of the client's _list_dir method, you can tell for each entry whether it is a directory or a file and immediately set the result on the created CloudPath instance.

For example for the S3Client._list_dir, you could write something like:

            paginator = self.client.get_paginator("list_objects_v2")

            for result in paginator.paginate(
                Bucket=cloud_path.bucket, Prefix=prefix, Delimiter="/", MaxKeys=1000
            ):

                # sub directory names
                for result_prefix in result.get("CommonPrefixes", []):
                    path = S3Path(f"s3://{cloud_path.bucket}/{result_prefix.get('Prefix')}")
                    path._is_dir = True
                    yield path

                # files in the directory
                for result_key in result.get("Contents", []):
                    path = S3Path(f"s3://{cloud_path.bucket}/{result_key.get('Key')}")
                    path._is_dir = False
                    yield path

and modify S3Path.is_dir:

    def is_dir(self) -> bool:
        if self._is_dir is None:
            self._is_dir = self.client._is_file_or_dir(self) == "dir"
        return self._is_dir

This makes a HUGE performance difference if you need to call is_dir on the entries returned from iterdir or glob (in my case, when implementing a file dialog that works for cloud paths).

Not sure if this particular implementation is the best way to do this, but something like this is needed.

Oct 06 '21 20:10 analog-cbarber

It's a good idea to cache the _is_file_or_dir result for the life of the CloudPath object—I do think it's unlikely to change from one to the other in normal scenarios.

As you point out, there are probably points in the code where we can populate that cache explicitly when we create the CloudPath object since we know at that point what it is.

I think we'd definitely take a PR that had this improvement with tests and a set of benchmarks showing improved timing and also reduced network calls.

Oct 07 '21 03:10 pjbull

I think we'll perhaps want to write down some principles of how we expect this caching to work. Cache invalidation is classically tricky, so we're going to want to be clear what categories of things get cached, how long it gets cached (lifetime of CloudPath object introduces a new concept not dealt with before in either cloudpathlib or regular pathlib), how to invalidate the cache, etc.

Oct 07 '21 14:10 jayqi

The simplest thing to do would be to provide ways in the public API to clear or ignore the cached value. I would start out with that. E.g.:

Add a sync keyword on the is_dir, is_file and exists that can be used to force the call to sync.
Add a file_type property on CloudPath that can be 'file', 'dir', 'missing', or None (or perhaps use an enum) and allow users to set it to None.
Add a clear_cached method or the like to CloudPath.

If you want to get fancy, you could just add an expiration time along with suitable methods for controlling the duration. I think I would probably go with the easy way first.

This really is an important optimization because doing individual file/directory checks on thousands of entries in a large data directory kills performance.

Oct 07 '21 14:10 analog-cbarber

BTW, it probably would also be a good idea to cache other information returned in the directory entries that could be used to populate the stat() call, e.g. the file size and last modified time and perhaps the etag as well.

Oct 07 '21 15:10 analog-cbarber

Note that methods that create an entry should make sure to fix the cached state appropriately.

Oct 07 '21 18:10 analog-cbarber

You might want to instead cache this information in the client object in a weak dictionary.

Oct 07 '21 18:10 analog-cbarber