potential performance improvement for GSPath globbing capabilities
Hey, I've been using GSPath's globbing to glob over a fairly large GCS bucket (a couple of GB) and have noticed that it takes a lot longer than an equivalent google-cloud-storage call:
list_blobs(match_glob="**/version_1/**")
Furthermore, with Task Manager open while globbing the bucket, I see a significantly higher network footprint with GSPath than with the list_blobs implementation.
My guess is that cloudpathlib may be sending more network requests than necessary (correct me if I'm wrong).
Any reasons why we don't just leverage the match_glob arg for GSPath's glob capabilities?
GCloud SDK list_blobs(match_glob="") reference below:
https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/bucket.py#L1407
I believe GSPath("/path/to/folder/").glob("**/version_1/**") can be translated to list_blobs as follows:
list_blobs(prefix="/path/to/folder", match_glob="**/version_1/**")
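A minimal sketch of that translation, for illustration only. It uses the `prefix` and `match_glob` arguments of `Bucket.list_blobs` from google-cloud-storage; the helper names `open_bucket` and `glob_server_side` are hypothetical. Two assumptions are baked in: that object names carry no leading slash (so the prefix is stripped of one), and that the service's glob semantics are close enough to pathlib's for this pattern, which is not verified here.

```python
from typing import Iterator


def open_bucket(bucket_name: str):
    """Return a Bucket handle (needs google-cloud-storage and credentials)."""
    from google.cloud import storage  # imported lazily so the sketch loads without it

    return storage.Client().bucket(bucket_name)


def glob_server_side(bucket, folder: str, pattern: str) -> Iterator[str]:
    """Yield names of objects under `folder` that the service matched to `pattern`."""
    # Assumption: object names do not start with "/", so drop a leading slash.
    prefix = folder.lstrip("/")
    # The filtering happens in the listing request itself; only matches come back.
    for blob in bucket.list_blobs(prefix=prefix, match_glob=pattern):
        yield blob.name
```

Usage would look like `list(glob_server_side(open_bucket("my-bucket"), "/path/to/folder", "**/version_1/**"))`, where `my-bucket` is a placeholder name.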
happy to submit something if you would like this change incorporated :)
Thanks @fafnirZ for the thoughts.
I think comparable speed improvements are possible for most backends. The primary reason we don't use the SDK/API glob functionality in these scenarios is that, if we did, it would be hard to guarantee identical behavior between Path and CloudPath for complex patterns.
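To illustrate how two glob implementations can disagree on the same pattern, here is a small sketch that compares pathlib with the standard library's fnmatch. It says nothing about how any cloud SDK interprets patterns; fnmatch is only a convenient stand-in for "a different matcher", and the file names are made up.

```python
import tempfile
from fnmatch import fnmatchcase
from pathlib import Path

names = ["top.txt", "a/b/nested.txt"]  # hypothetical file layout

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in names:
        target = root / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    # pathlib: "*" never crosses a directory separator.
    pathlib_hits = sorted(p.relative_to(root).as_posix() for p in root.glob("*.txt"))

# fnmatch: "*" matches any characters, including "/".
fnmatch_hits = sorted(n for n in names if fnmatchcase(n, "*.txt"))

print(pathlib_hits)
print(fnmatch_hits)
```

The same pattern selects only the top-level file under pathlib but both files under fnmatch, which is the kind of divergence a shared test suite would need to pin down.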
At the moment, we list everything and then use the pathlib glob implementation directly by creating a shim, which does result in more resource usage.
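A toy model of why listing everything costs more, assuming a hypothetical in-memory `FakeBucket` that counts how many entries it hands back. This is not cloudpathlib's shim and not a real SDK; fnmatch is used as the matcher on both sides purely so the two strategies return the same result.

```python
from fnmatch import fnmatchcase


class FakeBucket:
    """Hypothetical in-memory bucket that counts entries sent to the client."""

    def __init__(self, keys):
        self.keys = list(keys)
        self.entries_sent = 0

    def list_all(self):
        # Every key is returned, whether or not the caller needs it.
        self.entries_sent += len(self.keys)
        return list(self.keys)

    def list_matching(self, pattern):
        # Stand-in for a server-side filter: only matching keys are returned.
        hits = [k for k in self.keys if fnmatchcase(k, pattern)]
        self.entries_sent += len(hits)
        return hits


keys = [f"data/run_{i}/version_{v}/part.bin" for i in range(1000) for v in (1, 2, 3)]
pattern = "*/version_1/*"  # fnmatch's "*" crosses "/", so this acts recursively

# Strategy 1: list everything, then filter on the client.
client_side = FakeBucket(keys)
client_hits = [k for k in client_side.list_all() if fnmatchcase(k, pattern)]

# Strategy 2: let the "server" filter before anything is sent.
server_side = FakeBucket(keys)
server_hits = server_side.list_matching(pattern)

print(client_side.entries_sent, server_side.entries_sent)
```

Both strategies find the same keys, but the first one transfers every entry in the bucket to do so.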
The ideal scenario would be to have a fully specified set of behaviors and test cases for glob, so that we could ensure patterns that work for Path objects work the same way for CloudPath. If we had that and were confident in the coverage, I think we could look at optimized implementations that use the API/SDK in place of the pathlib glob logic.
Fair enough, thanks for explaining!
Just wanted to add that I'm seeing similar performance problems with AWS S3 globbing through cloudpathlib. I love cloudpathlib; it's a valuable addition to our work. That said, glob is slow enough that it freezes programs when too many files are involved.
Consider this an upvote towards API/SDK-specific implementations for optimization (if / when possible)!