new feature: Efficient Client-Side Glob Implementation via Guided Traversal
Feature Description
The glob_support RFC (apache/opendal#6209) proposes a client-side fallback for services that don't support native globbing. The default implementation for this would be to list all entries recursively and then filter them, which can be inefficient for large directories. This issue proposes a more advanced client-side implementation that avoids these inefficiencies.
Solution Description I propose implementing the client-side glob functionality using a "Guided Traversal" algorithm. This approach is significantly more efficient than a simple "list-then-filter" strategy, especially for remote object stores where API calls are expensive.
The core idea is to parse the glob pattern into segments and use these segments to intelligently guide the directory traversal step-by-step, pruning the search space at each level.
Problem and Solution
How it Works
For a pattern like **/*.jpg starting from a base_path s3://bucket/media/:
- Parse Pattern: The pattern is broken down into segments:
['media', '**', '*.jpg']. - Stateful Traversal: A queue manages the search state, holding
(path_to_search, remaining_segments)tuples. - Guided API Calls: Instead of one large recursive list, the algorithm performs a series of targeted, single-level
listcalls.- It first lists the
base_pathto find a directory namedmedia. - Once inside
media/, it handles the**by recursively exploring subdirectories, looking for entries that could satisfy the next segment (*.jpg). - At each level of the
**traversal, it attempts to match*.jpg, effectively pruning branches that don't contain matching files.
- It first lists the
This turns the glob operation from a brute-force listing into an intelligent, stateful search that minimizes API calls and data transfer.
Complexity Analysis
-
Time Complexity:
O(N * L), whereNis the number of relevant entries actually visited and L is the average path length. N is kept to a minimum through aggressive pruning. -
Space Complexity:
O(D), whereDis the maximum depth of the search. Memory usage is low as it processes entries as a stream and does not hold the full list in memory.
Additional Context
This guided approach is a significant improvement over the glob implementation found in other popular data access libraries like Python's fsspec. The fsspec implementation follows the "list-then-filter" model, which collects all possible paths into a list, sorts them (O(M log M) complexity), and then filters. This can lead to high memory usage and slow performance on large directories. The proposed guided traversal method avoids these bottlenecks entirely.
I have a proof-of-concept implementation in Rust that demonstrates this logic, producing a lazy Stream of entries, which aligns perfectly with the proposed lister_with().glob() API mentioned in the RFC. This can serve as a strong foundation for a PR.
Are you willing to contribute to the development of this feature?
- [x] Yes, I am willing to contribute to the development of this feature.
@asukaminato0721 @Xuanwo any suggestions or guidance on this, shall I proceed with a PR if this looks good to you?
LGTM! Welcome to open a PR for this.
I think this will be good to live in a GlobLayer that users can opt-in this behavior.
By the way, we don't currently have glob support. Would you like to take on implementing full glob support at the same time? I'd like to see some native service support first to understand how it works, set up integration tests, and then proceed with our own implementation.
Would you like to take on implementing full glob support at the same time
Yes, I was thinking of doing this for core and python bindings in 1 PR
I'd like to see some native service support first to understand how it works
I might be wrong here, but from most of the services supported by opendal,
- Only Fs (local filesystem), HDFS (via globStatus, but need to recheck because of this) and GCS (via explicit matchGlob) provide server-side globbing facilities.
- S3/ MinIO/ COS/ OSS/ Azure Blob/ etc offer prefix/ delimiter listing only, so OpenDAL must list and then filter client-side to emulate glob.   
- FTP/ SFTP/ WebDAV/ HTTP — behavior depends on server; don’t rely on standard glob support.
- DB backends - no objects so, most like this will be a SQL
LIKE/RLIKEpushdown
So, I was thinking that we can do a "best effort" glob implementation in opendal directly like how fsspec did it, and later the service can simply override it if required.
LGTM
Yes, I was thinking of doing this for core and python bindings in 1 PR
Really appreciated, but it would be better to split this into multiple PRs to reduce the review effort. I'd love for us to move quickly.
For example, we can add glob() in OpList first and capability support first.
So, I was thinking that we can do a "best effort" glob implementation in opendal directly like how fsspec did it, and later the service can simply override it if required.
The simulated logic in OpenDAL is quite complex to maintain. My expectation is that we can split these simulated operations, such as create_dir for S3 or stat a dir for S3, into separate layers, allowing users to enable only the ones they need. In the same way, we could implement the functionality in a GlobLayer so users can activate it only when required.
Really appreciated, but it would be better to split this into multiple PRs to reduce the review effort. I'd love for us to move quickly.
Sure, will do.
The simulated logic in OpenDAL is quite complex to maintain. My expectation is that we can split these simulated operations, such as create_dir for S3 or stat a dir for S3, into separate layers, allowing users to enable only the ones they need. In the same way, we could implement the functionality in a GlobLayer so users can activate it only when required.
I wasn;t able to follow this completely, i thought the different operations were already part of the backends. Like the lister, stat for S3.
or is this related to opfs ?
I wasn;t able to follow this completely, i thought the different operations were already part of the backends. Like the lister, stat for S3.
I mean the logic like https://github.com/apache/opendal/blob/b018ba94c256133f12905e0f95e1f79c0f562999/core/src/layers/complete.rs#L104-L190
This is the place where we implement simulation logic for all services that lack native support.
Okay got it.
I'll add glob to the complete layer, will raise a PR soon
Okay got it.
I'll add glob to the complete layer, will raise a PR soon
Sorry I didn't make it clear. I mean the CompleteLayer is so complicated that I wish we can split it in parts, like in different layers. So I'm thinking that we can build a GlobLayer that adding list with glob support for all services that don't have native support.
What do you think? Or you prefer to land it just in CompleteLayer?
Ohh, i get it now.
I'll suggest adding glob to the Complete layer in one PR and then splitting the Complete layer in another and then adding to the py bindings in another one.
Ohh, i get it now.
I'll suggest adding glob to the Complete layer in one PR and then splitting the Complete layer in another and then adding to the py bindings in another one.
Looks good. Let's rock!
Hi @Xuanwo , i was working on this today and noticed that adding glob to just complete.rs is not enough, it needs to be add to layer.rs and accessor.rs as well. which leads makes some other layer complain, where glob must be added as well. before making all the changes, just wanted to check with you if this is fine?
i was working on this today and noticed that adding glob to just complete.rs is not enough, it needs to be add to layer.rs and accessor.rs as well. which leads makes some other layer complain, where glob must be added as well. before making all the changes, just wanted to check with you if this is fine?
I'm interested in why, since I used to think it wouldn't touch other APIs.