opendal icon indicating copy to clipboard operation
opendal copied to clipboard

new feature: Efficient Client-Side Glob Implementation via Guided Traversal

Open chitralverma opened this issue 3 months ago • 13 comments

Feature Description

The glob_support RFC (apache/opendal#6209) proposes a client-side fallback for services that don't support native globbing. The default implementation for this would be to list all entries recursively and then filter them, which can be inefficient for large directories. This issue proposes a more advanced client-side implementation that avoids these inefficiencies.

Solution Description I propose implementing the client-side glob functionality using a "Guided Traversal" algorithm. This approach is significantly more efficient than a simple "list-then-filter" strategy, especially for remote object stores where API calls are expensive.

The core idea is to parse the glob pattern into segments and use these segments to intelligently guide the directory traversal step-by-step, pruning the search space at each level.

Problem and Solution

How it Works For a pattern like **/*.jpg starting from a base_path s3://bucket/media/:

  1. Parse Pattern: The pattern is broken down into segments: ['media', '**', '*.jpg'].
  2. Stateful Traversal: A queue manages the search state, holding (path_to_search, remaining_segments) tuples.
  3. Guided API Calls: Instead of one large recursive list, the algorithm performs a series of targeted, single-level list calls.
    • It first lists the base_path to find a directory named media.
    • Once inside media/, it handles the ** by recursively exploring subdirectories, looking for entries that could satisfy the next segment (*.jpg).
    • At each level of the ** traversal, it attempts to match *.jpg, effectively pruning branches that don't contain matching files.

This turns the glob operation from a brute-force listing into an intelligent, stateful search that minimizes API calls and data transfer.

Complexity Analysis

  • Time Complexity: O(N * L), where N is the number of relevant entries actually visited and L is the average path length. N is kept to a minimum through aggressive pruning.

  • Space Complexity: O(D), where D is the maximum depth of the search. Memory usage is low as it processes entries as a stream and does not hold the full list in memory.

Additional Context

This guided approach is a significant improvement over the glob implementation found in other popular data access libraries like Python's fsspec. The fsspec implementation follows the "list-then-filter" model, which collects all possible paths into a list, sorts them (O(M log M) complexity), and then filters. This can lead to high memory usage and slow performance on large directories. The proposed guided traversal method avoids these bottlenecks entirely.

I have a proof-of-concept implementation in Rust that demonstrates this logic, producing a lazy Stream of entries, which aligns perfectly with the proposed lister_with().glob() API mentioned in the RFC. This can serve as a strong foundation for a PR.

Are you willing to contribute to the development of this feature?

  • [x] Yes, I am willing to contribute to the development of this feature.

chitralverma avatar Aug 28 '25 05:08 chitralverma

@asukaminato0721 @Xuanwo any suggestions or guidance on this, shall I proceed with a PR if this looks good to you?

chitralverma avatar Sep 04 '25 13:09 chitralverma

LGTM! Welcome to open a PR for this.

I think this will be good to live in a GlobLayer that users can opt-in this behavior.

By the way, we don't currently have glob support. Would you like to take on implementing full glob support at the same time? I'd like to see some native service support first to understand how it works, set up integration tests, and then proceed with our own implementation.

Xuanwo avatar Sep 04 '25 16:09 Xuanwo

Would you like to take on implementing full glob support at the same time

Yes, I was thinking of doing this for core and python bindings in 1 PR

I'd like to see some native service support first to understand how it works

I might be wrong here, but from most of the services supported by opendal,

  • Only Fs (local filesystem), HDFS (via globStatus, but need to recheck because of this) and GCS (via explicit matchGlob) provide server-side globbing facilities.
  • S3/ MinIO/ COS/ OSS/ Azure Blob/ etc offer prefix/ delimiter listing only, so OpenDAL must list and then filter client-side to emulate glob.   
  • FTP/ SFTP/ WebDAV/ HTTP — behavior depends on server; don’t rely on standard glob support.
  • DB backends - no objects so, most like this will be a SQL LIKE/ RLIKE pushdown

So, I was thinking that we can do a "best effort" glob implementation in opendal directly like how fsspec did it, and later the service can simply override it if required.

chitralverma avatar Sep 04 '25 17:09 chitralverma

LGTM

asukaminato0721 avatar Sep 04 '25 23:09 asukaminato0721

Yes, I was thinking of doing this for core and python bindings in 1 PR

Really appreciated, but it would be better to split this into multiple PRs to reduce the review effort. I'd love for us to move quickly.

For example, we can add glob() in OpList first and capability support first.

So, I was thinking that we can do a "best effort" glob implementation in opendal directly like how fsspec did it, and later the service can simply override it if required.

The simulated logic in OpenDAL is quite complex to maintain. My expectation is that we can split these simulated operations, such as create_dir for S3 or stat a dir for S3, into separate layers, allowing users to enable only the ones they need. In the same way, we could implement the functionality in a GlobLayer so users can activate it only when required.

Xuanwo avatar Sep 05 '25 05:09 Xuanwo

Really appreciated, but it would be better to split this into multiple PRs to reduce the review effort. I'd love for us to move quickly.

Sure, will do.

The simulated logic in OpenDAL is quite complex to maintain. My expectation is that we can split these simulated operations, such as create_dir for S3 or stat a dir for S3, into separate layers, allowing users to enable only the ones they need. In the same way, we could implement the functionality in a GlobLayer so users can activate it only when required.

I wasn;t able to follow this completely, i thought the different operations were already part of the backends. Like the lister, stat for S3.

or is this related to opfs ?

chitralverma avatar Sep 05 '25 09:09 chitralverma

I wasn;t able to follow this completely, i thought the different operations were already part of the backends. Like the lister, stat for S3.

I mean the logic like https://github.com/apache/opendal/blob/b018ba94c256133f12905e0f95e1f79c0f562999/core/src/layers/complete.rs#L104-L190

This is the place where we implement simulation logic for all services that lack native support.

Xuanwo avatar Sep 05 '25 13:09 Xuanwo

Okay got it.

I'll add glob to the complete layer, will raise a PR soon

chitralverma avatar Sep 06 '25 06:09 chitralverma

Okay got it.

I'll add glob to the complete layer, will raise a PR soon

Sorry I didn't make it clear. I mean the CompleteLayer is so complicated that I wish we can split it in parts, like in different layers. So I'm thinking that we can build a GlobLayer that adding list with glob support for all services that don't have native support.

What do you think? Or you prefer to land it just in CompleteLayer?

Xuanwo avatar Sep 06 '25 09:09 Xuanwo

Ohh, i get it now.

I'll suggest adding glob to the Complete layer in one PR and then splitting the Complete layer in another and then adding to the py bindings in another one.

chitralverma avatar Sep 06 '25 09:09 chitralverma

Ohh, i get it now.

I'll suggest adding glob to the Complete layer in one PR and then splitting the Complete layer in another and then adding to the py bindings in another one.

Looks good. Let's rock!

Xuanwo avatar Sep 06 '25 09:09 Xuanwo

Hi @Xuanwo , i was working on this today and noticed that adding glob to just complete.rs is not enough, it needs to be add to layer.rs and accessor.rs as well. which leads makes some other layer complain, where glob must be added as well. before making all the changes, just wanted to check with you if this is fine?

Image

chitralverma avatar Sep 29 '25 18:09 chitralverma

i was working on this today and noticed that adding glob to just complete.rs is not enough, it needs to be add to layer.rs and accessor.rs as well. which leads makes some other layer complain, where glob must be added as well. before making all the changes, just wanted to check with you if this is fine?

I'm interested in why, since I used to think it wouldn't touch other APIs.

Xuanwo avatar Sep 30 '25 17:09 Xuanwo