filesystem_spec

batch downloads of mixed filesystems

Open matthewhanson opened this issue 3 years ago • 3 comments

Hello, I'm using fsspec with STAC metadata as a way to download assets in STAC Items in a standard way. These assets are most commonly http or s3 but could potentially be others.

A common scenario is objects in a requester pays bucket. In STAC though, we may have multiple assets we want to download where some are s3 in requester pays buckets and some aren't, or are http. See this Sentinel item as an example.

I'm also implementing this to support async downloads.

The problem is that I want to download all assets from an item (or maybe even multiple items). I currently loop through the assets, use url_to_fs to create the fs object, and create an async task with it. For requester pays buckets we need to pass requester_pays, but we can't pass it to url_to_fs when the URL turns out to be https, because it throws an error.

The relevant code is here.

I'm looking for any suggestions on how to handle this without having to pre-determine the fs myself from the URL. url_to_fs is of great utility, but what I really want is for it to ignore keywords that aren't relevant for the resulting fs.
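To make the request concrete, here is a minimal sketch of the behaviour I have in mind, written as a wrapper around url_to_fs. The names STORAGE_OPTIONS, options_for and open_fs are hypothetical and not part of fsspec; the wrapper only inspects the URL scheme to pick keyword arguments and still leaves the choice of filesystem class to url_to_fs.

```python
from urllib.parse import urlsplit

# Hypothetical per-protocol options; only the "s3" entry reflects the
# requester-pays scenario described above.
STORAGE_OPTIONS = {
    "s3": {"requester_pays": True},
    "http": {},
    "https": {},
}


def options_for(url, storage_options=STORAGE_OPTIONS):
    """Return only the keyword arguments registered for the URL's protocol."""
    # URLs without a scheme are treated as local paths here (an assumption).
    protocol = urlsplit(url).scheme or "file"
    return dict(storage_options.get(protocol, {}))


def open_fs(url, storage_options=STORAGE_OPTIONS):
    """Resolve (fs, path) for `url`, passing only protocol-relevant options."""
    from fsspec.core import url_to_fs  # imported lazily; requires fsspec

    return url_to_fs(url, **options_for(url, storage_options))
```

With this, an https URL never sees requester_pays, while every s3 URL does, which is still too coarse when public and requester pays buckets are mixed.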

Thanks in advance for any thoughts!

matthewhanson, Oct 07 '22 19:10

The generic filesystem's cat method will dispatch asynchronously on the URL protocols of a list of URLs to the right filesystem instance, and there are a number of ways to specify exactly which instance is used for which protocol. Unfortunately, it does not yet have a get/download method, but looking at the code for _cat_file, it ought to be very simple to implement. This would not solve the case where you have multiple "s3" backends you want to use, unless you differentiate them by giving them a new protocol. In this case "s3a" would be a natural choice, since that is already an alias for "s3".
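A rough sketch of the "s3a" idea, under some assumptions: it relies on set_generic_fs and the default_method="generic" option from fsspec.generic as I understand them, so check both against your installed version. The bucket name and the helpers relabel and cat_all are hypothetical, and s3fs plus an HTTP backend must be installed for cat_all to run.

```python
from urllib.parse import urlsplit

# Hypothetical bucket name, used only for illustration.
REQUESTER_PAYS_BUCKETS = {"example-requester-pays-bucket"}


def relabel(url, requester_pays_buckets=REQUESTER_PAYS_BUCKETS):
    """Rewrite s3:// to s3a:// for buckets that need requester-pays access."""
    parts = urlsplit(url)
    if parts.scheme == "s3" and parts.netloc in requester_pays_buckets:
        return "s3a://" + url[len("s3://"):]
    return url


def cat_all(urls):
    """Fetch every URL through the generic filesystem in one call."""
    import fsspec
    from fsspec.generic import set_generic_fs  # assumed available in your version

    # Register one instance per protocol label; "s3a" is the alias that
    # carries the requester-pays setting.
    set_generic_fs("s3")
    set_generic_fs("s3a", requester_pays=True)
    set_generic_fs("http")
    set_generic_fs("https")

    fs = fsspec.filesystem("generic", default_method="generic")
    # Given a list, cat returns a dict keyed by path.
    return fs.cat([relabel(u) for u in urls])
```

The relabelling step is what lets two differently configured S3 instances coexist: the generic filesystem only sees two distinct protocols.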

martindurant, Oct 07 '22 20:10

Thanks @martindurant, good to know about cat. For now this works as long as I don't mix requester pays and public s3 URLs.

Is each fs instance from cat its own session? url_to_fs must be opening a session for each one and I'm not really sure how to properly close them.

matthewhanson, Oct 08 '22 20:10

Is each fs instance from cat its own session?

No, fsspec caches filesystem instances by default, so if you call filesystem() with the same arguments, you do not make a new instance, and keep any open connection pool. In the case of using cat with different filesystems via the generic filesystem, you will have an instance for each backend, which should have independent connection pools/sessions, but be running on the same event loop in the same thread.
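The caching is easy to check with the local filesystem, which needs no credentials or network. This is a small sketch; demo_instance_cache is a hypothetical helper name, and the import sits inside the function only so the snippet can be loaded without fsspec installed.

```python
def demo_instance_cache():
    """Compare identities of filesystem instances created in different ways."""
    import fsspec  # requires fsspec

    a = fsspec.filesystem("file")
    b = fsspec.filesystem("file")  # same arguments as `a`
    c = fsspec.filesystem("file", auto_mkdir=True)  # different arguments
    d = fsspec.filesystem("file", skip_instance_cache=True)  # cache bypassed
    return a is b, a is c, a is d
```

Only the first comparison should be true: identical arguments return the cached instance, while different arguments or skip_instance_cache=True produce a new one.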

martindurant, Oct 11 '22 14:10