satpy icon indicating copy to clipboard operation
satpy copied to clipboard

Better handling of fsspec.filesystem in Scene

Open BENR0 opened this issue 3 years ago • 6 comments

Feature Request

Is your feature request related to a problem? Please describe. Currently multiple readers already support remote files and hopefully this number will rise in the future. While it is documented how to use remote filesystems with supported readers, from my point of view it is rather "complex" and relies on URL chaining.

Describe the solution you'd like I think it would be nice if a fsspec.filesystem could be handed to either the Scene itself or the the reader_kwargs too. Using GOES data as an example there are currently two options when an fsspec.spec.AbstractFileSystem is is used to find files.

fs = fsspec.filesystem("filecache", target_protocol="s3", target_options={"anon":True}, cache_storage=CACHE_DIRECTORY, same_names=True)

file_list = list(fs.glob("noaa-goes16/ABI-L1b-RadF/2021/244/00/*.nc"))

# need to specify storage options again
reader_kwargs = {"storage_options": {
    "s3": {"anon": True},
    "filecache": { "cache_storage": CACHE_DIRECTORY,
                   "same_names": True}
}}

# have to add protocol to file string again
file_list = ["filecache::s3//" + f for f in file_list]
scn = Scene(reader="abi_l1b", filenames=file_list, reader_kwargs=reader_kwargs)

or, which is already a little nicer, but not directly documented:

file_list = [FSFile(f, fs=fs) for f in file_list]
scn = Scene(reader="abi_l1b", filenames=file_list)

It would be nice if one of these two would work:


scn = Scene(reader="abi_l1b", filenames=file_list, fs=fs)

# or
scn = Scene(reader="abi_l1b", filenames=file_list, reader_kwargs={"fs":fs})

This would also be more in line with the logic proposed in PR# gerrits find filesandreaders.

Describe any changes to existing user workflow The current behaviour can obviously be conserved so no incompatibilites are expected.

Additional context Have you considered any alternative solutions or is there anything else that would help describe your request.

BENR0 avatar Aug 16 '22 10:08 BENR0

See if https://github.com/pytroll/satpy/pull/2096 provided something that is more usable for you? With it, this should work:

filenames = [
    'simplecache::s3://satellite-data-eumetcast-seviri-rss/H-000-MSG3*-202204260855-*',
]
scn = Scene(reader='seviri_l1b_hrit', filenames=filenames)
scn.load(list_of_datasets)

Some additional information is documented here: https://satpy.readthedocs.io/en/stable/remote_reading.html

pnuu avatar Aug 16 '22 11:08 pnuu

@pnuu yes that is exactly what I am refering to. If I don't have the file path as a URL chained string but instead use glob to find files with fsspec.filesystem currently I need to use one of the "work around" steps I described above. What I would like is some changes to the feature introduced in #2096 to be able to hand over a fsspec.filesystem and the list of files returned by the glob of this filesystem. Basically I propose to change line https://github.com/pytroll/satpy/blob/acd074530f1eeaa860c1d18de22ecbfd75be8166/satpy/utils.py#L657 to [FSFile(f, fs=fs) for f in fsspec_files] where fs is a fsspec.filesystem. The filesystem has to be handed over for example within the storage_options dict.

By the way opening the files before handing them over to FSFile is not strictly necessary (https://github.com/pytroll/satpy/blob/acd074530f1eeaa860c1d18de22ecbfd75be8166/satpy/utils.py#L656).

BENR0 avatar Aug 16 '22 13:08 BENR0

What if you don't use glob directly, but just pass the glob pattern as filename like I showed above?

pnuu avatar Aug 16 '22 13:08 pnuu

Yes of course that works but it still think this could be improved for a number of reasons:

  • it generally limits the use of other features of fsspec.filesystems (e.g. use find instead of glob, etc)
  • the Scene loading and the downloading of the data can't be decoupled
  • groupby's on the file list are not possible
  • arguments to the filesystem have to be passed as a dictionary, which is "counter intuitive syntax" to people familiar with fsspec.filesystem
  • the usage design is different from Satpy's own find_files_and_readers function which accepts an fs argument

Sorry if I am annoying about this. Is there any argument against also allowing passing a filesystem for example in reader_kwargs while still keeping the current behaviour for backwards compatibility?

BENR0 avatar Aug 18 '22 07:08 BENR0

@BENR0 This suggestion was actually very similar to what I originally proposed (pass a filesystem object down to the reader(s)) but was eventually overruled by Martin's idea to have an object that wraps this "stuff". The end goal for me and why I was OK losing this argument was for two reasons (at least that I can think of right now):

  1. It avoids more keyword arguments being added to all levels of Satpy's data loading and file handling. If we ever needed to handle some other type of input or some other type of protocol then that's more kwargs to all those levels. Or with FSFile it would be another subclass or another keyword arg to just that class.
  2. My hope was that with a few updates to fsspec in the future that satpy's FSFile could be replaced by fsspec's OpenFile or something similar: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.core.OpenFile

That doesn't mean that I disagree with what you're hoping for here, but I wanted to point out some of the reasons I remember.

djhoese avatar Aug 19 '22 01:08 djhoese

@djhoese yes I totally agree with point one. Even though I mentioned adding a fs keyword to the Scene as an option I personally don't like that idea that much exactly for the reason you mentioned. That's why I think, even though it might look a little "hacky", the "feature" I am looking for could be added without to much complication by adding the fs keyword to the `reader_kwargs" dict and expand the current logic implemented by https://github.com/pytroll/satpy/pull/2096 to account for that. This would keep it backwards compatible. So a user could hand over a filesystem or could use the current way of handing over the caching arguments.

As for point two I don't know what the timeline on something like that would be. I am not really involved and also know to little about fsspec development. Of course that would be even nicer though in the long run.

BENR0 avatar Aug 22 '22 08:08 BENR0