kerchunk

Explain target_protocol and remote_protocol

Open rabernat opened this issue 3 years ago • 3 comments

I'm trying to understand fsspec-reference-maker better.

Consider the following code from the pangeo-forge hdf-reference tutorial:

m = fsspec.get_mapper(
    "reference://",
    fo=ref_url,
    target_protocol="file",
    remote_protocol="s3",
    skip_instance_cache=True,
)

Please help me understand the need for target_protocol and remote_protocol

target_protocol

AFAICT, this is the name of the protocol needed to open ref_url. Is this always needed? What if the protocol is already in ref_url, e.g. `ref_url = "https://..."`? What if the two are inconsistent?

Why don't we just infer the target_protocol from ref_url?

remote_protocol

AFAICT, this is the protocol used for opening the underlying data files that the references point to. But that is already encoded in the reference file! Here, the beginning of the file at ref_url is:

{"version": 1, "templates": {"a": "s3://esgf-world/CMIP6/OMIP/NOAA-GFDL/GFDL-CM4/omip1/r1i1p1f1/Omon/thetao/gr/v20180701/thetao_Omon_GFDL-CM4_omip1_r1i1p1f1_gr_170801-172712.nc", "b": "s3://esgf-world/CMIP6/OMIP/NOAA-GFDL/GFDL-CM4/omip1/r1i1p1f1/Omon/thetao/gr/v20180701/thetao_Omon_GFDL-CM4_omip1_r1i1p1f1_gr_172801-174712.nc", "c": "s3://esgf-world/CMIP6/OMIP/NOAA-GFDL/GFDL-CM4/omip1/r1i1p1f1/Omon/thetao/gr/v20180701/thetao_Omon_GFDL-CM4_omip1_r1i1p1f1_gr_174801-176712.nc", "d": "s3://esgf-world/CMIP6/OMIP/NOAA-GFDL/GFDL-CM4/om ...

All of those s3://s mean that the data should be read with s3 protocol. So why do we also need to specify remote_protocol? What if I put remote_protocol='gcs' but the actual references are s3? Wouldn't this cause problems?
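To make the mismatch concern concrete, here is a minimal sketch that collects the protocols appearing in a version 1 reference set and compares them with a user-supplied remote_protocol. It is not kerchunk or fsspec code: the helper `reference_protocols`, the example bucket and file names, and the assumption that chunk references are stored under "refs" as `[url, offset, length]` lists are all for illustration only.

```python
import re

# Matches a leading "protocol://" prefix such as "s3://" or "https://"
_PROTOCOL = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*)://")


def reference_protocols(reference_set):
    """Return the set of protocols found in a reference set (hypothetical helper).

    A URL without any prefix contributes None to the result.
    """
    urls = list(reference_set.get("templates", {}).values())
    for value in reference_set.get("refs", {}).values():
        # Inline data is a plain string; remote chunks are lists whose
        # first element is the URL (assumed layout, see lead-in).
        if isinstance(value, list) and value:
            urls.append(value[0])
    found = set()
    for url in urls:
        if url.startswith("{{"):
            continue  # templated URL, already covered by the templates entry
        match = _PROTOCOL.match(url)
        found.add(match.group(1) if match else None)
    return found


# Made-up reference set: one templated s3 file and one prefix-free path
example = {
    "version": 1,
    "templates": {"a": "s3://example-bucket/file_a.nc"},
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "thetao/0.0.0": ["{{a}}", 1000, 200],
        "thetao/1.0.0": ["local/file_b.nc", 0, 200],
    },
}

remote_protocol = "gcs"
protocols = reference_protocols(example)
# Protocols named in the references that disagree with remote_protocol
mismatched = {p for p in protocols if p is not None and p != remote_protocol}
```

With this example, `protocols` contains both "s3" and None (the prefix-free path), and `mismatched` contains "s3", which is exactly the situation asked about above.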


In summary, it feels to me like both these specifiers are redundant and therefore a source of potential bugs. But I'm sure I'm missing something.

In any case, the documentation on these options could be improved.

rabernat avatar Sep 28 '21 00:09 rabernat

Your understanding of the two protocols is correct. Perhaps indeed they could be considered redundant. However:

  • the target or remote URLs can be provided without a protocol prefix
  • there are not necessarily any templates, so we would have to find the prefix from the references themselves
  • the protocol and URL might differ when you want chained filesystems, such as caching (but this can also be done via the storage options)

The three points are each pretty weak, so simplicity might argue for removal. Note that I do want to eventually allow the remote to be more than one filesystem, but that could get complicated (especially if one is async and one is not).

martindurant avatar Sep 28 '21 14:09 martindurant

I think, from a user perspective, it might be best to make remote_protocol (and maybe target_protocol) optional kwargs, attempt to parse the URL by default, and warn/error if we're not able to determine the protocol from the URL.

lsterzinger avatar Sep 30 '21 18:09 lsterzinger

Agreed. In fsspec.open, protocol is optional, and None means the protocol is parsed from the URL. The same behaviour can be the default here.

martindurant avatar Oct 01 '21 17:10 martindurant
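The behaviour proposed in the last two comments (an optional protocol argument that falls back to URL parsing, with a warning when parsing fails or the two disagree) can be sketched as follows. This is only an illustration under stated assumptions: `resolve_protocol` and its `default` parameter are hypothetical names, not part of fsspec or kerchunk, and the sketch does not handle protocol aliases or chained URLs that use the `::` separator.

```python
import re
import warnings

# Matches a leading "protocol://" prefix such as "s3://" or "https://"
_PROTOCOL = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*)://")


def resolve_protocol(url, protocol=None, default="file"):
    """Pick the protocol for url (hypothetical helper).

    protocol=None means "infer from the URL"; an explicit value wins,
    but a warning is issued if it disagrees with the URL prefix.
    """
    match = _PROTOCOL.match(url)
    inferred = match.group(1) if match else None

    if protocol is None:
        if inferred is None:
            # Prefix-free path: nothing to parse, so fall back and say so
            warnings.warn(
                f"No protocol found in {url!r}; falling back to {default!r}"
            )
            return default
        return inferred

    if inferred is not None and inferred != protocol:
        warnings.warn(
            f"Explicit protocol {protocol!r} differs from {inferred!r} in {url!r}"
        )
    return protocol
```

Under these assumptions, `resolve_protocol("https://example.com/refs.json")` returns "https" with no warning, `resolve_protocol("refs.json")` warns and returns "file", and `resolve_protocol("s3://bucket/key", protocol="gcs")` warns about the inconsistency and returns "gcs". Whether a mismatch should warn or raise is the open design question from the discussion above.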