
gridftp

Open tinaok opened this issue 2 years ago • 3 comments

Hello,

We are trying to use a small subset of CMIP6 data from an ESGF server. ESGF nodes expose their NetCDF files through several different access services:

from pyesgf.search import SearchConnection
server='https://esgf-data.dkrz.de/esg-search'
conn = SearchConnection(server, distrib=True)
source_id='CMCC-CM2-HR4'
activity_id='OMIP'
experiment_id='omip2'
variable_id='vmo'
ctx = conn.new_context(
    project='CMIP6',
    source_id=source_id,
    experiment_id=experiment_id,
    variable=variable_id,
    frequency='mon',
)
result = ctx.search()[0]
files = result.file_context().search()
files[35].urls

Which gives

defaultdict(list,
            {'HTTPServer': [('http://esgf-node2.cmcc.it/thredds/fileServer/esg_dataroot/CMIP6/OMIP/CMCC/CMCC-CM2-HR4/omip2/r1i1p1f1/Omon/vmo/gn/v20200226/vmo_Omon_CMCC-CM2-HR4_omip2_r1i1p1f1_gn_200801-201812.nc',
               'application/netcdf')],
             'GridFTP': [('gsiftp://esgf-node2.cmcc.it:2811//esg_dataroot/CMIP6/OMIP/CMCC/CMCC-CM2-HR4/omip2/r1i1p1f1/Omon/vmo/gn/v20200226/vmo_Omon_CMCC-CM2-HR4_omip2_r1i1p1f1_gn_200801-201812.nc',
               'application/gridftp')],
             'OPENDAP': [('http://esgf-node2.cmcc.it/thredds/dodsC/esg_dataroot/CMIP6/OMIP/CMCC/CMCC-CM2-HR4/omip2/r1i1p1f1/Omon/vmo/gn/v20200226/vmo_Omon_CMCC-CM2-HR4_omip2_r1i1p1f1_gn_200801-201812.nc.html',
               'application/opendap-html')],
             'Globus': [('globus:4101e3a0-b7df-11eb-a16a-5fad80e6400b/esg_dataroot/CMIP6/OMIP/CMCC/CMCC-CM2-HR4/omip2/r1i1p1f1/Omon/vmo/gn/v20200226/vmo_Omon_CMCC-CM2-HR4_omip2_r1i1p1f1_gn_200801-201812.nc',
               'Globus')]})

We only need a small subset of each NetCDF file, and I would like to build a kerchunk catalogue for it. I can use the HTTPServer link to generate the kerchunk references, but out of curiosity, can kerchunk also handle FTP, OPeNDAP, or GridFTP URLs?
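For reference, the HTTPServer route mentioned above could look roughly like the sketch below. It is not taken from the kerchunk docs: `pick_url` and `make_references` are hypothetical helper names, the file is assumed to be NetCDF4/HDF5 (typical for CMIP6, but not checked here), and `make_references` needs `fsspec`, `h5py` and `kerchunk` installed.

```python
def pick_url(urls, service="HTTPServer"):
    """Return the first URL listed for an ESGF access service, or None.

    `urls` has the shape shown above: {service: [(url, mime_type), ...]}.
    """
    entries = urls.get(service, [])
    return entries[0][0] if entries else None


def make_references(url):
    """Scan one remote NetCDF4/HDF5 file and return a kerchunk reference dict.

    Assumption: the file is HDF5-based and the server allows HTTP range
    requests, so only the metadata needs to be read during the scan.
    """
    # Imported lazily so pick_url() stays usable without these packages.
    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    with fsspec.open(url, "rb") as f:
        return SingleHdf5ToZarr(f, url).translate()


# Shortened stand-in for files[35].urls, using the HTTPServer entry above.
example_urls = {
    "HTTPServer": [
        (
            "http://esgf-node2.cmcc.it/thredds/fileServer/esg_dataroot/CMIP6/"
            "OMIP/CMCC/CMCC-CM2-HR4/omip2/r1i1p1f1/Omon/vmo/gn/v20200226/"
            "vmo_Omon_CMCC-CM2-HR4_omip2_r1i1p1f1_gn_200801-201812.nc",
            "application/netcdf",
        )
    ],
}

http_url = pick_url(example_urls)
# refs = make_references(http_url)  # network access required, so left commented out
```

With the real search result, `pick_url(files[35].urls)` would replace the stand-in dictionary.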

tinaok avatar Oct 21 '22 14:10 tinaok

  • yes, fsspec handles FTP, but it is generally a terrible protocol and I wouldn't expect it to work well
  • I don't know anything about gridftp
  • the OPeNDAP URL points to an HTML page with information from which you can construct an HTTP link to fetch parts of the target file. I don't think there's enough information there to make a kerchunk reference set without scanning the target NetCDF file. OPeNDAP effectively implements the "lazy" part of what kerchunk offers. I think it is backed by a server, unlike kerchunk, whose output is server independent.
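
To illustrate the "construct an HTTP link" part of the last point, here is a minimal sketch. It assumes the server speaks DAP2 (which THREDDS `dodsC` endpoints normally do), where dropping the `.html` suffix gives the dataset URL and a `.dods` request with an index constraint returns binary data for a hyperslab. The helper names are hypothetical, and the four-dimensional shape used for `vmo` in the example is an assumption, not something read from the file.

```python
def opendap_dataset_url(html_url):
    """Strip the '.html' suffix of the OPeNDAP form page to get the dataset URL."""
    suffix = ".html"
    return html_url[: -len(suffix)] if html_url.endswith(suffix) else html_url


def subset_request(dataset_url, variable, slices):
    """Build a DAP2 binary-data request for a hyperslab of one variable.

    `slices` is a list of (start, stop) index pairs, one per dimension,
    with both ends inclusive, as in DAP2 constraint expressions.
    """
    constraint = "".join(f"[{start}:{stop}]" for start, stop in slices)
    return f"{dataset_url}.dods?{variable}{constraint}"


html_url = (
    "http://esgf-node2.cmcc.it/thredds/dodsC/esg_dataroot/CMIP6/OMIP/CMCC/"
    "CMCC-CM2-HR4/omip2/r1i1p1f1/Omon/vmo/gn/v20200226/"
    "vmo_Omon_CMCC-CM2-HR4_omip2_r1i1p1f1_gn_200801-201812.nc.html"
)
dataset_url = opendap_dataset_url(html_url)

# Assumed dimension order (time, level, y, x): first time step, first level,
# and a 10 x 10 corner of the horizontal grid.
request_url = subset_request(dataset_url, "vmo", [(0, 0), (0, 0), (0, 9), (0, 9)])
```

In practice an OPeNDAP-aware client would usually be given `dataset_url` directly and would build such requests itself; the point is only that the subsetting happens on the server for every request.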

martindurant avatar Oct 21 '22 14:10 martindurant

I agree with Tina that being able to support GridFTP would be very nice. GridFTP is well known in some projects (such as the Large Hadron Collider or ESGF), and it is one of the few high-performance data transfer tools.

annefou avatar Nov 15 '22 12:11 annefou

I have had a brief look around, and I can find only one example of a Python GridFTP client, which is very old. Presumably, an fsspec backend could be built on it, and that would enable kerchunk and other remote access from Python. However, most of what I find seems to refer specifically to Globus, as opposed to general GridFTP, in which case presumably https://globus-sdk-python.readthedocs.io/en/stable/ provides everything needed (it looks very complicated!). In any case, the fsspec backend would require development, sorry.

Is anyone using IPFS or other similar technologies? IPFS does already have an fsspec implementation (ipfsspec).

martindurant avatar Nov 16 '22 01:11 martindurant