kerchunk
gridftp
Hello,
We are trying to use a small subset of CMIP6 data from an ESGF server. They expose their NetCDF files in several ways:
from pyesgf.search import SearchConnection
server='https://esgf-data.dkrz.de/esg-search'
conn = SearchConnection(server, distrib=True)
source_id='CMCC-CM2-HR4'
activity_id='OMIP'
experiment_id='omip2'
variable_id='vmo'
ctx = conn.new_context(
    project='CMIP6',
    source_id=source_id,
    experiment_id=experiment_id,
    variable=variable_id,
    frequency='mon',
)
result = ctx.search()[0]
files = result.file_context().search()
files[35].urls
Which gives
defaultdict(list,
{'HTTPServer': [('http://esgf-node2.cmcc.it/thredds/fileServer/esg_dataroot/CMIP6/OMIP/CMCC/CMCC-CM2-HR4/omip2/r1i1p1f1/Omon/vmo/gn/v20200226/vmo_Omon_CMCC-CM2-HR4_omip2_r1i1p1f1_gn_200801-201812.nc',
'application/netcdf')],
'GridFTP': [('gsiftp://esgf-node2.cmcc.it:2811//esg_dataroot/CMIP6/OMIP/CMCC/CMCC-CM2-HR4/omip2/r1i1p1f1/Omon/vmo/gn/v20200226/vmo_Omon_CMCC-CM2-HR4_omip2_r1i1p1f1_gn_200801-201812.nc',
'application/gridftp')],
'OPENDAP': [('http://esgf-node2.cmcc.it/thredds/dodsC/esg_dataroot/CMIP6/OMIP/CMCC/CMCC-CM2-HR4/omip2/r1i1p1f1/Omon/vmo/gn/v20200226/vmo_Omon_CMCC-CM2-HR4_omip2_r1i1p1f1_gn_200801-201812.nc.html',
'application/opendap-html')],
'Globus': [('globus:4101e3a0-b7df-11eb-a16a-5fad80e6400b/esg_dataroot/CMIP6/OMIP/CMCC/CMCC-CM2-HR4/omip2/r1i1p1f1/Omon/vmo/gn/v20200226/vmo_Omon_CMCC-CM2-HR4_omip2_r1i1p1f1_gn_200801-201812.nc',
'Globus')]})
We only need a small subset of the NetCDF file, and I would like to make a kerchunk catalogue of it. I can use the HTTPServer link to turn it into a kerchunk catalogue, but just out of curiosity, can kerchunk also handle 'ftp', 'opendap' or 'gridftp'?
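For reference, the HTTPServer route mentioned above could look roughly like the sketch below. It assumes the file is HDF5-based NetCDF4 and that `fsspec`, `kerchunk`, `h5py` and `aiohttp` are installed. The function names and the `inline_threshold` value are illustrative choices, not something from the original post.

```python
import json


def build_references(url, inline_threshold=300):
    """Scan one remote NetCDF4/HDF5 file and return a kerchunk reference dict."""
    # Third-party imports are kept local so this module loads without them.
    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    # Only metadata and small chunks are read; large chunks are recorded
    # as (url, offset, length) references.
    with fsspec.open(url, mode="rb") as f:
        return SingleHdf5ToZarr(f, url, inline_threshold=inline_threshold).translate()


def save_references(refs, path):
    """Write the reference dict to a JSON file."""
    with open(path, "w") as out:
        json.dump(refs, out)


def xarray_kwargs(ref_path, remote_protocol="http"):
    """Keyword arguments for xarray.open_dataset("reference://", ...)."""
    return {
        "engine": "zarr",
        "backend_kwargs": {
            "consolidated": False,
            "storage_options": {"fo": ref_path, "remote_protocol": remote_protocol},
        },
    }
```

With the HTTPServer URL from the listing above, `save_references(build_references(url), "vmo.json")` would write the catalogue once, and `xarray.open_dataset("reference://", **xarray_kwargs("vmo.json"))` should then open it lazily. The exact keyword arguments may differ between xarray and zarr versions.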
- yes, fsspec handles FTP, but it is generally a terrible protocol and I wouldn't expect it to work well
- I don't know anything about gridftp
- the OPeNDAP URL points to an HTML page from which you can construct an HTTP link to fetch parts of the target file. I don't think there's enough information there to make a kerchunk reference set without scanning the target NetCDF file. OPeNDAP effectively implements the "lazy" part of what kerchunk offers, but I think it is backed by a server, unlike kerchunk, whose output is server-independent.
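To illustrate the OPeNDAP point: on THREDDS servers the `.html` URL is the data access form, and dropping that suffix gives the dataset URL that DAP clients accept. A minimal sketch, using only the standard library and the `urls` mapping shape shown in the question (the helper names are made up for illustration):

```python
def pick_url(urls, service):
    """Return the first URL listed for an access service, or None if absent."""
    entries = urls.get(service, [])
    return entries[0][0] if entries else None


def opendap_data_url(html_url):
    """Drop a trailing '.html' so the URL names the DAP dataset itself."""
    suffix = ".html"
    return html_url[: -len(suffix)] if html_url.endswith(suffix) else html_url
```

Assuming `netCDF4` or `pydap` is installed, `xarray.open_dataset(opendap_data_url(pick_url(files[35].urls, "OPENDAP")))` should then give lazy, server-side subsetting without any kerchunk references.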
I agree with Tina that being able to support GridFTP would be very nice. GridFTP is well known in some projects (such as the Large Hadron Collider or ESGF), and it is one of the few high-performance data transfer tools.
I have had a brief look around, and I can find one example of a Python GridFTP client, which is very old. Presumably an fsspec backend could be built on it, which would enable kerchunk and other remote access from Python. However, most of what I find refers specifically to Globus rather than general GridFTP, in which case https://globus-sdk-python.readthedocs.io/en/stable/ presumably provides everything needed (it looks very complicated!). In any case, the fsspec backend would require development, sorry.
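To give a sense of what "an fsspec backend" would involve, here is a hypothetical read-only skeleton. The class name, the `gsiftp` protocol string and the default port (taken from the URL in the question) are assumptions, and no real GridFTP client is wired in: every I/O method just raises.

```python
import fsspec
from fsspec.spec import AbstractFileSystem


class GridFTPFileSystem(AbstractFileSystem):
    """Hypothetical skeleton; the transfer logic is deliberately missing."""

    protocol = "gsiftp"

    def __init__(self, host=None, port=2811, **kwargs):
        super().__init__(**kwargs)
        self.host = host
        self.port = port

    def info(self, path, **kwargs):
        # Would return at least {"name": ..., "size": ..., "type": "file"}.
        raise NotImplementedError("needs a GridFTP client to stat remote files")

    def ls(self, path, detail=True, **kwargs):
        raise NotImplementedError("needs a GridFTP client to list directories")

    def cat_file(self, path, start=None, end=None, **kwargs):
        # Byte-range reads are what chunk-wise remote access relies on.
        raise NotImplementedError("needs a GridFTP client for ranged reads")


# Make the protocol name resolvable through fsspec's registry.
fsspec.register_implementation("gsiftp", GridFTPFileSystem, clobber=True)
```

Once those methods were backed by a working client, `fsspec.filesystem("gsiftp", host=...)` would resolve to this class, and in principle tools built on fsspec could use it.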
Is anyone using IPFS or other similar technologies? IPFS does already have an fsspec implementation (ipfsspec).