YAXArrays.jl
Digesting intake catalogs
Hello!
I see a lot of tutorials that use intake (a Python package) for data loading, especially for remote datasets such as CMIP6, ERA5, etc. How hard would it be to load those datasets into YAXArrays.jl? I am not a specialist in intake, but linking it with YAXArrays.jl could potentially open up lots of cloud-hosted datasets from around the world.
For example
https://hydrocloudservices.github.io/catalogs/notebooks/ipynb/atmosphere.html
http://gallery.pangeo.io/repos/pangeo-gallery/cmip6/intake_ESM_example.html
edit - intake-xarray : https://intake-xarray.readthedocs.io/en/latest/index.html
Best regards!
You mean like here:
https://juliadatacubes.github.io/YAXArrays.jl/dev/examples/generated/Gallery/simplemaps/
https://juliadatacubes.github.io/YAXArrays.jl/dev/examples/generated/UserGuide/openZarr/
I'm also currently working on a tutorial that uses ERA5 directly from the cloud store. [A bit more information in the tutorial would probably be nice to have.]
Yes, like this, but I guess it is simply a matter of building the gs/https/S3 link myself from the information in the YAML catalog. But that would basically mean rewriting the intake library and hoping to stay consistent with upstream features.
I'm wondering how hard it would be to have something like:
using PyCall
using Zarr, YAXArrays
intake = pyimport("intake")
catalog_url = "https://raw.githubusercontent.com/hydrocloudservices/catalogs/main/catalogs/main.yaml"
cat = intake.open_catalog(catalog_url)  # named `cat` to match its use below
g=open_dataset(cat.atmosphere.era5_reanalysis_single_levels())
# where
(Julia) > cat.atmosphere.era5_reanalysis_single_levels()
PyObject sources:
  era5_reanalysis_single_levels:
    args:
      consolidated: true
      engine: zarr
      storage_options:
        anon: true
        client_kwargs:
          endpoint_url: https://s3.wasabisys.com
          region_name: us-east-1
        config_kwargs:
          max_pool_connections: 30
      urlpath:
      - s3://era5/world/reanalysis/single-levels/zarr/timeseries/archive
    description: ERA5 hourly estimates of variables on single levels chunked for time
      series analysis
    driver: intake_xarray.xzarr.ZarrSource
    metadata:
      catalog_dir: https://raw.githubusercontent.com/hydrocloudservices/catalogs/main/catalogs
      status:
      - prod
      tags:
      - ocean
      - model
      - atmosphere
      url: https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels
I do have a "working" behaviour constructed by hand, but I am not sure how robust it is compared to standards. For example, based on the same catalogue:
using PyCall
using Zarr, YAXArrays
intake = pyimport("intake")
catalog_url = "https://raw.githubusercontent.com/hydrocloudservices/catalogs/main/catalogs/main.yaml"
cat = intake.open_catalog(catalog_url)  # named `cat` to match its use below
# Pointing directly to ERA5
datastore = cat.atmosphere.era5_reanalysis_single_levels()
# Building an https URL
Store = string("https://s3.",datastore[:storage_options]["client_kwargs"]["region_name"],".wasabisys.com/",datastore[:urlpath][1][6:end])
"https://s3.us-east-1.wasabisys.com/era5/world/reanalysis/single-levels/zarr/timeseries/archive"
g=open_dataset(zopen(Store, consolidated=true))
YAXArray Dataset
Dimensions:
longitude Axis with 1440 Elements from -180.0 to 179.75
latitude Axis with 721 Elements from 90.0 to -90.0
time Axis with 368184 Elements from 1979-01-01T00:00:00 to 2020-12-31T23:00:00
Variables: tp t2m
Properties: source => Reanalysis title => ERA5 forecasts institution => ECMWF
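For what it's worth, the hand-built URL above could be wrapped in a small helper so the string slicing is at least checked. This is only a sketch: `wasabi_https_url` is a hypothetical name, not part of YAXArrays.jl, Zarr.jl, or intake, and it assumes the same Wasabi-style `https://s3.<region>.wasabisys.com/<bucket>/<key>` layout used in the snippet above. Other providers will likely need a different pattern.

```julia
# Hypothetical helper (not part of YAXArrays.jl, Zarr.jl or intake).
# Mirrors the manual string construction above for a Wasabi-hosted store.
function wasabi_https_url(urlpath::AbstractString, region::AbstractString)
    prefix = "s3://"
    # Fail early if the catalog entry is not an s3:// path
    startswith(urlpath, prefix) ||
        throw(ArgumentError("expected an s3:// path, got: $urlpath"))
    # Drop the scheme and keep "bucket/key"
    bucket_and_key = urlpath[(length(prefix) + 1):end]
    return string("https://s3.", region, ".wasabisys.com/", bucket_and_key)
end

# Values copied from the catalog entry shown earlier
url = wasabi_https_url(
    "s3://era5/world/reanalysis/single-levels/zarr/timeseries/archive",
    "us-east-1",
)
println(url)
```

The resulting string can then be passed to `zopen(url, consolidated=true)` exactly as in the snippet above. The region and path would still come from the intake source through PyCall.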
I see. I'm also not sure about the complexity here. Maybe @meggart has a clearer view on this issue.
Has there been any progress towards a Julia equivalent of intake or would you recommend using PyCall instead?
I don't know of any work on a Julia equivalent of intake. I think that at the moment the way to go is to use PyCall as described above. @meggart was working on converting any xarray object to a YAXArray, but I am not sure how far along this is.