
Digesting intake catalogs

Open Balinus opened this issue 2 years ago • 6 comments

Hello!

I see a lot of tutorials that use intake (a Python package) for data loading, especially for remote datasets such as CMIP6, ERA5, etc. How hard would it be to load those datasets into YAXArrays.jl? I am not a specialist of intake, but if it were linked with YAXArrays.jl, it could potentially open up lots of cloud-hosted datasets from around the world.

For example

https://hydrocloudservices.github.io/catalogs/notebooks/ipynb/atmosphere.html http://gallery.pangeo.io/repos/pangeo-gallery/cmip6/intake_ESM_example.html

edit - intake-xarray : https://intake-xarray.readthedocs.io/en/latest/index.html

Best regards!

Balinus avatar Dec 13 '22 19:12 Balinus

You mean like here:

https://juliadatacubes.github.io/YAXArrays.jl/dev/examples/generated/Gallery/simplemaps/ https://juliadatacubes.github.io/YAXArrays.jl/dev/examples/generated/UserGuide/openZarr/

Also, I'm currently working on a tutorial that uses ERA5 directly from the cloud store. [A little more information in the tutorial would probably be nice to have.]

lazarusA avatar Dec 13 '22 19:12 lazarusA

Yes, like this, but I guess it is simply a matter of building the gs/https/S3 link myself from the information in the YAML catalog. But that would basically mean rewriting the intake library and hoping to stay coherent with upstream features.

I'm wondering how hard it would be to have something like:

using PyCall
using Zarr, YAXArrays
intake = pyimport("intake")

catalog_url = "https://raw.githubusercontent.com/hydrocloudservices/catalogs/main/catalogs/main.yaml"
catalog = intake.open_catalog(catalog_url)

# Wished-for usage: pass the intake source straight to open_dataset
g = open_dataset(catalog.atmosphere.era5_reanalysis_single_levels())

# where
julia> catalog.atmosphere.era5_reanalysis_single_levels()
PyObject sources:
  era5_reanalysis_single_levels:
    args:
      consolidated: true
      engine: zarr
      storage_options:
        anon: true
        client_kwargs:
          endpoint_url: https://s3.wasabisys.com
          region_name: us-east-1
        config_kwargs:
          max_pool_connections: 30
      urlpath:
      - s3://era5/world/reanalysis/single-levels/zarr/timeseries/archive
    description: ERA5 hourly estimates of variables on single levels chunked for time
      series analysis
    driver: intake_xarray.xzarr.ZarrSource
    metadata:
      catalog_dir: https://raw.githubusercontent.com/hydrocloudservices/catalogs/main/catalogs
      status:
      - prod
      tags:
      - ocean
      - model
      - atmosphere
      url: https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels
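
Since the entry printed above is plain YAML, the fields needed to locate the store could also be read without Python. The following is only a minimal sketch, assuming the YAML.jl package is installed; the inline string is a trimmed copy of the entry above, and the variable names are illustrative, not part of any existing API.

```julia
using YAML  # assumption: YAML.jl is available in the environment

# Trimmed copy of the catalog entry shown above
entry_yaml = """
sources:
  era5_reanalysis_single_levels:
    args:
      consolidated: true
      storage_options:
        client_kwargs:
          endpoint_url: https://s3.wasabisys.com
          region_name: us-east-1
      urlpath:
      - s3://era5/world/reanalysis/single-levels/zarr/timeseries/archive
    driver: intake_xarray.xzarr.ZarrSource
"""

# Parse the YAML text and pick the arguments of the single source
entry = YAML.load(entry_yaml)["sources"]["era5_reanalysis_single_levels"]
args = entry["args"]

# Fields needed to build a URL for the Zarr store
urlpath = first(args["urlpath"])
client = args["storage_options"]["client_kwargs"]
endpoint_url = client["endpoint_url"]
region_name = client["region_name"]
consolidated = args["consolidated"]
```

This only covers a single entry; nested catalogs, templated parameters, and the other intake drivers would still need to be handled separately.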

I do have a "working" behaviour constructed by hand, but I am not sure how robust it is compared to standards. For example, based on the same catalogue:


using PyCall
using Zarr, YAXArrays
intake = pyimport("intake")

catalog_url = "https://raw.githubusercontent.com/hydrocloudservices/catalogs/main/catalogs/main.yaml"
catalog = intake.open_catalog(catalog_url)

# Pointing directly to ERA5
datastore = catalog.atmosphere.era5_reanalysis_single_levels()

# Building an https URL from the region name and the s3:// path (the "s3://" prefix is dropped)
region = datastore.storage_options["client_kwargs"]["region_name"]
Store = string("https://s3.", region, ".wasabisys.com/", datastore.urlpath[1][6:end])
"https://s3.us-east-1.wasabisys.com/era5/world/reanalysis/single-levels/zarr/timeseries/archive"

g = open_dataset(zopen(Store, consolidated=true))
YAXArray Dataset
Dimensions:
   longitude           Axis with 1440 Elements from -180.0 to 179.75
   latitude            Axis with 721 Elements from 90.0 to -90.0
   time                Axis with 368184 Elements from 1979-01-01T00:00:00 to 2020-12-31T23:00:00
Variables: tp t2m
Properties: source => Reanalysis title => ERA5 forecasts institution => ECMWF
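
The URL construction could be pulled into a small helper so that the host is not hardcoded. This is only a sketch: `s3_to_https` is a hypothetical name, not a function of YAXArrays.jl, Zarr.jl, or intake. It assumes path-style addressing (endpoint, then bucket, then key), so its result differs from the hand-built URL, which puts the region name in the host. Whether a given provider accepts path-style URLs has to be checked case by case.

```julia
# Hypothetical helper: turn an s3:// path plus the catalog's endpoint_url
# into a path-style https URL (endpoint/bucket/key).
function s3_to_https(urlpath::AbstractString, endpoint_url::AbstractString)
    startswith(urlpath, "s3://") ||
        throw(ArgumentError("expected an s3:// URL, got $urlpath"))
    # Drop the scheme, keeping "bucket/key"
    bucket_and_key = replace(urlpath, r"^s3://" => "")
    # Avoid a double slash if the endpoint ends with "/"
    return string(rstrip(endpoint_url, '/'), "/", bucket_and_key)
end

store_url = s3_to_https(
    "s3://era5/world/reanalysis/single-levels/zarr/timeseries/archive",
    "https://s3.wasabisys.com",
)
```

If the endpoint accepts this form, the resulting string could be passed to `zopen` in the same way as the hand-built `Store`.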

Balinus avatar Dec 13 '22 20:12 Balinus

I see. I'm not sure about the complexity here either. Maybe @meggart has a clearer view on this issue.

lazarusA avatar Dec 14 '22 15:12 lazarusA

Has there been any progress towards a Julia equivalent of intake or would you recommend using PyCall instead?

briochemc avatar May 06 '24 12:05 briochemc

I don't know of any work on a Julia equivalent of intake. I think that at the moment the way to go is to use PyCall as described above. @meggart was working on converting any xarray object to a YAXArray, but I am not sure how far along this is.

felixcremer avatar May 06 '24 14:05 felixcremer