Speed difference when lazy loading ERA5 zarr v3 data

Open songhan89 opened this issue 8 months ago • 1 comments

Hi,

I noticed a significant difference in speed when lazy loading the ERA5 zarr v3 data from Google, depending on the chunked array backend. I'm curious to know why the difference is so large.

Loading with Dask

This took <2 min to load

import xarray as xr

# chunks={} gives lazy, chunked arrays (Dask is the default backend)
ds = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3',
    consolidated=False,
    chunks={},
    storage_options=dict(token='anon')
)

Loading with Cubed executor

This took 6 min to load. I tried the same with the Spark executor as well, and the result was the same.


import cubed
import xarray as xr

spec_local = cubed.Spec(
    executor='processes',
    work_dir='/tmp/',
    allowed_mem="4GB"
)

# chunked_array_type="cubed" requires the cubed-xarray package
ds = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3',
    consolidated=False,
    chunks={},
    chunked_array_type="cubed",
    from_array_kwargs={'spec': spec_local},
    storage_options=dict(token='anon')
)
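
To make the comparison easier to reproduce, here is a minimal sketch of how the two open calls could be timed side by side. The helper names `time_openers` and `era5_openers` are hypothetical and only for illustration; the `open_zarr` arguments are copied from the two snippets above. It measures wall-clock time for the lazy open only, not for any later compute.

```python
import time
from typing import Any, Callable, Dict


def time_openers(openers: Dict[str, Callable[[], Any]]) -> Dict[str, float]:
    """Call each opener once and return wall-clock seconds per name.

    Hypothetical helper, not part of xarray or cubed.
    """
    timings: Dict[str, float] = {}
    for name, open_fn in openers.items():
        start = time.perf_counter()
        open_fn()  # the result is discarded; only the elapsed time is kept
        timings[name] = time.perf_counter() - start
    return timings


def era5_openers() -> Dict[str, Callable[[], Any]]:
    """Build the two lazy-open calls from this report (hypothetical helper).

    Imports are local so the timing helper can be used without
    xarray, cubed, or cubed-xarray installed.
    """
    import cubed
    import xarray as xr

    url = "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"
    common = dict(
        consolidated=False,
        chunks={},
        storage_options=dict(token="anon"),
    )
    # Same Spec arguments as in the report above
    spec_local = cubed.Spec(executor="processes", work_dir="/tmp/", allowed_mem="4GB")

    return {
        "dask": lambda: xr.open_zarr(url, **common),
        "cubed": lambda: xr.open_zarr(
            url,
            chunked_array_type="cubed",
            from_array_kwargs={"spec": spec_local},
            **common,
        ),
    }
```

Calling `time_openers(era5_openers())` returns a dict such as `{"dask": ..., "cubed": ...}` with the elapsed seconds for each backend. Each opener runs only once, so network variability and caching between the two calls can affect the numbers.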

Apr 19 '25 16:04 songhan89

Thanks for opening this issue @songhan89. It's not immediately obvious why this is slower; it will need a bit of investigation.

Apr 21 '25 14:04 tomwhite