cubed
Speed difference when lazy loading ERA5 zarr v3 data
Hi,
I noticed a significant difference in speed when lazily loading the ERA5 Zarr v3 data from Google Cloud Storage, depending on which chunked array backend is used. I'm curious why the difference is so large.
Loading with Dask
This took <2 min to load
import xarray as xr

ds = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3',
    consolidated=False,
    chunks={},
    storage_options=dict(token='anon'),
)
Loading with Cubed executor
This took 6 min to load. I tried the same with the Spark executor as well, and the result was the same.
import cubed
import xarray as xr  # chunked_array_type="cubed" also needs the cubed-xarray package installed

spec_local = cubed.Spec(
    executor='processes',
    work_dir='/tmp/',
    allowed_mem="4GB",
)

ds = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3',
    consolidated=False,
    chunks={},
    chunked_array_type="cubed",
    from_array_kwargs={'spec': spec_local},
    storage_options=dict(token='anon'),
)
Thanks for opening this issue, @songhan89. It's not immediately obvious why this is slower; it will need a bit of investigation.
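As a starting point for that investigation, here is a minimal sketch of how the two open times could be measured side by side. The helpers `timed`, `open_era5` and `compare_open_times` are hypothetical names made up for this example, not part of xarray or cubed. The `xr.open_zarr` arguments are copied from the snippets above, and the Cubed spec is expected to be passed in by the caller (for example the `spec_local` from the original post).

```python
import time
from typing import Any, Callable, Dict, Tuple


def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Call fn() once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def open_era5(**extra_kwargs: Any) -> Any:
    """Lazily open the ERA5 store; extra_kwargs are forwarded to xr.open_zarr."""
    # Imported inside the function so `timed` can be used without xarray installed.
    import xarray as xr

    return xr.open_zarr(
        "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
        consolidated=False,
        chunks={},
        storage_options=dict(token="anon"),
        **extra_kwargs,
    )


def compare_open_times(cubed_spec: Any) -> Dict[str, float]:
    """Time the default (Dask) open and the Cubed open; returns seconds for each.

    cubed_spec is assumed to be a cubed.Spec such as `spec_local` from the post.
    """
    _, dask_seconds = timed(open_era5)
    _, cubed_seconds = timed(
        lambda: open_era5(
            chunked_array_type="cubed",
            from_array_kwargs={"spec": cubed_spec},
        )
    )
    return {"dask": dask_seconds, "cubed": cubed_seconds}
```

Calling `compare_open_times(spec_local)` needs network access to the public bucket. Because `chunks={}` keeps the arrays lazy, the timings should mostly reflect opening the store and building the lazy arrays rather than reading data values. Wrapping one of the calls in a profiler such as `cProfile` would be a natural next step to see where the extra time goes.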