xarray
Slow lazy performance on cloud data
Hi, I am not sure if this is the place to raise my issue but I'd appreciate any help!
I am trying to do a more complicated calculation with CESM cloud data (on the Pangeo cloud deployment) and am running into an issue with a simpler calculation that is part of the workflow. When taking the derivative, the differencing cell takes a very long time to run, even though this step is lazy and does not compute anything. It should run quickly but, as you can see from the screenshot, it does not: the reported run time is ~20 s, while the wall time is much longer (~2 min). This becomes a serious issue when taking the derivative of multiple variables as part of a larger workflow. @jbusecke and I replicated the differencing step on a randomized dask dataset and, as you can see, that cell runs much more quickly. Below I have pasted reproducible code that isolates the problem. I am not sure how to proceed on fixing this slow performance and would appreciate your help, thanks!
import xarray as xr
import numpy as np
import dask.array as dsa
import pop_tools
from xgcm import Grid
import xgcm
from intake import open_catalog
# Dask sample dataset
test_values = dsa.random.random((14695, 2400, 3600), chunks=(1, 2400, 3600))
da_sample = xr.DataArray(test_values, dims=['time', 'x', 'y'])
da_sample_u = xr.DataArray(test_values, dims=['time', 'x_u', 'y_u'])
ds_sample = xr.Dataset(data_vars=dict(test_values=da_sample, u=da_sample_u))
# The sample dataset has no 'nlon' dimension, so pad and difference along 'x'
%timeit ds_sample.pad({'x':(2,2)}).diff('x')
# Original dataset
url = "https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean/CESM_POP.yaml"
cat = open_catalog(url)
ds = cat["CESM_POP_hires_control"].to_dask()
ds = ds.drop([d for d in ds.dims if d in ds.coords])
%timeit ds.pad({'nlon':(2,2)}).diff('nlon')
![Screen Shot 2022-07-29 at 13 00 18](https://user-images.githubusercontent.com/31974425/181808665-ff87dc68-5c5d-421f-98e4-8b680d9e2c92.png)
Can't access that catalog unfortunately:
Bad Request: https://storage.googleapis.com/download/storage/v1/b/pangeo-cesm-pop/o/control%2F.zmetadata?alt=media
User project specified in the request is invalid.
How many variables are in `ds`? You're diagnosing the graph construction time in that `%timeit` statement. This will scale with the number of variables.
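To illustrate the point about graph construction, here is a minimal sketch on synthetic data. The helper names (`make_dataset`, `lazy_pad_diff`, `count_tasks`), the array shape, and the variable names are all made up for this example and are not from the CESM dataset. It only shows that the lazy `pad` + `diff` step adds dask tasks for every variable in the Dataset, so the time spent before any compute grows with the variable count.

```python
import time

import dask.array as dsa
import xarray as xr


def make_dataset(n_vars):
    """Build a small lazy Dataset with n_vars dask-backed variables (synthetic)."""
    return xr.Dataset(
        {
            f"var{i}": xr.DataArray(
                dsa.random.random((10, 20, 30), chunks=(1, 20, 30)),
                dims=["time", "nlat", "nlon"],
            )
            for i in range(n_vars)
        }
    )


def lazy_pad_diff(ds):
    """Pad and difference along nlon; this builds graphs but computes nothing."""
    return ds.pad({"nlon": (2, 2)}).diff("nlon")


def count_tasks(ds):
    """Sum the number of dask tasks behind each data variable."""
    return sum(len(dict(v.data.__dask_graph__())) for v in ds.data_vars.values())


for n_vars in (2, 8, 32):
    ds_n = make_dataset(n_vars)
    start = time.perf_counter()
    out = lazy_pad_diff(ds_n)
    elapsed = time.perf_counter() - start
    print(f"{n_vars:3d} variables: {count_tasks(out)} tasks, {elapsed:.4f} s to build")
```

The absolute timings will depend on the machine and on the chunking, so treat the printed numbers as a relative comparison only.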
This should be fully reproducible on the Pangeo cloud deployment, but unfortunately the data is only available as 'requester pays' from other machines.
> How many variables are in `ds`? You're diagnosing the graph construction time in that `%timeit` statement. This will scale with the number of variables.
Ahh good catch. @shanicetbailey can you try to drop all but two variables from the cloud dataset (`ds[['SST', 'U']]`) and check again?
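For reference, the suggested check could look roughly like the sketch below. Since the cloud catalog is not accessible everywhere, it uses a hypothetical stand-in dataset (`ds_demo`) with made-up `extra*` variables next to `SST` and `U`; on the real data, `ds_demo` would simply be the `ds` opened from the catalog.

```python
import time

import dask.array as dsa
import xarray as xr

# Hypothetical stand-in for the cloud dataset: SST and U plus some
# made-up extra variables, all dask-backed with dims (time, nlat, nlon).
names = ["SST", "U"] + [f"extra{i}" for i in range(6)]
ds_demo = xr.Dataset(
    {
        name: xr.DataArray(
            dsa.zeros((10, 20, 30), chunks=(1, 20, 30)),
            dims=["time", "nlat", "nlon"],
        )
        for name in names
    }
)

# Keep only the variables of interest before building the lazy graph.
ds_small = ds_demo[["SST", "U"]]

start = time.perf_counter()
result = ds_small.pad({"nlon": (2, 2)}).diff("nlon")  # lazy, nothing is computed
elapsed = time.perf_counter() - start

print(f"full dataset: {len(ds_demo.data_vars)} variables")
print(f"subset: {len(result.data_vars)} variables, {elapsed:.4f} s to build the graph")
```

If the slowdown really is graph construction, timing the same line on the full dataset should take noticeably longer than on the two-variable subset.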
Thanks for the report, @shanicetbailey. We can't really help you at the moment, so I'm closing this issue. Please reopen when you have more information.