
Slow lazy performance on cloud data

Open shanicetbailey opened this issue 1 year ago • 2 comments

Hi, I am not sure if this is the place to raise my issue but I'd appreciate any help!

I am trying to do a more complicated calculation with CESM cloud data (on the Pangeo cloud deployment) and am running into an issue with a simpler calculation that is part of the workflow. When taking a derivative, the cell that does the differencing takes a very long time to run, even though this step is lazy and should not compute anything. As you can see from the screenshot, the reported run time is ~20 s but the wall time is much longer (~2 min). This becomes a serious problem when taking the derivative of multiple variables as part of a larger workflow.

@jbusecke and I replicated the differencing step on a randomized dask dataset and, as you can see, that cell runs much faster. Below I have pasted reproducible code that isolates the problem. I am not sure how to fix this slow performance and would appreciate your help, thanks!

import xarray as xr
import numpy as np
import dask.array as dsa
import pop_tools
from xgcm import Grid
import xgcm
from intake import open_catalog

# Dask sample dataset

test_values = dsa.random.random((14695, 2400, 3600), chunks=(1, 2400, 3600))
da_sample = xr.DataArray(test_values, dims=['time', 'x', 'y'])
da_sample_u = xr.DataArray(test_values, dims=['time', 'x_u', 'y_u'])
ds_sample = xr.Dataset(data_vars=dict(test_values=da_sample, u=da_sample_u))

# ds_sample has no 'nlon' dimension, so pad and difference along 'x'
%timeit ds_sample.pad({'x':(2,2)}).diff('x')

# Original dataset

url = "https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean/CESM_POP.yaml"
cat = open_catalog(url)
ds = cat["CESM_POP_hires_control"].to_dask()
# Drop the coordinates that share a name with a dimension
ds = ds.drop_vars([d for d in ds.dims if d in ds.coords])

%timeit ds.pad({'nlon':(2,2)}).diff('nlon')
[Screenshot: Screen Shot 2022-07-29 at 13 00 18]

shanicetbailey, Jul 29 '22 17:07

Can't access that catalog unfortunately:

 Bad Request: https://storage.googleapis.com/download/storage/v1/b/pangeo-cesm-pop/o/control%2F.zmetadata?alt=media
User project specified in the request is invalid.

How many variables are in ds? You're measuring graph construction time in that %timeit statement, and it scales with the number of variables.
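To illustrate the point, here is a minimal sketch that times only the lazy pad and diff call on small synthetic datasets with different numbers of variables. The helper names `make_dataset` and `time_graph_construction`, the variable names, and the array shapes are assumptions made for illustration and are not taken from the CESM dataset.

```python
import time

import dask.array as dsa
import xarray as xr


def make_dataset(n_vars):
    """Build a small lazy dataset with n_vars dask-backed variables (synthetic stand-in)."""
    return xr.Dataset(
        {
            f"var_{i}": (
                ("time", "nlat", "nlon"),
                dsa.random.random((20, 8, 16), chunks=(1, 8, 16)),
            )
            for i in range(n_vars)
        }
    )


def time_graph_construction(ds):
    """Time the lazy pad + diff call. Nothing is computed; only the task graph is built."""
    start = time.perf_counter()
    lazy = ds.pad({"nlon": (2, 2)}).diff("nlon")
    elapsed = time.perf_counter() - start
    return lazy, elapsed


for n_vars in (1, 5, 25):
    lazy, elapsed = time_graph_construction(make_dataset(n_vars))
    print(f"{n_vars:>2} variables: {elapsed:.3f} s to build the graph")
```

Absolute timings depend on the machine and on chunking, so only the trend across the three runs is meaningful here.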

dcherian, Jul 29 '22 17:07

This should be fully reproducible on the Pangeo cloud deployment, but unfortunately the data is only available as 'requester pays' from other machines.

> How many variables are in ds? You're measuring graph construction time in that %timeit statement, and it scales with the number of variables.

Ahh good catch. @shanicetbailey can you try to drop all but two variables from the cloud dataset (ds[['SST', 'U']]) and check again?
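A minimal sketch of that check, using a synthetic stand-in rather than the real cloud dataset: it subsets to two variables and compares the number of dask tasks generated by the lazy pad and diff. Only `SST` and `U` come from this thread; the other variable names, the array shapes, and the `count_tasks` helper are hypothetical.

```python
import dask.array as dsa
import numpy as np
import xarray as xr

# Synthetic stand-in for the cloud dataset; names other than SST and U are hypothetical.
names = ["SST", "U", "V", "SSH", "extra_1", "extra_2"]
ds_demo = xr.Dataset(
    {
        name: (
            ("time", "nlat", "nlon"),
            # Distinct values per variable so each one gets its own dask tasks.
            dsa.from_array(np.full((20, 8, 16), float(i)), chunks=(1, 8, 16)),
        )
        for i, name in enumerate(names)
    }
)


def count_tasks(ds):
    """Number of tasks in the merged dask graph of all variables in ds."""
    return len(dict(ds.__dask_graph__()))


full = ds_demo.pad({"nlon": (2, 2)}).diff("nlon")
subset = ds_demo[["SST", "U"]].pad({"nlon": (2, 2)}).diff("nlon")

print("tasks with all variables:", count_tasks(full))
print("tasks with two variables:", count_tasks(subset))
```

If the subset builds a much smaller graph, that suggests the cost is dominated by graph construction across many variables rather than by the differencing itself.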

jbusecke, Jul 29 '22 18:07

Thanks for the report, @shanicetbailey. We can't really help you at the moment, so I'm closing this issue. Please reopen when you have more information.

dcherian, Sep 12 '22 18:09