[Bug]: `xgcm` vertical regridder throws error `ValueError: dimension temp_unique on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension.`
What happened?
I am copying the example from "Remap cloud fraction from model hybrid coordinate to pressure levels" to the SciPy notebook (PR #658) to serve as an example of vertical regridding.
I downloaded the datasets from ESGF over HTTP and opened them using `xc.open_dataset()`/`xc.open_mfdataset()`. Calling `regridder.vertical()` raises `ValueError: dimension temp_unique on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension.`
This error is not raised when opening the datasets from ESGF via OPeNDAP (refer to MVCE).
What did you expect to happen? Are there any possible answers you came across?
Vertically regridding local .nc datasets should work the same as using the remote OPeNDAP dataset.
I'm not sure why there is a difference in behavior.
Minimal Complete Verifiable Example (MVCE)
# %%
import numpy as np
import xcdat as xc
# Build hybrid pressure coordinate
def hybrid_coordinate(p0, a, b, ps, **kwargs):
    return a * p0 + b * ps
# %%
## 1. Using the dataset downloaded from ESGF via HTTP -- raises ValueError
# Download link: https://esgf-data2.llnl.gov/thredds/fileServer/user_pub_work/CMIP6/CMIP/E3SM-Project/E3SM-2-0/historical/r1i1p1f1/Amon/cl/gr/v20220830/cl_Amon_E3SM-2-0_historical_r1i1p1f1_gr_185001-189912.nc
LOCAL_URL = "cl_Amon_E3SM-2-0_historical_r1i1p1f1_gr_185001-189912.nc"
ds_cl = xc.open_dataset(LOCAL_URL, chunks={"time": 4})
pressure = hybrid_coordinate(**ds_cl.data_vars)
new_pressure_grid = xc.create_grid(z=xc.create_axis("lev", np.linspace(100000, 1, 13)))
output_cl = ds_cl.regridder.vertical(
    "cl", new_pressure_grid, method="linear", target_data=pressure
)
output_cl.compute()
# %%
## 2. Using the same dataset from ESGF via OPeNDAP -- works fine
REMOTE_URL = "https://esgf-data2.llnl.gov/thredds/dodsC/user_pub_work/CMIP6/CMIP/E3SM-Project/E3SM-2-0/historical/r1i1p1f1/Amon/cl/gr/v20220830/cl_Amon_E3SM-2-0_historical_r1i1p1f1_gr_185001-189912.nc"
ds_cl = xc.open_dataset(REMOTE_URL, chunks={"time": 4})
pressure = hybrid_coordinate(**ds_cl.data_vars)
new_pressure_grid = xc.create_grid(z=xc.create_axis("lev", np.linspace(100000, 1, 13)))
output_cl = ds_cl.regridder.vertical(
    "cl", new_pressure_grid, method="linear", target_data=pressure
)
output_cl.compute()
Relevant log output
ValueError Traceback (most recent call last)
Cell In[31], line 1
----> 1 ds_vr = ds_3d.regridder.vertical(
2 "cl", new_pressure_grid, method="linear", target_data=pressure_data
3 )
File ~/repositories/xcdat/xcdat/regridder/accessor.py:281, in RegridderAccessor.vertical(self, data_var, output_grid, tool, **options)
273 input_grid = _get_input_grid(
274 self._ds,
275 data_var,
(...)
278 ],
279 )
280 regridder = regrid_tool(input_grid, output_grid, **options)
--> 281 output_ds = regridder.vertical(data_var, self._ds)
283 return output_ds
File ~/repositories/xcdat/xcdat/regridder/xgcm.py:179, in XGCMRegridder.vertical(self, data_var, ds)
174 target_data = None
176 # assumes new verical coordinate has been calculated and stored as `pressure`
177 # TODO: auto calculate pressure reference http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#parametric-v-coord
178 # cf_xarray only supports two ocean s-coordinate and ocean_sigma_coordinate
--> 179 output_da = grid.transform(
180 ds[data_var],
181 "Z",
182 output_coord_z,
183 target_data=target_data,
184 method=self._method,
185 **self._extra_options,
186 )
188 # when the order of dimensions are mismatched, the output data will be
189 # transposed to match the input dimension order
190 if output_da.dims != ds[data_var].dims:
File /opt/miniconda3/envs/xcdat_scipy_2024/lib/python3.11/site-packages/xgcm/grid.py:2692, in Grid.transform(self, da, axis, target, **kwargs)
2608 """Convert an array of data to new 1D-coordinates along `axis`.
2609 The method takes a multidimensional array of data `da` and
2610 transforms it onto another data_array `target_data` in the
(...)
2688
2689 """
2691 ax = self.axes[axis]
-> 2692 return ax.transform(da, target, **kwargs)
File /opt/miniconda3/envs/xcdat_scipy_2024/lib/python3.11/site-packages/xgcm/grid.py:1080, in Axis.transform(self, da, target, target_data, method, mask_edges, bypass_checks, suffix)
1076 if method == "linear" or method == "log":
1077 target, target_dim, target_data = _parse_target(
1078 target, target_dim, dim, target_data
1079 )
-> 1080 out = linear_interpolation(
1081 da,
1082 target_data,
1083 target,
1084 dim,
1085 dim, # in this case the dimension of phi and theta are the same
1086 target_dim,
1087 mask_edges=mask_edges,
1088 bypass_checks=bypass_checks,
1089 logarithmic=(method == "log"),
1090 )
1091 elif method == "conservative":
1092 # the conservative method requires `target_data` to be on the `outer` coordinate.
1093 # If that is not the case (a very common use case like transformation on any tracer),
1094 # we need to infer the boundary values (using the interp logic)
1095 # for this method we need the `outer` position. Error out if its not there.
1096 try:
File /opt/miniconda3/envs/xcdat_scipy_2024/lib/python3.11/site-packages/xgcm/transform.py:209, in input_handling.<locals>.wrapper_input_handling(*args, **kwargs)
207 # Execute function with temporary names
208 args = (phi, theta, target_theta_levels, temp_dim2, theta_dim, temp_dim)
--> 209 value = func(*args, **kwargs)
211 # rename back to original name
212 value = value.rename({temp_dim: target_dim})
File /opt/miniconda3/envs/xcdat_scipy_2024/lib/python3.11/site-packages/xgcm/transform.py:227, in linear_interpolation(phi, theta, target_theta_levels, phi_dim, theta_dim, target_dim, **kwargs)
223 @input_handling
224 def linear_interpolation(
225 phi, theta, target_theta_levels, phi_dim, theta_dim, target_dim, **kwargs
226 ):
--> 227 out = xr.apply_ufunc(
228 interp_1d_linear,
229 phi,
230 theta,
231 target_theta_levels,
232 kwargs=kwargs,
233 input_core_dims=[[phi_dim], [theta_dim], [target_dim]],
234 output_core_dims=[[target_dim]],
235 exclude_dims=set((phi_dim, theta_dim)),
236 dask="parallelized",
237 output_dtypes=[phi.dtype],
238 )
239 return out
File /opt/miniconda3/envs/xcdat_scipy_2024/lib/python3.11/site-packages/xarray/core/computation.py:1266, in apply_ufunc(func, input_core_dims, output_core_dims, exclude_dims, vectorize, join, dataset_join, dataset_fill_value, keep_attrs, kwargs, dask, output_dtypes, output_sizes, meta, dask_gufunc_kwargs, on_missing_core_dim, *args)
...
775 )
776 dask_gufunc_kwargs["allow_rechunk"] = True
778 output_sizes = dask_gufunc_kwargs.pop("output_sizes", {})
ValueError: dimension temp_unique on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, either rechunk into a single array chunk along this dimension, i.e., ``.chunk(dict(temp_unique=-1))``, or pass ``allow_rechunk=True`` in ``dask_gufunc_kwargs`` but beware that this may significantly increase memory usage.
Anything else we need to know?
No response
Environment
xcdat=0.7.1 and main
xarray=2024.5.0
xgcm=0.8.1
Hey @jasonb5, can you take a look at this by our next meeting on 6/17 (Wed)?
@tomvothecoder Just got around to this issue. I've identified the problem, but I'm not sure of the cause yet. Calling `open_dataset` with a remote URL versus a local path causes the file to be chunked differently, even when `chunks={"time": 4}` is used in both cases. Remotely the variable is chunked as (4, 72, 180, 360), which is correct, but locally it is chunked as (4, 36, 90, 180). Because the data ends up chunked along `lev`, `lat`, and `lon`, xgcm errors out: those dimensions are "core" (dependent) dimensions and cannot span multiple chunks. The simple fix was to add `ds_cl = ds_cl.chunk({"time": 4, "lev": 72, "lat": 180, "lon": 360})`, which rechunks the data in a way xgcm can handle, as sketched below.
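A minimal sketch of that workaround in the context of the MVCE above (the sizes 72/180/360 are specific to this file; `-1` in a chunk mapping also works and means "one chunk spanning the whole dimension"):

ds_cl = xc.open_dataset(LOCAL_URL, chunks={"time": 4})

# Rechunk so `lev`, `lat`, and `lon` each become a single chunk, matching
# the layout the OPeNDAP open produces. Using -1 instead of the explicit
# sizes would do the same thing without hard-coding dimension lengths.
ds_cl = ds_cl.chunk({"time": 4, "lev": 72, "lat": 180, "lon": 360})

# The rest of the MVCE proceeds unchanged from here.
pressure = hybrid_coordinate(**ds_cl.data_vars)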
I'm going to look at `open_dataset` and determine where the issue is; it might be an xarray issue, though. I'll update once I have more.
@tomvothecoder Found the cause of the issue. xarray implemented `preferred_chunks` for the netcdf4 backend. For non-contiguous files, the encoded (on-disk) chunking is used and merged with the desired chunking. In this case the encoded chunking is (1, 36, 90, 180) and the desired chunking was 4 along `time` with the other dimensions unspecified; merging the two resulted in a chunk size of (4, 36, 90, 180). If you don't know whether the source file is contiguous, the best approach seems to be to specify every dimension in the `chunks` argument, e.g. the sketch below.
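A sketch of that recommendation, assuming the same dimension names as the MVCE (`-1` tells dask to use one chunk covering the full dimension):

# Specifying every dimension up front means the backend's preferred_chunks
# are not merged in for dimensions left unspecified.
ds_cl = xc.open_dataset(
    LOCAL_URL, chunks={"time": 4, "lev": -1, "lat": -1, "lon": -1}
)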
The reason the OPeNDAP URL works is that it goes through a completely different backend, which utilizes pydap. That backend doesn't support `preferred_chunks`; it considers the file one whole chunk and then updates this with the desired chunking.
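For future readers, one way to inspect what each backend decided (a sketch; recent xarray versions record the netcdf4 backend's preferred chunking under `preferred_chunks` in the variable's encoding, while `.chunks` reflects the resulting dask layout):

ds_local = xc.open_dataset(LOCAL_URL, chunks={"time": 4})
print(ds_local["cl"].encoding.get("preferred_chunks"))  # on-disk chunking preference
print(ds_local["cl"].chunks)  # dask layout after merging -> multiple chunks along lev/lat/lon

ds_remote = xc.open_dataset(REMOTE_URL, chunks={"time": 4})
print(ds_remote["cl"].chunks)  # single chunk along lev/lat/lon; time split into chunks of 4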
Thank you @jasonb5. It's great to know how xarray behaves with pre-chunked datasets when an incompatible `chunks` argument is specified.
The datasets used in our notebooks will be updated in PR #705, so this won't be an issue anymore.