xarray
xarray copied to clipboard
Return a 3D object alongside 1D object in apply_ufunc
Is your feature request related to a problem?
Currently, I have something similar to this, where the input_lat
is transformed to new_lat
(here, +0.25, but in real use case, it's indeterministic).
Since xarray_ufunc
doesn't return a dataset with actual coordinates values, I had to return a second output to retain new_lat to properly update the coordinate values, but this second output is shaped time, lat, lon
so I have to ds["lat"] = new_lat.isel(lon=0, time=0).values
, which I think is inefficient; I simply need it to be shaped lat
.
Any ideas on how I can modify this to make it more efficient?
import xarray as xr
import numpy as np
air = xr.tutorial.open_dataset("air_temperature")["air"]
input_lat = np.arange(20, 45)
def interp1d_np(data, base_lat, input_lat):
new_lat = input_lat + 0.25
return np.interp(new_lat, base_lat, data), new_lat
ds, new_lat = xr.apply_ufunc(
interp1d_np, # first the function
air,
air.lat, # as above
input_lat, # as above
input_core_dims=[["lat"], ["lat"], ["lat"]], # list with one entry per arg
output_core_dims=[["lat"], ["lat"]], # returned data has one dimension
exclude_dims=set(("lat",)), # dimensions allowed to change size. Must be a set!
vectorize=True, # loop over non-core dims
)
new_lat = new_lat.isel(lon=0, time=0).values
ds["lat"] = new_lat
Describe the solution you'd like
Either be able to automatically assign the new_lat to the returned xarray object, or allow a 1D dataset to be returned
Describe alternatives you've considered
No response
Additional context
No response
If new_lat
is really 1-dimensional, meaning it doesn't vary by lat/lon point, why can't it be calculated or defined prior to running anything through apply_ufunc
, and then passed through as an input to your interp function?
It's indeterministic, or it's decided inside apply_ufunc
def _help_downsample(data, time, n_out):
indices = MinMaxLTTBDownsampler().downsample(time, data, n_out=n_out)
return data[indices], indices
def apply_downsample(ts_ds, n_out):
ts_ds_downsampled, indices = xr.apply_ufunc(
_help_downsample,
ts_ds["data"],
ts_ds["time"],
kwargs=dict(n_out=n_out),
input_core_dims=[["time"], ["time"]],
output_core_dims=[["time"], ["indices"]],
exclude_dims=set(("time",)),
vectorize=True,
dask="parallelized",
dask_gufunc_kwargs=dict(output_sizes={"time": n_out, "indices": n_out}),
)
ts_ds_downsampled["time"] = ts_ds["time"].isel(time=indices.values[0])
return ts_ds_downsampled.rename("data")
If indices
is non-deterministic for every time _help_downsample
is called for new data
, doesn't that mean you want a unique vector of indices at all non-core input dims? You're calling apply_ufunc(..., vectorize=True)
so the helper function is being called once per 1D timeseries, and selecting indices[0]
would be throwing away unique index vectors in that case.
You're right. I suppose I misunderstood; I thought it was stacking all the channels to calculate the indices.
Perhaps what I really want to do is just the following without apply_ufunc
new_da = da.stack({"temp": ["time", "channel"]})
MinMaxLTTBDownsampler().downsample(new_da["temp"], new_da["data"], n_out=n_out)
I've run in to this before. There's no way to limit the broadcasting behaviour, so if there are broadcast dimensions on the input variables, those are assumed to be the broadcast dimensions on all output variables.
I believe the best solution is https://github.com/pydata/xarray/issues/1618 . In the mean time if you don't need vectorize
you can specify all input and output dims as core dimensions and then there are no broadcast dimensions at all.
Is there a way to do this without a loop?
import numpy as np
t0 = [0, 1, 2]
t1 = [3, 4, 5]
ts = [t0, t1]
z = np.array([[6, 7, 8], [9, 10, 11]])
ds = xr.DataArray(z, dims=["channel", "time"]).to_dataset("channel")
for i, channel in enumerate(ds.data_vars):
dim = f"time_{i}"
ds[channel] = (dim, ds[channel].values)
ds[channel] = ds[channel].assign_coords({dim: ts[i]})
ds
If the time
vectors are all the same length, you can just store as a multi-dimensional coordinate:
ds = xr.DataArray(
z,
dims=["channel", "t_idx"],
coords={"time": (("channel", "t_idx"), ts)}
)