clouddrift
⭐ (feature) ragged operation on ragged xarray *datasets*
I am wondering whether it would be worth extending some of the functionality of the `ragged` module to operate not only on ragged arrays, such as xarray DataArrays, but also on xarray Datasets.
As an example, imagine we want to use the `segment` function to "split" the trajectories of a ragged xarray dataset `ds` with dimensions `traj` and `obs` and a row size variable `rowsize` of dimension `traj`. The `segment` function can be applied to a ragged array variable of `ds` of dimension `obs`, such as `ds["time"]`, and returns a new row size array, call it `new_rowsize`, that divides the input array into new rows (more rows than previously). Then, what if we want to substitute `new_rowsize` into the original xarray dataset `ds` and work from there? In other words, we would need to transform the entire xarray dataset so that the dimension `traj` has length `len(new_rowsize)`. This would also include splitting all the variables of dimension `traj` accordingly, to map them onto the new dimension of length `len(new_rowsize)`.
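To make the mapping step concrete, here is a minimal NumPy-only sketch of how variables of dimension `traj` could be expanded onto the new segments. The helper name `segment_to_row_index` is hypothetical and is not part of clouddrift. It assumes that `new_rowsize` covers exactly the same observations as `rowsize`, that no segment is empty, and that no segment straddles two trajectories, which should hold when `segment` is called with the `rowsize` argument.

```python
import numpy as np


def segment_to_row_index(rowsize, new_rowsize):
    """Hypothetical helper: index of the original row that contains each new segment."""
    rowsize = np.asarray(rowsize)
    new_rowsize = np.asarray(new_rowsize)
    if rowsize.sum() != new_rowsize.sum():
        raise ValueError("rowsize and new_rowsize must cover the same observations")
    # Exclusive end offset of each original row along the obs dimension.
    row_ends = np.cumsum(rowsize)
    # Start offset of each new segment along the obs dimension.
    seg_starts = np.cumsum(new_rowsize) - new_rowsize
    # A segment belongs to the first row whose end offset lies beyond its start.
    return np.searchsorted(row_ends, seg_starts, side="right")


# Toy example: two trajectories (3 and 2 obs) split into three segments.
rowsize = np.array([3, 2])
new_rowsize = np.array([2, 1, 2])
traj_ids = np.array([101, 202])  # stands in for any variable of dimension traj

seg_index = segment_to_row_index(rowsize, new_rowsize)  # [0, 0, 1]
segment_ids = traj_ids[seg_index]  # [101, 101, 202]
```

Every variable of dimension `traj` could be indexed with `seg_index` in the same way, while variables of dimension `obs` would stay untouched because segmenting alone does not drop any observation.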
Or maybe this type of functionality should be folded into `subset`? @philippemiron I would love to hear what you think.
Here is what I tried, which in the end does not work:
import clouddrift as cd
import numpy as np
import xarray as xr

# open locally the global drifter hourly dataset, alternatively can be opened from the cloud
# ds = cd.datasets.gdp1h()
# probably faster to not decode times
ds = xr.open_dataset("/Users/selipot/Data/awss3/latest/gdp-v2.01.1.zarr", engine="zarr", decode_times=False)
print(np.sum(ds['rowsize'])) # 197214787
# %% need to segment first; no gaps in time larger than 3600 seconds
segment_size = cd.ragged.segment(ds["time"],3600,ds["rowsize"])
print(len(segment_size)) # 100235
print(np.sum(segment_size)) # 197214787 # this keeps the same number of obs
# %% we now want to keep only the segments that are at least 5 days (5*24=120 hours or points) long
# variables we want to work with are lon, lat, time, ve, vn
min_length = 120
lon, segment120_size = cd.ragged.prune(ds["lon"],segment_size,min_length)
print(len(segment120_size)) # 48058 this is the number of segments that are at least 120 hours long
print(np.sum(segment120_size)) # 195462963 this is the number of observations in the segments that are at least 120 hours long
# %% now trying to subset the data to only keep the segments that are at least 120 hours long
# add to `ds` a new variable `segmentsize` with a new dimension that is segment = len(segment120_size)
ds = ds.assign_coords(segment=("segment", np.arange(len(segment120_size))))
ds["segmentsize"] = (("segment",), segment120_size)
print(ds)
# %%
ds2 = cd.ragged.subset(ds, {"segmentsize": (120, np.inf)}, row_dim_name="segment", id_var_name="id")
ds2
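One thing the printed totals above suggest (I have not confirmed that this is the cause of the failure) is that `segment120_size` sums to 195462963 observations, while the variables of dimension `obs` in `ds` still hold 197214787. So pruning has to be applied consistently to every `obs` variable, not only to `lon`. Below is a minimal NumPy-only sketch of that idea on toy data. The helper `prune_ragged` is hypothetical and is not the clouddrift API; it only illustrates building one observation mask from the segment sizes and reusing it for all variables.

```python
import numpy as np


def prune_ragged(obs_vars, segment_size, min_length):
    """Hypothetical helper: drop every segment shorter than min_length from all obs variables."""
    segment_size = np.asarray(segment_size)
    # One boolean per segment: True if the segment is long enough to keep.
    keep_segment = segment_size >= min_length
    # Expand to one boolean per observation by repeating each flag segment_size times.
    keep_obs = np.repeat(keep_segment, segment_size)
    pruned = {name: np.asarray(values)[keep_obs] for name, values in obs_vars.items()}
    return pruned, segment_size[keep_segment], keep_segment


# Toy example: three segments of 2, 1 and 3 observations; keep those with at least 2.
segment_size = np.array([2, 1, 3])
obs_vars = {
    "lon": np.array([10.0, 11.0, 20.0, 30.0, 31.0, 32.0]),
    "lat": np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0]),
}

pruned, pruned_size, keep_segment = prune_ragged(obs_vars, segment_size, 2)
# pruned["lon"] is [10., 11., 30., 31., 32.] and pruned_size is [2, 3]
```

The returned `keep_segment` mask could then be applied to any per-segment variable, so that the `obs` and `segment` dimensions of a rebuilt dataset stay consistent with the pruned row sizes.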