
⭐ (feature) ragged operation on ragged xarray *datasets*

Open selipot opened this issue 4 months ago • 3 comments

I am wondering whether it would be worth extending some of the functionality of the ragged module to operate not only on ragged arrays, such as xarray DataArrays, but also on xarray Datasets.

As an example, imagine we want to use the `segment` function to split the trajectories of a ragged xarray dataset `ds` with dimensions `traj` and `obs` and a row size variable `rowsize` of dimension `traj`. The `segment` function can be applied to a variable of `ds` of dimension `obs`, such as `ds["time"]`, and returns a new row size array, call it `new_rowsize`, that divides the input array into new rows (more rows than before). What if we then want to substitute `new_rowsize` into the original dataset `ds` and work from there? In other words, we would need to transform the entire dataset so that the dimension `traj` has length `len(new_rowsize)`. This would also mean splitting all the variables of dimension `traj` accordingly, to map them onto the new dimension.

Or maybe this type of functionality should be folded into `subset`? @philippemiron I would love to hear what you think.
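To make the `traj`-to-segment mapping concrete, here is a minimal NumPy sketch of one way it could be done. It is not part of clouddrift: the helper name `segment_to_traj_index` and the toy arrays are hypothetical, and it assumes that segments never cross a trajectory boundary and that `new_rowsize` covers exactly the same observations as `rowsize`.

```python
import numpy as np

def segment_to_traj_index(rowsize, new_rowsize):
    """Hypothetical helper: index of the parent trajectory of each new segment."""
    rowsize = np.asarray(rowsize)
    new_rowsize = np.asarray(new_rowsize)
    if rowsize.sum() != new_rowsize.sum():
        raise ValueError("rowsize and new_rowsize must cover the same observations")
    # Offset of the first observation of each segment along the obs dimension
    seg_start = np.cumsum(new_rowsize) - new_rowsize
    # Exclusive end offset of each original trajectory
    traj_end = np.cumsum(rowsize)
    # A segment belongs to the first trajectory whose end lies past its start
    return np.searchsorted(traj_end, seg_start, side="right")

# Toy data (assumed for illustration): 3 trajectories split into 5 segments
rowsize = np.array([5, 3, 4])
new_rowsize = np.array([2, 3, 3, 1, 3])
drifter_id = np.array([101, 102, 103])  # a variable of dimension traj

parent = segment_to_traj_index(rowsize, new_rowsize)
id_per_segment = drifter_id[parent]  # the same variable on the new dimension

print(parent)          # [0 0 1 2 2]
print(id_per_segment)  # [101 101 102 103 103]
```

Any variable of dimension `traj` could be remapped in the same way by indexing it with `parent`, while variables of dimension `obs` would stay unchanged.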

Here is what I tried, which in the end does not work:

import clouddrift as cd
import numpy as np
import xarray as xr

# open the global hourly drifter dataset locally; alternatively it can be opened from the cloud
# ds = cd.datasets.gdp1h()
# probably faster to not decode times
ds = xr.open_dataset("/Users/selipot/Data/awss3/latest/gdp-v2.01.1.zarr", engine="zarr", decode_times=False)
print(np.sum(ds['rowsize'])) # 197214787

# %% need to segment first; no gaps in time larger than 3600 seconds
segment_size = cd.ragged.segment(ds["time"],3600,ds["rowsize"])
print(len(segment_size)) # 100235
print(np.sum(segment_size)) # 197214787 # this keeps the same number of obs

# %% we now want to keep only the segments that are at least 5 days (5*24=120 hours or points) long
# variables we want to work with are lon, lat, time, ve, vn
min_length = 120
lon, segment120_size = cd.ragged.prune(ds["lon"],segment_size,min_length)

print(len(segment120_size)) # 48058 this is the number of segments that are at least 120 hours long
print(np.sum(segment120_size)) # 195462963 this is the number of observations in the segments that are at least 120 hours long

# %% now trying to subset the data to only keep the segments that are at least 120 hours long
# add to `ds` a new variable `segmentsize` with a new dimension that is segment = len(segment120_size)
ds = ds.assign_coords(segment=("segment", np.arange(len(segment120_size))))
ds["segmentsize"] = (("segment",), segment120_size)

print(ds)

# %%
ds2 = cd.ragged.subset(ds, {"segmentsize": (120, np.inf)}, row_dim_name="segment", id_var_name="id")
ds2
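For comparison, the pruning step can be written as a boolean mask over segments, which makes it possible to apply the same selection to every `obs` variable at once. This is only a NumPy sketch on toy arrays, not clouddrift code; it assumes that segments with at least `min_length` observations are kept, and the names `keep_segment` and `keep_obs` are made up for illustration.

```python
import numpy as np

# Toy data (assumed for illustration): 5 segments covering 12 observations
segment_size = np.array([2, 3, 3, 1, 3])
obs_var = np.arange(segment_size.sum())  # stands in for lon, lat, time, ...
min_length = 3

# One flag per segment: True if the segment is long enough to keep
keep_segment = segment_size >= min_length
# Expand the per-segment flags to one flag per observation
keep_obs = np.repeat(keep_segment, segment_size)

pruned_size = segment_size[keep_segment]  # row sizes of the kept segments
pruned_obs = obs_var[keep_obs]            # observations of the kept segments

print(pruned_size)  # [3 3 3]
print(pruned_obs)   # [ 2  3  4  5  6  7  9 10 11]
```

The same `keep_obs` mask could be reused for each variable of dimension `obs`, so that all of them stay aligned with `pruned_size`.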

selipot avatar Oct 03 '24 19:10 selipot