
Range-aware subset

Open milancurcic opened this issue 1 year ago • 9 comments

As discussed with @selipot today who proposed this idea.

Current implementation of subset is cloud-optimized for criteria that have a traj dimension, for example, subsetting by ID:

subset(ds, {"ID": [2578, 2582, 2583]})

However, subsetting by criteria that have an obs dimension, for example, subsetting by region or time:

subset(ds, {"lat": (21, 31), "lon": (-98, -78)})

requires downloading, in full, every variable that appears in the criteria so that the comparison can be made locally.

If the range (min and max) of these variables were known for each trajectory, subset could first select by ID, under the hood, only the trajectories whose range overlaps the criteria, thus effectively doing the subset by obs dimension in a cloud-optimized way.
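A minimal sketch of that pre-selection step, using plain NumPy arrays rather than the actual clouddrift internals. The helper name `overlaps` and the sample min/max values are hypothetical and only serve as illustration:

```python
import numpy as np

def overlaps(var_min, var_max, lo, hi):
    """Mask of trajectories whose [var_min, var_max] range intersects [lo, hi]."""
    return (np.asarray(var_max) >= lo) & (np.asarray(var_min) <= hi)

# Hypothetical per-trajectory (traj dimension) ranges for three drifters.
ids = np.array([2578, 2582, 2583])
lat_min = np.array([10.0, 25.0, 40.0])
lat_max = np.array([12.0, 27.0, 42.0])
lon_min = np.array([-90.0, -85.0, -60.0])
lon_max = np.array([-88.0, -80.0, -50.0])

# Criteria from the example above: {"lat": (21, 31), "lon": (-98, -78)}
mask = overlaps(lat_min, lat_max, 21, 31) & overlaps(lon_min, lon_max, -98, -78)
candidate_ids = ids[mask]  # only these trajectories need their obs data fetched
```

Range overlap is a necessary condition, not a sufficient one: a trajectory can overlap both the lat and lon ranges without any single observation satisfying both. The obs-level comparison would therefore still run, but only on the candidate trajectories.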

clouddrift could propose the following requirement for cloud-optimized ragged arrays: Every numeric variable <var> with the obs dimension will be accompanied by the variables <var_min> and <var_max> with the traj dimension.
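As a rough illustration of how such range variables might be produced when a ragged array is built, the sketch below computes per-trajectory minima and maxima with NumPy. The function name `traj_min_max` is hypothetical, and the sketch assumes the ragged array stores the number of observations per trajectory in an array called `rowsize` here, with every trajectory having at least one observation:

```python
import numpy as np

def traj_min_max(values, rowsize):
    """Per-trajectory min and max of a ragged (obs-dimension) variable.

    Assumes every entry of rowsize is >= 1; reduceat does not handle
    empty segments the way one would want.
    """
    values = np.asarray(values)
    rowsize = np.asarray(rowsize)
    # Index of the first observation of each trajectory.
    starts = np.concatenate(([0], np.cumsum(rowsize)[:-1]))
    return (
        np.minimum.reduceat(values, starts),
        np.maximum.reduceat(values, starts),
    )

# Three trajectories with 2, 3, and 3 observations (made-up values).
lat = np.array([10.0, 12.0, 25.0, 27.0, 26.0, 40.0, 42.0, 41.0])
rowsize = np.array([2, 3, 3])
lat_min, lat_max = traj_min_max(lat, rowsize)
```

The resulting `lat_min` and `lat_max` have one value per trajectory, so they could be stored with the traj dimension alongside the original variable.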

If the expected range variables are not found in the dataset, subset could fall back to carrying out the comparison as in the current implementation.

milancurcic Jul 07 '23 19:07