
Feature variable names and the road to a combined dataset

Open deeplycloudy opened this issue 2 years ago • 5 comments

One of tobac's advantages is that it keeps each step in the tracking process separate. Right now, each step using tobac's various tracking functions produces a separate dataset or array.

However, as I think about producing datasets for sharing among work groups and for long-term archival, I'd like to be able to create a combined data file containing the feature data table, the feature mask, and the tracked-feature cell IDs, along with the projection data for the feature mask.

With some judicious renaming, and keeping in mind the parent-child relationship ideas in the CF-tree proposal, it is in fact possible to combine everything into one dataset. Here's a function that does both jobs. It uses the xarray data structures returned by the v2.0-dev branch.

import numpy as np
import xarray as xr

def standardize_track_dataset(TrackedFeatures, Mask, Projection):
    """ Combine a feature mask Mask with the feature data table TrackedFeatures
    into a common dataset. Requires a CF-compliant Projection variable corresponding to the Mask,
    which indicates the earth-based coordinate system used for Mask.

    Variable names in TrackedFeatures are renamed to follow the parent-child hierarchy
    cell-feature, as in the cf-tree convention. Names are more descriptive and additional long
    descriptions and metadata are added where appropriate.

    Mask is as returned by tobac.themes.tobac_v1.segmentation
    TrackedFeatures is as returned by tobac.themes.tobac_v1.linking_trackpy.
    Projection is an xarray DataArray; pass the relevant variable from the source gridded dataset.

    Adds a cell dimension and a cell_id variable with the list of unique cell IDs. cell_id includes
    the ID used for features with no parent cell, preserving a self-consistent parent-child tree.

    Returns a combined xarray dataset.

    TODO: Add metadata attributes for each variable.

    """
    feature_standard_names = {
        # new variable name, and long description for the NetCDF attribute

        # feature dimension
        'frame':('feature_time_index',
            'positional index of the feature along the time dimension of the mask, from 0 to N-1'),
        'hdim_1':('feature_hdim1_coordinate',
            'position of the feature along the first horizontal dimension in grid point space;'
            ' a north-south coordinate for dim order (time, y, x).'
            ' The numbering is consistent with positional indexing of the coordinate, but can be'
            ' fractional, to account for a centroid not aligned to the grid.'),
        'hdim_2':('feature_hdim2_coordinate',
            'position of the feature along the second horizontal dimension in grid point space;'
            ' an east-west coordinate for dim order (time, y, x).'
            ' The numbering is consistent with positional indexing of the coordinate, but can be'
            ' fractional, to account for a centroid not aligned to the grid.'),
        'idx':('feature_id_this_frame',
            'Feature number within that frame; starts at 1, increments by 1 to the number of'
            ' features for each frame, and resets to 1 when the frame increments'),
        'num':('feature_grid_cell_count',
            'Number of grid points that are within the threshold of this feature'),
        'threshold_value':('feature_threshold_max',
            'Maximum threshold value reached by this feature'),
        'feature':('feature_id', "Unique number of the feature;"
            " starts from 1 and increments by 1 to the number of features"),
        'time':('feature_time','time of the feature, consistent with feature_time_index'),
        'timestr':('feature_time_str',
            'String representation of the feature time, YYYY-MM-DD HH:MM:SS'),
        'projection_y_coordinate':('feature_projection_y_coordinate',
            'y position of the feature in the projection given by ProjectionCoordinateSystem'),
        'projection_x_coordinate':('feature_projection_x_coordinate',
            'x position of the feature in the projection given by ProjectionCoordinateSystem'),
        'lat':('feature_latitude','latitude of the feature'),
        'lon':('feature_longitude','longitude of the feature'),
        'ncells':('feature_ncells','number of grid cells for this feature (meaning uncertain)'),
        'areas':('feature_area', 'area of this feature'),
        'cell':('feature_parent_cell_id', 'the cell_id to which this feature belongs'),
        'time_cell':('feature_parent_cell_elapsed_time',
            'elapsed time since the first feature in this cell'),
    }

    # mask variable(s)
    mask_standard_names = {'segmentation_mask':('feature_mask',
        'spatiotemporal grid of feature IDs corresponding to the spatial extent of features'),
        }

    new_feature_var_names = {k:feature_standard_names[k][0] for k in feature_standard_names.keys()
                             if k in TrackedFeatures.variables.keys()}
    new_mask_names = {k:mask_standard_names[k][0] for k in mask_standard_names.keys()
                             if k in Mask.variables.keys()}

    # Combine Track and Mask variables.
    # Use the 'feature' variable as the coordinate variable instead
    # of the 'index' variable and call the dimension 'feature'
    RenamedFeatures = TrackedFeatures.swap_dims(
        {'index':'feature'}).drop_vars('index').rename_vars(new_feature_var_names)
    RenamedMask = Mask.rename_vars(new_mask_names)
    for var, description in feature_standard_names.values():
        if var in RenamedFeatures.variables:
            RenamedFeatures[var].attrs['long_name'] = description
    for var, description in mask_standard_names.values():
        if var in RenamedMask.variables:
            RenamedMask[var].attrs['long_name'] = description
    ds = xr.merge([RenamedFeatures, RenamedMask])

    # Restore the projection data.
    ds['ProjectionCoordinateSystem'] = Projection

    # Create a new cell dimension, and assign cell_id as its coordinates.
    cell_id = np.unique(ds.feature_parent_cell_id)
    combined = ds.assign_coords({'cell_id':('cell', cell_id)})
    combined['cell_id'].attrs['long_name'] = 'unique ID number for this cell'

    return combined
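To make the rename/merge pattern concrete, here is a minimal self-contained sketch of the same `swap_dims` / `rename_vars` / `merge` steps, using tiny synthetic stand-ins for the tobac outputs (the data, sizes, and variable subset here are illustrative, not real tobac output):

```python
import numpy as np
import xarray as xr

# Tiny synthetic stand-ins for the tobac outputs (illustrative only;
# real tobac datasets carry many more variables and coordinates).
tracked = xr.Dataset(
    {"hdim_1": ("index", [10.5, 20.0]), "cell": ("index", [1, 1])},
    coords={"index": [0, 1], "feature": ("index", [1, 2])},
)
mask = xr.Dataset(
    {"segmentation_mask": (("time", "y", "x"), np.zeros((1, 2, 2), dtype=int))}
)

# Same pattern as standardize_track_dataset: make 'feature' the coordinate
# variable, drop the old 'index', and rename to the hierarchical names.
renamed = (
    tracked.swap_dims({"index": "feature"})
    .drop_vars("index")
    .rename_vars({"hdim_1": "feature_hdim1_coordinate",
                  "cell": "feature_parent_cell_id"})
)
combined = xr.merge([renamed, mask.rename_vars({"segmentation_mask": "feature_mask"})])

# New cell dimension with the unique parent-cell IDs as its coordinate.
cell_id = np.unique(combined["feature_parent_cell_id"])
combined = combined.assign_coords({"cell_id": ("cell", cell_id)})
```

The result carries the feature table, the mask, and the cell IDs in one dataset, exactly the shape of output the function above produces.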

There are really two issues here:

  1. variable names and descriptive metadata, which could be pushed upstream into the functions that create each variable, at the cost of compatibility with current uses of those functions.
  2. a convenience function for consolidating the datasets emitted by the steps in a tobac workflow.

Note that step (2) is really only a few lines once the tedious step (1) has been accomplished.

The output data structure looks like this, with attributes expanded to be visible for a few variables (see the screenshot in the original issue).

What do we think about including a function like this in tobac? Perhaps this dual-purpose function could still fit in v1.x without causing breakage, and would establish the idea of combining datasets. Then v2.0 could aim to more fully rename variables by default throughout the library, simplifying the dataset-combining function.

deeplycloudy avatar Mar 04 '22 23:03 deeplycloudy

Hi Eric, I'm definitely supportive of this. My thoughts on your two points, in reverse order:

  1. The creation of a method to consolidate the multiple "v1.0" data outputs into a single "v2.0" combined dataset. As you mention, this should be fairly straightforward and would be a useful tool both for converting workflows from v1 to v2 and for the development of v2. For now we should probably keep the same variable names.

  2. Variable names. I'd like these to be user-defined, with the defaults being the current variable names used in tobac. Having a "name" keyword argument for the input/output of functions wouldn't be too much effort, and would allow more flexibility both in applications and in the levels of hierarchy that different approaches are applied to. With the combined dataset approach we could also include some automatic naming of variables. For example, instead of "hdim_1" and "hdim_2" for feature positions, these variables could be automatically named according to the coords of the input dataset.
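That automatic naming could be as simple as deriving position-variable names from the input data's spatial dims. A minimal sketch, using a hypothetical helper name (`auto_position_names` is not an existing tobac function):

```python
def auto_position_names(spatial_dims):
    """Map tobac's generic hdim_1/hdim_2 names to names derived from the
    spatial dims of the input dataset (hypothetical helper, for illustration).
    """
    return {
        "hdim_%d" % (i + 1): "feature_%s_coordinate" % d
        for i, d in enumerate(spatial_dims)
    }

# For an input with dim order (time, y, x), the spatial dims are (y, x):
names = auto_position_names(["y", "x"])
```

The returned mapping could then be passed straight to `rename_vars` when building the combined dataset.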

w-k-jones avatar Mar 07 '22 11:03 w-k-jones

Agreed with @w-k-jones. Maybe a good way to facilitate the transition from v1.x to v2.x here would be to allow users to request xarray output. We could also automatically convert from Iris (segmentation) and Pandas (feature detection/tracking) to xarray in a v1.x version of this function. We already have xarray as a requirement, so it shouldn't be that painful.
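The pandas side of that conversion is already a one-liner; a sketch with made-up feature data (the Iris path is noted in a comment since it needs iris installed):

```python
import pandas as pd

# Feature detection/tracking in tobac v1.x returns a pandas DataFrame;
# conversion to an xarray Dataset is a single call once the feature ID
# is set as the index.
df = pd.DataFrame(
    {"feature": [1, 2, 3], "hdim_1": [10.5, 20.0, 31.2]}
).set_index("feature")
features_ds = df.to_xarray()  # xarray.Dataset with a 'feature' dimension

# Segmentation returns an iris Cube; the analogous one-liner would be
#   mask_da = xarray.DataArray.from_iris(mask_cube)
# (requires iris to be installed).
```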

I also agree in principle that we should allow users to specify the names of the variables, but I will note that it could cause some confusion as/if users share data. It would also cause us a compatibility headache as we pass data back and forth between functions, especially as the various pieces of tobac grow. I'm not sure what the precedents are here of what other similar libraries do. For the descriptions, completely agreed with @deeplycloudy - these metadata should absolutely be included in our output.

freemansw1 avatar Mar 07 '22 15:03 freemansw1

Regarding an ability to request xarray output, definitely. This would pull some clutter out of our scripts, and would nudge/provide a path for users to adopt xarray as we look toward v2.0.

I think it probably goes too far to break compatibility at this time. However, if we add any features to tobac for cells and tracks (i.e., new feature parents in the hierarchy) we could have duplicate names. Maybe we keep the current names for features, and then adopt a prefix-style convention (as above) for any higher levels that we add?

I also concur with @freemansw1 that we need something consistent internally so that data structures work with any function. If we do move to a user-defined naming convention, we could add a tobac_standard_name attribute, analogous to how CF enables machine-automated detection of spatial coordinates by having a standard_name attribute indicating that a variable named track_center_longitude is a longitude. It would require functions (or a function decorator) to detect these attributes and standardize the variable names before processing by internal functions.
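The attribute-detection step could look something like this sketch (the `tobac_standard_name` attribute and the `standardize_names` helper are proposals from this discussion, not existing tobac API):

```python
import xarray as xr

def standardize_names(ds):
    """Rename any variable carrying a 'tobac_standard_name' attribute back to
    that standard name, so internal functions see consistent variable names.
    Hypothetical helper illustrating the proposed convention."""
    renames = {
        name: var.attrs["tobac_standard_name"]
        for name, var in ds.variables.items()
        if "tobac_standard_name" in var.attrs
        and name != var.attrs["tobac_standard_name"]
    }
    return ds.rename(renames)

# A user-named variable tagged with its tobac standard name:
ds = xr.Dataset({"track_center_longitude": ("feature", [0.0, 1.0])})
ds["track_center_longitude"].attrs["tobac_standard_name"] = "feature_longitude"
standard = standardize_names(ds)
```

A decorator wrapping each public function could call this on entry and invert the mapping on exit, so user-facing names round-trip unchanged.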

deeplycloudy avatar Mar 11 '22 14:03 deeplycloudy

I believe that this has actually been resolved with the merge of #136, at least on an experimental basis. @deeplycloudy or @kelcyno any thoughts?

freemansw1 avatar Nov 17 '22 21:11 freemansw1

I'm not sure if this fully addresses what @deeplycloudy was needing, but the combined xarray dataset of feature/segmentation/tracking is available now in the utils (standardize_track_dataset) as of merge #136.

kelcyno avatar Nov 17 '22 21:11 kelcyno