tobac Cell IDs are floats, while feature IDs are ints

In the tobac_v1 theme's tracking step, a cell ID is assigned, and the convention is to use NaN to mark that a feature has not been assigned to a cell track, so the cell ID for each feature has to be a float.

However, for the segmentation mask the convention is to use 0 if no feature is found at a grid box, so the feature ID can be an integer.

I'd prefer integer IDs, since those aren't subject to float rounding problems in the case of very large integers, and they compress better. The downside is having to adopt a particular integer to mark missing data, though there are CF conventions for that, i.e., the valid_range and _FillValue attributes.
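To make the rounding concern concrete, here is a minimal sketch using NumPy. The ID value is made up for illustration; the point is only that float64 has a 53-bit significand, so not every integer above 2**53 survives a round trip through a float.

```python
import numpy as np

# An illustrative (made-up) ID just past the largest range where
# float64 can represent every integer exactly.
big_id = 2**53 + 1

as_float = np.float64(big_id)  # stored as the nearest representable float
as_int = np.int64(big_id)      # stored exactly

# Compare the values recovered from each representation with the original.
float_roundtrip_ok = int(as_float) == big_id
int_roundtrip_ok = int(as_int) == big_id

print(float_roundtrip_ok, int_roundtrip_ok)
```

IDs this large are unlikely in practice, so this is a correctness edge case rather than an everyday failure.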

Would the maintainers be open to switching the cell IDs to integers?

Jan 18 '22 00:01 deeplycloudy

We've had the same discussions here at CSU. My thought is to make this a user-set parameter (perhaps defaulting to np.nan to keep compatibility?), but I'm not sure what everyone else thinks.

Jan 18 '22 04:01 freemansw1

Worth noting that if we switch this over to a user-set parameter, it would be a good idea to do the same for segmentation.

Jan 18 '22 04:01 freemansw1

One more clarification: the cell variable in the tracking output is actually an object array like this: array([nan, 2, 3], dtype=object). I thought it had a float dtype, but that was an artifact of having saved the tracking data using .to_netcdf, which seems to have cast it to float.

Jan 19 '22 19:01 deeplycloudy

@deeplycloudy Yes, because np.NaN is a float, it forces the entire array of integers to become floating-point numbers. See https://github.com/pydata/xarray/issues/6091#issuecomment-998789248
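The dtype behaviour described here can be reproduced with plain NumPy. This is a small sketch with made-up values, not tobac's actual tracking output:

```python
import numpy as np

# A plain list of integers gets an integer dtype.
ids = np.array([1, 2, 3])

# Adding NaN promotes the whole array to float64, because NaN is a float.
with_nan = np.array([np.nan, 2, 3])

# With dtype=object the elements are stored as the Python objects given,
# so the integers stay integers alongside the float NaN.
as_object = np.array([np.nan, 2, 3], dtype=object)

print(ids.dtype.kind, with_nan.dtype, as_object.dtype)
```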

Jan 19 '22 20:01 zxdawn

It may also be worth re-considering how we approach the convention for segmentation masks as part of this discussion. Presently, a value of 0 in the mask can mean one of two things: 1) the grid cell is below the prescribed threshold for the segmentation field, or 2) the grid cell is above the threshold but was not watershed into by a feature. It might be prudent to, say, keep the latter as 0 while setting those ineligible points to a value of -1. This is the approach that I used for introducing periodic boundary segmentation, as we want to make sure we watershed into eligible fields where a feature is on the other side of a boundary, and the present method doesn't make this distinction.
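As a rough sketch of that three-state convention (not the actual tobac segmentation code), the labels below are hypothetical stand-ins for the output of a watershed step, and the field values and threshold are made up:

```python
import numpy as np

# Made-up segmentation field and threshold.
field = np.array([[0.1, 0.6, 0.9],
                  [0.2, 0.7, 0.8],
                  [0.1, 0.6, 0.3]])
threshold = 0.5

# Hypothetical labels as if produced by a watershed step:
# feature ID where a feature grew into the point, 0 otherwise.
labels = np.array([[0, 1, 1],
                   [0, 0, 1],
                   [0, 0, 0]])

# Points below the threshold become -1 (ineligible); eligible points keep
# their label, so 0 now only means "eligible but not watershed into".
mask = np.where(field >= threshold, labels, -1)

ineligible = mask < 0
unsegmented = mask == 0
print(mask)
```

With this layout, `mask > 0` still selects segmented points, so code that only tests for positive feature IDs would not need to change.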

Jan 20 '22 18:01 galexsky

Just fixed the same problem in some of my own code, and I agree that setting missing/invalid values to -1 is an improvement over NaN. We could use a larger negative number, such as -999 or -9999 as in some netCDF datasets, to signify more clearly that this is invalid data. I would propose setting the convention within tobac that any negative integer is treated as an ineligible point in these circumstances, with 0 for unassigned/unsegmented points.

Feb 03 '22 23:02 w-k-jones

I've put my thoughts on a resolution to this in #74. I'm certainly open to comments on that PR, but I've decided for now to address only the cell issue rather than the segmentation issue as well. Given that we allow users to set the start cell number, I went with the approach of allowing users to set the invalid cell number as well.
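For readers who want the general idea without opening the PR, a user-selectable marker could look roughly like the sketch below. The function name and the `unassigned` keyword are hypothetical and chosen only for illustration; this is not the code from #74.

```python
import numpy as np

def fill_unassigned_cells(cell, unassigned=-1):
    """Hypothetical helper: replace NaN cell IDs with an integer marker.

    Note: this routes through float64, so it is only exact for IDs
    whose magnitude is below 2**53.
    """
    cell = np.asarray(cell, dtype=float)
    return np.where(np.isnan(cell), unassigned, cell).astype(np.int64)

# Default marker, then a user-chosen one.
cells = fill_unassigned_cells([np.nan, 2, 3])
cells_custom = fill_unassigned_cells([np.nan, 2, 3], unassigned=-999)
print(cells, cells_custom)
```

The result has an integer dtype, and the caller decides which value means "not part of any cell track".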

Feb 18 '22 18:02 freemansw1

Thanks @freemansw1! If/once #74 goes through, I agree with your suggestion above to make the segmentation ID convention user selectable (with the same kwarg naming convention as for cells).

Mar 04 '22 21:03 deeplycloudy

The unassigned cell number problem for tracking has now been resolved in dev with #74. This still doesn't resolve the segmentation problem, but I think we will try to address that after PBCs are in. As @galexsky says above, we use those markers internally as well.

Mar 28 '22 15:03 freemansw1

#285 is the resolution to this on the segmentation side. I'll close this issue when that is merged into the 1.5 RC.

May 24 '23 18:05 freemansw1

Resolved with #285

May 30 '23 15:05 freemansw1
