tobac
Cell IDs are floats, while feature IDs are ints
In the tobac_v1 theme's tracking step, a cell ID is assigned, and the convention is to use NaN to mark that a feature has not been assigned to a cell track, so the cell ID for each feature has to be a float.
However, for the segmentation mask the convention is to use 0 if no feature is found at a grid box, so the feature ID can be an integer.
I'd prefer integer IDs, since those aren't subject to float rounding problems in the case of very large integers, and they compress better. The downside is having to adopt a particular integer to mark missing data, though there are CF conventions for that, i.e., the valid_range and _FillValue attributes.
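To illustrate the rounding concern, here is a minimal sketch in plain Python (whose float is a 64-bit double, the same as a float64 array element). The variable names are made up for the illustration and nothing here is tobac code.

```python
# Smallest positive integer that a 64-bit float cannot represent exactly.
big_id = 2**53 + 1

# Converting to float rounds to the nearest representable double.
as_float = float(big_id)
round_trip = int(as_float)

# Two distinct integer IDs become indistinguishable once stored as floats.
ids_collide = float(big_id) == float(big_id - 1)

print(round_trip == big_id)  # the ID does not survive the round trip
print(ids_collide)           # neighbouring IDs compare equal as floats
```

IDs this large are unlikely in practice, but an integer dtype removes the question entirely.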
Would the maintainers be open to switching the cell IDs to integers?
We've had the same discussions here at CSU. My thought is to have this be a user-set parameter (perhaps defaulted to np.nan to keep compatibility?), but I'm not sure what everyone else thinks.
Worth noting that if we switch this over to a user-set parameter, it would be a good idea to do the same for segmentation.
One more clarification: the cell variable in the tracking output is actually an object array like this: array([nan, 2, 3], dtype=object). I thought it had a float dtype, but that was an artifact of having saved the tracking data using .to_netcdf, which seems to have cast it to float.
@deeplycloudy Yes, because np.NaN is a float, it forces the entire array of integers to become floating-point numbers. See https://github.com/pydata/xarray/issues/6091#issuecomment-998789248
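A small NumPy sketch of the dtype behaviour described in the last two comments. It only shows how NumPy picks a dtype when NaN is mixed with integers; it does not reproduce the .to_netcdf cast itself, and the array names are illustrative.

```python
import numpy as np

# Plain integer IDs keep an integer dtype.
int_ids = np.array([1, 2, 3])

# Mixing in NaN promotes the whole array to float64.
with_nan = np.array([np.nan, 2, 3])

# An object array keeps Python ints next to NaN, as in the tracking output.
as_object = np.array([np.nan, 2, 3], dtype=object)

# An integer sentinel (here -1) avoids the promotion altogether.
with_sentinel = np.array([-1, 2, 3])

print(int_ids.dtype, with_nan.dtype, as_object.dtype, with_sentinel.dtype)
```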
It may also be worth re-considering how we approach the convention for segmentation masks as part of this discussion. Presently, a value of 0 in the mask can mean one of two things: 1) the grid cell is below the prescribed threshold for the segmentation field, or 2) the grid cell is above the threshold but was not watershed into by a feature. It might be prudent to, say, keep the latter as 0 while setting those ineligible points to a value of -1. This is the approach that I used for introducing periodic boundary segmentation, as we want to make sure we watershed into eligible fields where a feature is on the other side of a boundary, and the present method doesn't make this distinction.
Just fixed the same problem in some of my own code, and would agree that setting missing/invalid values to -1 is an improvement over NaN. We could use a larger negative number, such as -999 or -9999 as in some netCDF datasets, to more clearly signify that this is invalid data. I would propose setting the convention within tobac that any negative integer is treated as an ineligible point in these circumstances, with 0 for unassigned/unsegmented points.
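The mask convention proposed in the two comments above could look roughly like this. It is a sketch under stated assumptions: the field, threshold, and watershed labels are invented toy data, and the labelling step is faked with a hand-written array rather than a call into tobac's segmentation.

```python
import numpy as np

# Toy segmentation field and an assumed threshold.
field = np.array([[0.2, 0.9, 0.8],
                  [0.1, 0.7, 0.6],
                  [0.0, 0.3, 0.9]])
threshold = 0.5

# Stand-in for a watershed result: positive feature IDs where a feature
# claimed the grid box, 0 everywhere else.
watershed_labels = np.array([[0, 1, 1],
                             [0, 1, 0],
                             [0, 0, 0]])

# Proposed convention: negative = ineligible (below threshold),
# 0 = eligible but not claimed by any feature, positive = feature ID.
mask = watershed_labels.copy()
mask[field < threshold] = -1

ineligible = mask < 0   # any negative integer counts as ineligible
unassigned = mask == 0
print(mask)
```

With this split, the two meanings that 0 currently carries can be told apart with a simple comparison.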
I've put my thoughts on a resolution to this in #74. I'm certainly open to comments on that PR, but I've decided for now to just address the cell issue rather than the segmentation issue as well. Given that we allow users to set the start cell number, I went with the approach of allowing users to set the invalid cell number as well.
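As a rough picture of what a user-set invalid cell number means, here is a hypothetical helper. The function name, its parameter names, and the assumption that raw track IDs arrive zero-based with NaN for untracked features are all made up for this sketch; see #74 for the actual change.

```python
import numpy as np

def renumber_cells(raw_ids, cell_number_start=1, cell_number_unassigned=-1):
    """Hypothetical helper (not tobac's API): map raw track IDs to integer
    cell IDs, replacing NaN with a user-chosen integer."""
    raw = np.asarray(raw_ids, dtype=float)
    # Start with every feature marked as unassigned.
    cells = np.full(raw.shape, cell_number_unassigned, dtype=np.int64)
    assigned = ~np.isnan(raw)
    # Shift the assumed zero-based raw IDs so numbering starts at
    # cell_number_start.
    cells[assigned] = raw[assigned].astype(np.int64) + cell_number_start
    return cells

default_cells = renumber_cells([np.nan, 0, 1, 1])
custom_cells = renumber_cells([np.nan, 0, 1, 1], cell_number_unassigned=-999)
print(default_cells, custom_cells)
```

The result stays an integer array whichever invalid value the user picks, as long as that value cannot collide with a real cell number.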
Thanks @freemansw1! If/once #74 goes through, I agree with your suggestion above to make the segmentation ID convention user selectable (with the same kwarg naming convention as for cells).
The unassigned cell number problem for tracking has now been resolved in dev with #74. This still doesn't resolve the segmentation problem, but I think we will try to address that after PBCs are in. As @galexsky says above, we use those markers internally as well.
#285 is the resolution to this on the segmentation side. I'll close this issue when that is merged into the 1.5 RC.
Resolved with #285