tobac Saving more information on data distribution within feature

Currently, the center point location (lat, lon / x, y), the threshold (threshold_value), and the number of grid cells within that threshold (num) are the only information we save about individual features (see Feature detection output).

It might be useful to add some bulk statistics about the data points within each feature or segmented area (e.g. min, max, mean, percentiles, and optionally the sum of all values) to the feature and segmentation dataframes. This idea was originally suggested in issue #18 and discussed in #52, but I split it off from issue #18 because it is a separate task. @freemansw1 suggested that this could be something for either v1.4.x or v1.5.x, and that it may be good to wait until we have moved completely to xarray.
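To make the proposal concrete, here is a minimal sketch of per-feature bulk statistics computed from a labeled mask. The function name `bulk_statistics`, the output column names, and the convention that label 0 means background are assumptions for illustration only; this is not existing tobac API.

```python
import numpy as np
import pandas as pd


def bulk_statistics(field, mask, percentiles=(25, 50, 75)):
    """Return one row of bulk statistics per nonzero label in `mask`.

    Hypothetical helper: `field` holds the data values and `mask` holds an
    integer feature label for every grid cell (0 is assumed to be background).
    """
    rows = []
    for label in np.unique(mask):
        if label == 0:  # assumption: 0 marks cells outside any feature
            continue
        values = field[mask == label]
        row = {
            "feature": int(label),
            "num": int(values.size),
            "min": float(values.min()),
            "max": float(values.max()),
            "mean": float(values.mean()),
            "sum": float(values.sum()),
        }
        for p in percentiles:
            row[f"p{p}"] = float(np.percentile(values, p))
        rows.append(row)
    return pd.DataFrame(rows)


# Toy example: two labeled regions on a 3x3 grid.
field = np.array([[1.0, 2.0, 0.0],
                  [3.0, 4.0, 10.0],
                  [0.0, 20.0, 30.0]])
mask = np.array([[1, 1, 0],
                 [1, 1, 2],
                 [0, 2, 2]])
stats = bulk_statistics(field, mask)
```

Assuming the feature dataframe carries a matching `feature` column, the result could be attached with something like `features.merge(stats, on="feature", how="left")`.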

Some things that need to be discussed before working on a PR:

In addition to bulk statistics, can we save the location of all feature points (as suggested by @freemansw1 in #52)? Maybe optionally create a mask file, just like the one we create for the segmentation?
Is it smarter to implement this as a postprocessing step or within the feature detection/segmentation code? It would certainly be more computationally efficient, and fairly easy, to save this information directly within the feature detection/segmentation. However, making it a postprocessing step could be an advantage, because this would allow users to derive the same information from a different dataset (e.g. deriving precipitation statistics within a cloud mask from a brightness-temperature-based tracking).
Maybe it would also be useful to have a function that updates the segmentation mask based on a modified feature dataframe. That way, one could easily use these additional statistics to filter out cells with specific properties and make sure that the mask files only contain those. I can see a lot of use cases for this, but it could also be a postprocessing task that users should take care of themselves.

Jul 02 '22 16:07 JuliaKukulies

Thanks for splitting off this issue! Here are some of my thoughts:

In addition to bulk statistics, can we save the location of all feature points (as suggested by @freemansw1 in https://github.com/tobac-project/tobac/issues/52)? Maybe optionally create a mask file, just like the one we create for the segmentation?

I am inclined to wait for v2.x for this, as doing this with Iris sounds like a headache. We could push to get this into v1.x, but I think the best way to implement this on the backend would be to run it through xarray and only convert to Iris at the end for output. I'm also not sure how many people would use this feature, and we would need to document it thoroughly to avoid confusion with segmentation.

Is it smarter to implement this as a postprocessing step or within the feature detection/segmentation code? It would certainly be more computationally efficient, and fairly easy, to save this information directly within the feature detection/segmentation. However, making it a postprocessing step could be an advantage, because this would allow users to derive the same information from a different dataset (e.g. deriving precipitation statistics within a cloud mask from a brightness-temperature-based tracking).

We could implement this in utils and optionally call it during feature detection? That would give us the best of both worlds: better efficiency, while still allowing users to call it as a postprocessing step.
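One way this could look is sketched below. Everything here is hypothetical: `region_statistics` stands in for the utility function, and `toy_detect` is a deliberately simplified stand-in for feature detection (thresholding plus `scipy.ndimage.label`), not tobac's actual implementation. The point is only that the same helper can be called from inside detection or afterwards on a different field.

```python
import numpy as np
from scipy import ndimage


def region_statistics(field, labels, statistics):
    """Apply each named reducer to the values of `field` inside every label.

    Hypothetical utility: `statistics` maps an output name to a function
    that reduces a 1D array of values to a single number.
    """
    index = np.unique(labels[labels > 0])  # assumption: 0 is background
    return {
        name: ndimage.labeled_comprehension(
            field, labels, index, func, float, np.nan
        )
        for name, func in statistics.items()
    }


def toy_detect(field, threshold, statistics=None):
    """Toy stand-in for feature detection with an optional statistics step."""
    labels, n_features = ndimage.label(field > threshold)
    features = {"feature": np.arange(1, n_features + 1)}
    if statistics is not None:
        # Same helper, called while the labeled regions are already in memory.
        features.update(region_statistics(field, labels, statistics))
    return features, labels


reducers = {"min": np.min, "max": np.max, "mean": np.mean}

# Detection field (e.g. the tracked variable) and a second, different field.
tracked = np.array([[5.0, 6.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0, 9.0]])
other = np.array([[1.0, 3.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 7.0]])

# Called during detection ...
features, labels = toy_detect(tracked, threshold=4.0, statistics=reducers)
# ... or as a postprocessing step on another dataset, reusing the labels.
other_stats = region_statistics(other, labels, reducers)
```

Passing the reducers as a dictionary keeps the set of statistics user-configurable, which is one possible design and not something decided in this thread.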

Maybe it would also be useful to have a function that updates the segmentation mask based on a modified feature dataframe. That way, one could easily use these additional statistics to filter out cells with specific properties and make sure that the mask files only contain those. I can see a lot of use cases for this, but it could also be a postprocessing task that users should take care of themselves.

Hm... I'm not sure I know which way to go on this. Are you thinking re-running segmentation with the different feature dataframe (therefore changing the result of segmentation), or keeping segmentation and just zeroing/unassigning features not included in this modified dataset?
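For reference, the second option (keeping the segmentation and unassigning features that are no longer in the dataframe) could be sketched roughly as follows. The function name `filter_mask`, the `feature` column name, and the background value of 0 are assumptions for illustration, not existing tobac API.

```python
import numpy as np
import pandas as pd


def filter_mask(mask, features, background=0):
    """Reset every label in `mask` that is absent from `features["feature"]`.

    Hypothetical helper: the segmentation itself is not re-run, so the
    outlines of the remaining features stay exactly as they were.
    """
    keep = np.isin(mask, features["feature"].to_numpy())
    return np.where(keep, mask, background)


mask = np.array([[1, 1, 0],
                 [3, 2, 2]])
features = pd.DataFrame({"feature": [1, 2, 3], "max": [4.0, 30.0, 1.5]})

# Filter the dataframe on a statistic, then bring the mask in line with it.
strong = features[features["max"] > 2.0]
filtered = filter_mask(mask, strong)
```

Unlike re-running segmentation, this leaves the areas of removed features unassigned and does not redistribute them to neighboring features.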

Jul 02 '22 22:07 freemansw1

Thanks for your thoughts, @freemansw1!

I am inclined to wait for v2.x for this, as doing this with Iris sounds like a headache.

OK, agreed. So one way forward would be to start working on a PR for the bulk statistics and then see whether this should be extended to every feature point?

We could implement this in utils and optionally call it during feature detection?

That sounds like a smart solution!

Hm... I'm not sure I know which way to go on this. Are you thinking re-running segmentation with the different feature dataframe (therefore changing the result of segmentation), or keeping segmentation and just zeroing/unassigning features not included in this modified dataset?

I was thinking of the second option, but now that you remind me that it is always possible to just re-run the segmentation (which is rather fast now with your updates) based on a different feature dataframe (or even a track dataframe, if you only wish to get the linked features), this may actually be a redundant feature.

Jul 03 '22 10:07 JuliaKukulies
