[POC] implement scalar metrics
Context
This ticket proposes an initial implementation of scalar metrics in xDEM.
We assume that these metrics can be attributes of a DEM, just like the slope, and that users can retrieve them as follows: `dem_mean = xdem.scalar_metric.mean(dem)`. If this API is validated, here is our proposal:
Code
- [ ] Create the file `scalar_metrics.py` (inspired by `terrain.py`)
- [ ] List of metrics to implement (a sketch of possible one-line implementations follows the list):
  - "mean"
  - "max"
  - "min"
  - "std"
  - "rmse"
  - "median"
  - "nmad"
  - "sum"
  - "sumSquaredErr"
  - "Percentil90"
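Most of these metrics map directly onto NumPy/SciPy one-liners. A minimal sketch of such a mapping, assuming the input is a NaN-filled float array (nodata already masked as NaN); the dictionary name and the use of `scipy.stats.median_abs_deviation` for the NMAD are illustrative assumptions, not part of the proposal:

```python
import numpy as np
from scipy.stats import median_abs_deviation

# Hypothetical name-to-implementation mapping, `arr` being a NaN-filled float array.
_METRIC_FUNCS = {
    "mean": np.nanmean,
    "max": np.nanmax,
    "min": np.nanmin,
    "std": np.nanstd,
    "rmse": lambda arr: np.sqrt(np.nanmean(arr**2)),
    "median": np.nanmedian,
    # scale="normal" rescales the MAD by ~1.4826, i.e. the NMAD convention.
    "nmad": lambda arr: median_abs_deviation(arr, axis=None, scale="normal", nan_policy="omit"),
    "sum": np.nansum,
    "sumSquaredErr": lambda arr: np.nansum(arr**2),
    "Percentil90": lambda arr: np.nanpercentile(arr, 90),
}
```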
- [ ] Implement the `get_scalar_metric` function with all the options from the above list:
```python
from typing import overload

import numpy as np
from geoutils import Raster
from geoutils.raster import RasterType, get_array_and_mask

from xdem._typing import MArrayf, NDArrayf


@overload
def get_scalar_metric(
    dem: NDArrayf | MArrayf | RasterType,
    attribute: str,
) -> np.float32:
    ...


@overload
def get_scalar_metric(
    dem: NDArrayf | MArrayf | RasterType,
    attribute: list[str],
) -> list[np.float32]:
    ...


def get_scalar_metric(
    dem: NDArrayf | MArrayf | RasterType,
    attribute: str | list[str],
) -> np.float32 | list[np.float32]:
    """
    Calculate one or several scalar metrics of a DEM.

    :param dem: The DEM to analyze.
    :param attribute: The scalar metric(s) to calculate.

    :raises ValueError: If the inputs are poorly formatted or invalid.

    :examples:
        >>> dem = np.repeat(np.arange(3), 3).reshape(3, 3)
        >>> dem
        array([[0, 0, 0],
               [1, 1, 1],
               [2, 2, 2]])
        >>> dem_min, dem_max = get_scalar_metric(dem, ["min", "max"])
        >>> dem_min
        0
        >>> dem_max
        2

    :returns: One or several scalar values for the requested metric(s).
    """
    # Validate and format the inputs.
    if isinstance(attribute, str):
        attribute = [attribute]

    # Initialize the scalar_metrics dictionary, which will be filled with the requested values.
    scalar_metrics: dict[str, np.float32] = {}

    # Get the array of the DEM, with nodata values converted to NaN.
    if isinstance(dem, np.ma.masked_array):
        dem_arr = get_array_and_mask(dem, check_shape=False)[0]
    elif isinstance(dem, Raster):
        dem_arr = get_array_and_mask(dem.data, check_shape=False)[0]
    else:
        dem_arr = np.asarray(dem)

    if "mean" in attribute:
        scalar_metrics["mean"] = np.nanmean(dem_arr)
    if "max" in attribute:
        scalar_metrics["max"] = np.nanmax(dem_arr)
    if "min" in attribute:
        scalar_metrics["min"] = np.nanmin(dem_arr)
    # ... and so on for the remaining metrics of the list above.

    # Return the values in the order they were requested.
    output_attributes = [scalar_metrics[name] for name in attribute]
    return output_attributes if len(output_attributes) > 1 else output_attributes[0]
```
- [ ] Each metric should be implemented as follows:
```python
@overload
def mean(
    dem: NDArrayf | MArrayf,
) -> np.float32:
    ...


@overload
def mean(
    dem: RasterType,
) -> np.float32:
    ...


def mean(
    dem: NDArrayf | MArrayf | RasterType,
) -> np.float32:
    """
    Calculate the mean value of a DEM.

    :param dem: The input DEM to calculate the mean from.

    :raises AssertionError: If the given DEM is not a 2D array.
    :raises ValueError: If invalid argument types or ranges were given.

    :returns: The mean value, with the dtype "float32".
    """
    return get_scalar_metric(
        dem,
        attribute="mean",
    )
```
- [ ] Handling the NMAD metric:
  - [ ] It is already implemented; retrieve it from the `spatialstats` file.
  - [ ] It is called in various places; reinitialize it as `nmad = xdem.scalar_metric.nmad(dem)` in:
    - [ ] `affine.py`
    - [ ] `base.py`
    - [ ] `workflows.py`
    - [ ] `spatialstats.py`
- [ ] The `nfact` is hardcoded in demcompare and parameterized in xdem; if we decide to use it as a parameter, modify the APIs accordingly, e.g.:

```python
@overload
def get_scalar_metric(
    dem: RasterType,
    attribute: list[str],
    nfact: float = 1.4826,
) -> np.float32:
    ...
```
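For reference, the existing NMAD in `spatialstats` boils down to a one-liner, so the parameterized version could look like the sketch below (reusing the imports of the module above; the nodata handling via `get_array_and_mask` is an assumption):

```python
def nmad(data: NDArrayf | MArrayf, nfact: float = 1.4826) -> np.float32:
    """
    Sketch of a parameterized NMAD: nfact * median(|data - median(data)|).
    nfact = 1.4826 makes the NMAD converge to the standard deviation for
    normally distributed data; nfact = 1.0 yields the plain MAD.
    """
    data_arr = get_array_and_mask(data, check_shape=False)[0]
    return nfact * np.nanmedian(np.abs(data_arr - np.nanmedian(data_arr)))
```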
- [ ] From the `DEM` class, each metric should be implemented as follows:

```python
@copy_doc(scalar_metrics, remove_dem_res_params=True)
def mean(
    self,
) -> float:
    return scalar_metrics.mean(self)
```
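A user would then write, for instance (the example dataset comes from xdem's existing `examples` module):

```python
import xdem

dem = xdem.DEM(xdem.examples.get_path("longyearbyen_ref_dem"))
dem_mean = dem.mean()  # equivalent to scalar_metrics.mean(dem)
```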
Tests
- [ ] Create the file `test_scalar_metrics.py`
- [ ] Test all possibilities; the unit tests can be retrieved from demcompare: `test_scalar_metric.py`
- [ ] To test with all array types, move the `test_nmad` function from `test_spatial_stats` to `test_scalar_metrics.py`
Example
- [ ] Add an example script in the `basic` folder that displays the scalar metrics of a DEM using the new APIs
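A minimal sketch of such a script, assuming the proposed `xdem.scalar_metrics` module name:

```python
"""Compute and display the scalar metrics of a DEM using the proposed API."""
import xdem

# Load one of xdem's example DEMs.
dem = xdem.DEM(xdem.examples.get_path("longyearbyen_ref_dem"))

# Print each proposed scalar metric.
for metric in ["mean", "max", "min", "std", "rmse", "median", "nmad", "sum"]:
    value = xdem.scalar_metrics.get_scalar_metric(dem, metric)
    print(f"{metric}: {value:.2f}")
```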
Documentation
- [ ] Add a `scalar_metrics.md` file in the xdem documentation, inspired by the `terrain.md` documentation
Estimation
5 days
OK for the need for scalar metrics; the issue is clear.
Will it be possible to choose a metric from a calling pipeline with a configuration file, as in demcompare? My question is to make sure that the xdem code architecture will be flexible enough to offer a kind of generic metric structure, function, call, or object that can be manipulated easily from a CLI requesting one metric or another.
For example, see how scalar metrics are called from the pipeline in demcompare (https://demcompare.readthedocs.io/en/stable/userguide/statistics/metrics.html) and the following example of an input JSON configuration, which illustrates my question:
```json
{
    "output_dir": "./test_output/",
    "input_ref": {
        "path": "./Gironde.tif",
        "nodata": -9999.0
    },
    "input_sec": {
        "path": "./FinalWaveBathymetry_T30TXR_20200622T105631_D_MSL_invert.TIF",
        "nodata": -32768
    },
    "statistics": {
        "alti-diff": {
            "remove_outliers": true,
            "metrics": ["mean", {"ratio_above_threshold": {"elevation_threshold": [1, 2, 3]}}]
        },
        "ref": {
            "remove_outliers": true,
            "metrics": [
                {
                    "slope-orientation-histogram": {
                        "output_plot_path": "path_to_plot"
                    }
                }
            ]
        }
    }
}
```
Will this be possible with the development here, or will we need an evolution or a wrapper on the existing API to generalize metrics? I am trying to foresee possible problems with adding a CLI.
Based on my limited experience with xdem, I believe it's possible. I will create an issue for the metrics in the CLI before developing this one, just to be sure.
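To illustrate why it looks feasible, here is a minimal sketch of how a CLI could dispatch a demcompare-style `metrics` entry onto the proposed function (the function name comes from the proposal above; the dispatcher itself and its parameter-passing convention are assumptions):

```python
from typing import Any

import xdem
from xdem.scalar_metrics import get_scalar_metric  # proposed module, not existing yet


def compute_metrics(dem: xdem.DEM, metrics: list[str | dict[str, Any]]) -> dict[str, Any]:
    """Map a demcompare-style 'metrics' list onto the proposed get_scalar_metric."""
    results: dict[str, Any] = {}
    for entry in metrics:
        if isinstance(entry, str):
            # Plain metric name, e.g. "mean".
            results[entry] = get_scalar_metric(dem, entry)
        else:
            # Parameterized metric, e.g. {"nmad": {"nfact": 1.4826}}.
            ((name, params),) = entry.items()
            results[name] = get_scalar_metric(dem, name, **params)
    return results
```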
From @adehecq:
"In fact, the other functions are not specifically implemented because they are available with NumPy (e.g., min, max, std, var, etc.). The interface with NumPy (https://geoutils.readthedocs.io/en/stable/core_array_funcs.html) makes these calculations direct and intuitive.
In this sense, xdem seems more advanced because we have access to any existing metric in NumPy or SciPy."
One more note in addition to @adehecq's that might be relevant here: Xarray has exactly the same NumPy array interface as we have for the Raster object (one can call any `np.func()`, or derivatives like some SciPy functions, on the object, and it knows where to look for the array), so the array interfacing approach will work exactly the same with the planned DEM Xarray accessor.
Usually the critical point when deriving those metrics through an array interface is the behaviour with nodata values:
- For an Xarray object, one needs to force a float32+/NaN array to ensure the nodata values are masked during the array interfacing operation.
- For a Raster object, NumPy masked arrays allow supporting any data type, but the casting of classic NumPy functions to masked arrays is suspicious for a couple of functions. For instance, `np.percentile` cannot be cast to `np.ma.percentile`, which does not exist in the masked array module, and so returns a NaN instead of masking nodata values, while other functions are cast naturally. There are only a few exceptions that do not work well like this. To remedy this, I've added an override for those functions here: https://github.com/GlacioHack/geoutils/blob/main/geoutils/raster/raster.py#L2329. NumPy is working to make masked arrays fully consistent (including being able to call `np.ma.func` on any object that has a masked array, which is impossible right now); see for details: https://github.com/GlacioHack/geoutils/issues/358. Optionally, we could open an issue in NumPy to re-ignite the topic and make those aspects fully consistent.
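A small illustration of the inconsistency described above (exact behaviour depends on the NumPy version):

```python
import numpy as np

# A toy "DEM" with one nodata value under the mask.
arr = np.ma.masked_array([1.0, 2.0, 3.0, -9999.0], mask=[False, False, False, True])

# Functions with a masked-array counterpart respect the mask:
print(np.ma.median(arr))  # 2.0: the nodata value is excluded

# np.ma has no percentile implementation, so the classic function is not
# dispatched to a masked version and the nodata value can corrupt the result:
print(np.percentile(arr, 50))
```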
Overall, I agree with adding functionalities to address certain scalar metrics. Just a few remarks/questions:
- I believe this should sit in geoutils rather than xdem, as it can be generalized to any raster.
- I'm wondering if having a function `geoutils.scalar_metric.mean()` for each metric makes sense. This would add confusion next to the easier alternative of `np.mean(my_raster)`.
- Your example with the mean shows that it would add at least 20 lines of code (because of the overloads) for each metric, whereas the same can be obtained in one line for many of these statistics.
- Because of the two previous points, I'm wondering if we should not just add a method to the Raster class for these statistics, like in pandas, e.g. `my_raster.min()`, `my_raster.nmad()`, etc. Or, if we don't want to add too many methods, they could be gathered in a `my_raster.scalar_metric.min()`, etc. We would get rid of the overloads (because it would only work on the Raster type) and most methods would be one line of code (see the sketch after this list).
- Regarding NMAD: we definitely discussed before that nmad should be moved to a more appropriate location, so that's perfect. The only reason for having the `nfact` argument is probably to calculate the MAD instead of the NMAD, but personally I have never used that argument and don't know of anyone who has, so we could also remove it (and add a MAD metric if relevant).
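A minimal sketch of the pandas-style method approach mentioned above (the method bodies and the use of `get_array_and_mask` are assumptions):

```python
import numpy as np
from geoutils.raster import get_array_and_mask

# Methods as they could be added to the existing geoutils Raster class:
def min(self) -> np.float32:
    """Minimum of the raster, nodata excluded."""
    return np.nanmin(get_array_and_mask(self.data)[0])

def nmad(self) -> np.float32:
    """NMAD of the raster, nodata excluded."""
    data = get_array_and_mask(self.data)[0]
    return 1.4826 * np.nanmedian(np.abs(data - np.nanmedian(data)))
```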
@duboise-cnes, I can't really answer your question, as I didn't understand the meaning of this line:
`"metrics": ["mean", {"ratio_above_threshold": {"elevation_threshold": [1, 2, 3]}}]`
One more note that @adehecq's comment reminded me of, on the MAD:
Adding the MAD to NumPy was already proposed and rejected 5+ years ago (with a not-so-convincing excuse, in my opinion, pointing out the "niche" aspect of robust estimators); they redirected towards a statsmodels implementation at the time. Since then (3 years ago), there is a consistent SciPy implementation, https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.median_abs_deviation.html, and SciPy is already a dependency for many other functionalities.
See details in our issue here: https://github.com/GlacioHack/xdem/issues/349.
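For reference, a quick check that SciPy's estimator matches the `nfact = 1.4826` convention used in xdem (illustrative values):

```python
import numpy as np
from scipy.stats import median_abs_deviation

rng = np.random.default_rng(42)
dh = rng.normal(loc=0, scale=2, size=10_000)

# scale="normal" rescales the MAD to be consistent with the standard deviation,
# i.e. multiplies it by ~1.4826: this is exactly the NMAD.
print(median_abs_deviation(dh, scale="normal"))        # ~2.0
print(1.4826 * np.median(np.abs(dh - np.median(dh))))  # same, up to rounding of nfact
```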
Another approach has been discussed; I am closing this in favor of #567.