pytesmo
pytesmo copied to clipboard
New CDF matching implementation
This is a major rewrite of the CDF matching implementation currently used in pytesmo. It implements a few bugfixes and performance improvements (see #255) and provides a new interface that separates the calculation of the percentile values from the scaling step.
It replaces the deprecated method scaling.cdf_match
with a method that has a similar functionality and interface as scaling.cdf_beta_match
, the currently recommended method (and removes thereby the old implementation of scaling.cdf_match
).
I assume it would be best to also remove the deprecated method lin_cdf_match
, but it's currently still in use in the validation framework.
@wpreimes, @pstradio, what would still be necessary to replace the lin_cdf_match method in the validation framework with the new implementation? I think it would be best if we only provide a single method for CDF matching.
Also, @pstradio, do you miss any other features that prevents you from using this implementation in pytesmo?
Regarding the lin_cdf_match
method, although it is the default in the validation framework, it is overridden in QA4SM (which does not display an option for linear matching anyways). Therefore I don't see any additional requisite to replace the default there.
I don't see that any features are missing (at least for using this in CCI). As Wolfi mentioned in #255, a future addition could be to also add dynamic (seasonal, doy, etc.) methods, but I would leave it to a separate PR.
Does this answer the questions?
Regarding the lin_cdf_match method, although it is the default in the validation framework, it is overridden in QA4SM (which does not display an option for linear matching anyways). Therefore I don't see any additional requisite to replace the default there.
Would it then be possible to remove the lin_cdf_match
method and replace it with the new implementation in the validation framework?
Regarding the lin_cdf_match method, although it is the default in the validation framework, it is overridden in QA4SM (which does not display an option for linear matching anyways). Therefore I don't see any additional requisite to replace the default there.
Would it then be possible to remove the
lin_cdf_match
method and replace it with the new implementation in the validation framework?
Sure. Do you want me to do this in the present PR?
No, I can do it.
Just to be sure: qa4sm uses the cdf_beta_match
method https://github.com/awst-austria/qa4sm/blob/master/validator/models/validation_run.py#L25 --> when changing the name in pytesmo we have to change it in qa4sm as well. I think it would be good to only provide a single cdf matching method (or two, maybe one with linear and one with non-linear interpolation? I remember that there were some issues with the non-linear method especially at the edges, but maybe that is fixed now?)
I thought about changing the default method some time ago, because of the deprecation warnings in pytesmo, but as the cdf_beta_match method changed the scaling results (slightly) in my tests, I thought someone with a better overview should decide :-)
tldr: overall I'm fine with changing it
The implementation I am using at the moment only allows linear interpolation, but this can easily be changed by providing an additional setting for the interpolation spline order.
I ran into a few more issues:
Edge scaling with unequal amounts of data in the source and reference bins
In the edge scaling, the data from the lowest and highest bin for both source (x) and reference (y) are selected. If there are multiple values directly on the bin edges (happens if the data is only on a few discrete levels), the number of data in a bin might differ for x and y.
In the current implementation the first/last n values of both bins are taken, where n is the minimum of both bin sizes. This happens before the data is sorted.
https://github.com/TUW-GEO/pytesmo/blob/8dacef762f39983bd2a0d367973c5b7260a0480f/src/pytesmo/utils.py#L345-L348
In the new implementation, the data are sorted beforehand. This leads to a low/high bias compared to the "random" selection in the current implementation, because this way we make sure to select the n lowest/highest values of each bin. In one test case this even lead to a very flat upper CDF edge, because there were only a few values in the upper x bin, meaning that only a few outliers in y were selected for the linear regression.
~~To overcome these inconsistencies I decided that it might be best to set n = number of data in lowest/highest y-bin, because the edge scaling should scale the edges of the y-CDF.~~
The even better solution would probably resampling of the x data according to its empirical CDF and then perform the linear regression. This is implemented in the new version.
Beta-matching
The old implementation fitted the percentiles for x to a beta distribution in case of non-unique values. The non-unique percentiles in y where removed with an interpolation approach.
In one of the test cases this lead to deviations between the original CDF and the CDF of the matched data (green vs. blue line below). This did not happen in the lin_cdf_match implementation without the beta matching (orange line).
I therefore changed the code to also use the interpolation for non-unique values in the x-percentiles.