pytesmo New CDF matching implementation

New CDF matching implementation

Open s-scherrer opened this issue 2 years ago • 7 comments

This is a major rewrite of the CDF matching implementation currently used in pytesmo. It implements a few bugfixes and performance improvements (see #255) and provides a new interface that separates the calculation of the percentile values from the scaling step.

It replaces the deprecated method scaling.cdf_match with a method that has a similar functionality and interface as scaling.cdf_beta_match, the currently recommended method (and removes thereby the old implementation of scaling.cdf_match).

I assume it would be best to also remove the deprecated method lin_cdf_match, but it's currently still in use in the validation framework.

@wpreimes, @pstradio, what would still be necessary to replace the lin_cdf_match method in the validation framework with the new implementation? I think it would be best if we only provide a single method for CDF matching.

Also, @pstradio, do you miss any other features that prevents you from using this implementation in pytesmo?

Jan 09 '22 16:01 s-scherrer

Regarding the lin_cdf_match method, although it is the default in the validation framework, it is overridden in QA4SM (which does not display an option for linear matching anyways). Therefore I don't see any additional requisite to replace the default there.

I don't see that any features are missing (at least for using this in CCI). As Wolfi mentioned in #255, a future addition could be to also add dynamic (seasonal, doy, etc.) methods, but I would leave it to a separate PR.

Does this answer the questions?

Jan 10 '22 08:01 pstradio

Regarding the lin_cdf_match method, although it is the default in the validation framework, it is overridden in QA4SM (which does not display an option for linear matching anyways). Therefore I don't see any additional requisite to replace the default there.

Would it then be possible to remove the lin_cdf_match method and replace it with the new implementation in the validation framework?

Jan 10 '22 09:01 s-scherrer

Regarding the lin_cdf_match method, although it is the default in the validation framework, it is overridden in QA4SM (which does not display an option for linear matching anyways). Therefore I don't see any additional requisite to replace the default there.

Would it then be possible to remove the lin_cdf_match method and replace it with the new implementation in the validation framework?

Sure. Do you want me to do this in the present PR?

Jan 10 '22 12:01 pstradio

No, I can do it.

Jan 10 '22 12:01 s-scherrer

Just to be sure: qa4sm uses the cdf_beta_match method https://github.com/awst-austria/qa4sm/blob/master/validator/models/validation_run.py#L25 --> when changing the name in pytesmo we have to change it in qa4sm as well. I think it would be good to only provide a single cdf matching method (or two, maybe one with linear and one with non-linear interpolation? I remember that there were some issues with the non-linear method especially at the edges, but maybe that is fixed now?) I thought about changing the default method some time ago, because of the deprecation warnings in pytesmo, but as the cdf_beta_match method changed the scaling results (slightly) in my tests, I thought someone with a better overview should decide :-) tldr: overall I'm fine with changing it

Jan 10 '22 13:01 wpreimes

The implementation I am using at the moment only allows linear interpolation, but this can easily be changed by providing an additional setting for the interpolation spline order.

Jan 10 '22 13:01 s-scherrer

I ran into a few more issues:

Edge scaling with unequal amounts of data in the source and reference bins

In the edge scaling, the data from the lowest and highest bin for both source (x) and reference (y) are selected. If there are multiple values directly on the bin edges (happens if the data is only on a few discrete levels), the number of data in a bin might differ for x and y.

In the current implementation the first/last n values of both bins are taken, where n is the minimum of both bin sizes. This happens before the data is sorted.

https://github.com/TUW-GEO/pytesmo/blob/8dacef762f39983bd2a0d367973c5b7260a0480f/src/pytesmo/utils.py#L345-L348

In the new implementation, the data are sorted beforehand. This leads to a low/high bias compared to the "random" selection in the current implementation, because this way we make sure to select the n lowest/highest values of each bin. In one test case this even lead to a very flat upper CDF edge, because there were only a few values in the upper x bin, meaning that only a few outliers in y were selected for the linear regression.

~~To overcome these inconsistencies I decided that it might be best to set n = number of data in lowest/highest y-bin, because the edge scaling should scale the edges of the y-CDF.~~

The even better solution would probably resampling of the x data according to its empirical CDF and then perform the linear regression. This is implemented in the new version.

Beta-matching

The old implementation fitted the percentiles for x to a beta distribution in case of non-unique values. The non-unique percentiles in y where removed with an interpolation approach.

In one of the test cases this lead to deviations between the original CDF and the CDF of the matched data (green vs. blue line below). This did not happen in the lin_cdf_match implementation without the beta matching (orange line).

I therefore changed the code to also use the interpolation for non-unique values in the x-percentiles.

CDF

Feb 01 '22 16:02 s-scherrer

pytesmo pytesmo copied to clipboard

New CDF matching implementation

Edge scaling with unequal amounts of data in the source and reference bins

Beta-matching

pytesmo
pytesmo copied to clipboard