pytesmo icon indicating copy to clipboard operation
pytesmo copied to clipboard

lin_cdf_match and 'x must be strictly increasing' on SciPy > 1.0

Open awst-baum opened this issue 6 years ago • 16 comments

I've run into ISMN timeseries files that are turned into a series of NaNs by scaling (e.g. ARM/Anthony/ARM_ARM_Anthony_sm_0.025000_0.025000_SMP1-A_19780101_20180712.stm at lon: -98.0969, lat: 37.2134).

This seems to happen in pytesmo.scaling.gen_cdf_match (scaling.py, line 403) because SciPy's InterpolatedUnivariateSpline throws ValueError('x must be strictly increasing'). Pytesmo catches this exception and instead gives a warning: "Too few percentiles for chosen k." (Are those two things really equivalent?). Then pytesmo returns a column of NaNs.

The exception seems to have been introduced in SciPy 1.0 according to this SciPy ticket: https://github.com/scipy/scipy/issues/8535

If I downgrade to scipy 0.19.1 the scaling works (however the metrics calculated later on are still NaNs).

Is there a way to scale these timeseries anyway (and not get just NaNs)? Or are the timeseries just somehow corrupted?

The issue can be reproduced with SciPy 1.1.0 when running:

    def test_strictly_increasing():
        
        n = 1000
        x = np.ones(n) * 0.5
        y = np.ones(n)

        s = scaling.lin_cdf_match(y, x)

awst-baum avatar Jul 25 '18 21:07 awst-baum

This depends on the time series. If there are a lot of identical values then the piecewise linear CDF matching will not work correctly.

In your example the percentiles are

perc_ref
array([ 0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5])
(Pdb) perc_src
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

So there is no sensible way scaling that using CDF matching because there are too few unique percentiles. I would argue that the correct scaling in this case would not be to NaN but to the point 0.5.

There is the function cdf_match which will try to convert the percentiles to unique values before attempting the cdf matching. This might help in your time series but not in the example.

cpaulik avatar Jul 26 '18 06:07 cpaulik

OK, some figures on this: the lin_cdf_match issue occurs for 263 out of 1780 (15%) of ISMN stations in the US from 1978-now (the dataset that Tracy picked for the C3S usecase we're working on). That's quite a lot, isn't it?

The lin_cdf_match percentiles look something like this: [-0.03 0.06 0.06 0.07 0.09 0.1 0.12 0.16 0.23]

I've also tried using cdf_match for the same data and run into a different but probably related issue: InterpolatedUnivariateSpline in gen_cdf_match (scaling.py:402) now throws

error: (xb<=x[0]) failed for 2nd keyword xb: fpcurf0:xb=-nan

because perc_src now looks like this: [ nan nan nan nan nan 0.0501\n 0.0501 0.0501 ...

To be honest, I'm way out of my depth here... ;-)

awst-baum avatar Jul 26 '18 10:07 awst-baum

A few general questions/comments:

  • Does your original time series still contain NaN values?
  • Does from pytesmo.utils import unique_percentiles_interpolate in cdf_match introduce the nan values in the percentiles or are they in there after np.percentile?
  • Are you using the validation framework? If so then I think it should drop nan values before doing any scaling. See https://github.com/TUW-GEO/pytesmo/blob/master/pytesmo/validation_framework/validation.py#L483
  • If not and you want to use any kind of CDF matching then you have to make sure that the percentiles are unique. Otherwise it will not work.

cpaulik avatar Jul 26 '18 11:07 cpaulik

Does your original time series still contain NaN values?

Doesn't seem so. data.isnull().values.any() returns False for at least some of the timeseries that have the problem.

Does from pytesmo.utils import unique_percentiles_interpolate in cdf_match introduce the nan values in the percentiles

Indeed, it does in my tests (with scipy 0.19.1).

Are you using the validation framework?

Yes - and NaNs in the input data don't seem to be the problem, see above.

Should we add one of the affected ISMN files to the test-data?

awst-baum avatar Jul 26 '18 13:07 awst-baum

Does from pytesmo.utils import unique_percentiles_interpolate in cdf_match introduce the nan values in the percentiles

Indeed, it does in my tests (with scipy 0.19.1).

Then we should add a test for this function with the original percentiles that cause this problem. No need for the whole timeseries in the testdata yet.

We have three ways of making sure that the percentiles are unique:

So the best start would be to add the problematic percentiles to the tests for these three functions.

cpaulik avatar Jul 26 '18 13:07 cpaulik

OK!

I'll take a look a this after I'm through the latest proposal drive.

In the meantime, so that I don't forget: the percentiles for ARM_ARM_Anthony_sm_0.025000_0.025000_SMP1-B_19780101_20180712.stm are in the attached textfile.

perc_src.zip

awst-baum avatar Jul 26 '18 14:07 awst-baum

Something strange seems to be going on here. Normally you should only have percentile values for the percentiles [0, 5, 10, 30, 50, 70, 90, 95, 100]. So 9 values. Your zip file has many more.

cpaulik avatar Jul 26 '18 14:07 cpaulik

True.

Just to be clear: I'm talking about cdf_match [scaling.py:344] :

percentiles = np.linspace(0, 100, nbins)
perc_src = np.array(np.percentile(src, percentiles))

nbins is the default 100, so len(percentiles) is 100 and len(perc_src) is also 100.

What's in the textfile is perc_src.

awst-baum avatar Jul 26 '18 16:07 awst-baum

Attaching the offending (reference data) file: data.zip

awst-baum avatar Aug 08 '18 14:08 awst-baum

A unit test that reproduce the problem:

def test_cdf_nan():
    from os import path
    testdata_path = '/space/projects/qa4sm/data/testdata'
    data_file = path.join(testdata_path, 'thedata.txt')
    ref_file = path.join(testdata_path, 'theref.txt')
    the_data = np.loadtxt(data_file)
    the_ref = np.loadtxt(ref_file)

    from pytesmo.scaling import cdf_match
    result = cdf_match(the_data, the_ref)

    print(result)
    
    assert(not np.isnan(result).any())

The necessary data: unittest.zip

awst-baum avatar Aug 08 '18 15:08 awst-baum

I just tested your unittest with my default environment (scipy-0.16.0) and there weren't any nans in result. So there seems to be scipy versions where it a) works (scipy-0.16.0), b) returns nans (scipy-0.19?) and c) returns an error (scipy-1.0).

But also some other module might be the problem. Could you upload a file that specifies your installed modules? E.g. with

conda env export > environment.txt

Then I can install your environment and do some debugging.

Attached is my output , but as a warning, it has way more modules installed than necessary for the above test.

Lmoesinger avatar Aug 23 '18 13:08 Lmoesinger

Thanks for looking at this, greatly appreciated! :)

Here's my environment environment.txt which may also contain more than necessary - plus some installs from non-public EODC GitLab repositories...

Perhaps we should identify a minimum environment for running the test and for reproducing the error.

awst-baum avatar Aug 23 '18 15:08 awst-baum

Both new pytesmo and scipy contribute to this problem:

1) pytesmo>=0.6.8 problem:

scaling.cdf_match() works like this: a) get the values of the percentiles, perc_src b) remove duplicate perc_src and then fill in the removed values by interpolation. But here is the problem: If the differences between the perc_src are very small, the interpolated values might again be duplicates. This then breaks subsequent matching.

IMHO there are two ways to solve this: a) Go back to how it is done in pytesmo 0.6.7 (don´t interpolate the removed percentiles). In very specific cases this might lead to a poor fit tough. b) Use a threshold while determining what is a duplicate value. np.unique() is very susceptible to floating point errors. But the threshold would have to be set with care (probably dependent on relative difference and largest duplicate streak)

But I'm not the one to make that decision.

2) scipy>=1.0 problem:

(occuring for pytesmo 0.6.7, can´t test with current pytesmo because of above problem):

scipy.interpolate.InterpolatedUnivariateSpline breaks if x is not ascending. For some reason perc_src is not always ascending, even if there are no duplicate values in it. Should be fixed with a simple np.sort() in line 351 (current version) in scaling.py:

return gen_cdf_match(src, np.sort(perc_src), np.sort(perc_ref),
                     min_val=min_val, max_val=max_val,
                     k=5)

@awst-baum It seems highly unlikely that real floating point data has that many duplicates - something else seems to have gone wrong in an earlier processing step, I'd suggest looking into that. But in the meanwhile you could use an older pythesmo (0.6.7) and scipy (<1.0) version.

Lmoesinger avatar Aug 24 '18 14:08 Lmoesinger

@Lmoesinger Thanks for the analysis! I guess I have to discuss this with Tracy to figure out what to do.

awst-baum avatar Aug 24 '18 14:08 awst-baum

@Lmoesinger now that you've finished investigating CDF matching, would you be able to comment further on this issue?

tracyscanlon avatar Jul 15 '19 13:07 tracyscanlon

Has been a long time, my memory is a bit fuzzy on this subject, but we need to discuss this point in person: Removal of duplicate percentiles. Id argue that you can only get those if a) there are very few values given the number of percentile bins or b) your data is "bad": Natural data usually does not consist of more than 10% of duplicate (floating point) values. As such I'd generally rather suggest to return an error with a meaningful message rather than not telling the user that either a) he should increase his bin size or b) should check his data.

Lmoesinger avatar Jul 15 '19 13:07 Lmoesinger