lin_cdf_match and 'x must be strictly increasing' on SciPy > 1.0
I've run into ISMN timeseries files that are turned into a series of NaNs by scaling (e.g. ARM/Anthony/ARM_ARM_Anthony_sm_0.025000_0.025000_SMP1-A_19780101_20180712.stm at lon: -98.0969, lat: 37.2134).
This seems to happen in `pytesmo.scaling.gen_cdf_match` (scaling.py, line 403) because SciPy's `InterpolatedUnivariateSpline` throws `ValueError('x must be strictly increasing')`. Pytesmo catches this exception and instead gives a warning: "Too few percentiles for chosen k." (Are those two things really equivalent?) Then pytesmo returns a column of NaNs.
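For reference, a minimal sketch of what presumably happens inside `gen_cdf_match` (the duplicate knot values are my assumption, not pytesmo's actual call):

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

# On SciPy >= 1.0, non-strictly-increasing knots are rejected outright.
x = np.full(9, 0.5)        # duplicate percentile values used as knots
y = np.linspace(0, 100, 9)
try:
    InterpolatedUnivariateSpline(x, y, k=5)
except ValueError as e:
    print(e)               # x must be strictly increasing
```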
The exception seems to have been introduced in SciPy 1.0 according to this SciPy ticket: https://github.com/scipy/scipy/issues/8535
If I downgrade to scipy 0.19.1 the scaling works (however the metrics calculated later on are still NaNs).
Is there a way to scale these timeseries anyway (and not get just NaNs)? Or are the timeseries just somehow corrupted?
The issue can be reproduced with SciPy 1.1.0 when running:
```python
import numpy as np
from pytesmo import scaling

def test_strictly_increasing():
    # a constant series: every percentile value is identical
    n = 1000
    x = np.ones(n) * 0.5
    y = np.ones(n)
    s = scaling.lin_cdf_match(y, x)
```
This depends on the time series. If there are a lot of identical values then the piecewise linear CDF matching will not work correctly.
In your example the percentiles are

```
(Pdb) perc_ref
array([ 0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5])
(Pdb) perc_src
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
```

So there is no sensible way of scaling that using CDF matching because there are too few unique percentiles. I would argue that the correct scaling in this case would not be to NaN but to the point 0.5.
There is the function `cdf_match` which will try to convert the percentiles to unique values before attempting the CDF matching. This might help in your time series but not in the example.
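For illustration, a hedged sketch of that idea (not pytesmo's exact implementation; `unique_percentiles_sketch` is a hypothetical helper): drop duplicate percentile values and re-fill the dropped levels by interpolation.

```python
import numpy as np

def unique_percentiles_sketch(percentiles, perc_values):
    """Drop duplicate percentile values, then re-fill them by interpolation.

    percentiles : ascending percentile levels, e.g. np.linspace(0, 100, 9)
    perc_values : np.percentile(data, percentiles), ascending by construction
    """
    percentiles = np.asarray(percentiles, dtype=float)
    perc_values = np.asarray(perc_values, dtype=float)
    # indices of the first occurrence of each unique value
    _, keep = np.unique(perc_values, return_index=True)
    if keep.size < 2:
        raise ValueError("too few unique percentile values")
    # linear interpolation over the percentile levels fills the gaps, but for
    # near-identical values the result may still contain duplicates
    return np.interp(percentiles, percentiles[keep], perc_values[keep])
```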
OK, some figures on this: the `lin_cdf_match` issue occurs for 263 out of 1780 (15%) ISMN stations in the US from 1978 to now (the dataset that Tracy picked for the C3S usecase we're working on). That's quite a lot, isn't it?
The `lin_cdf_match` percentiles look something like this: `[-0.03 0.06 0.06 0.07 0.09 0.1 0.12 0.16 0.23]`
I've also tried using `cdf_match` for the same data and run into a different but probably related issue: `InterpolatedUnivariateSpline` in `gen_cdf_match` (scaling.py:402) now throws

```
error: (xb<=x[0]) failed for 2nd keyword xb: fpcurf0:xb=-nan
```

because `perc_src` now looks like this: `[nan nan nan nan nan 0.0501 0.0501 0.0501 ...`
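My guess at the mechanism (a sketch with made-up knot values, so treat it as an assumption): NaNs among the spline knots make `InterpolatedUnivariateSpline` fail, and depending on the SciPy version this surfaces either as the low-level `fpcurf0` error quoted above or as the `x must be strictly increasing` `ValueError`.

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

x = np.array([np.nan, np.nan, 0.0501, 0.0501, 0.06, 0.07, 0.09])
y = np.linspace(0, 100, 7)
try:
    InterpolatedUnivariateSpline(x, y, k=1)
except Exception as e:
    print(type(e).__name__, e)
```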
To be honest, I'm way out of my depth here... ;-)
A few general questions/comments:
- Does your original time series still contain NaN values?
- Does `from pytesmo.utils import unique_percentiles_interpolate` in `cdf_match` introduce the `nan` values in the percentiles, or are they in there after `np.percentile`?
- Are you using the validation framework? If so then I think it should drop NaN values before doing any scaling. See https://github.com/TUW-GEO/pytesmo/blob/master/pytesmo/validation_framework/validation.py#L483
- If not, and you want to use any kind of CDF matching, then you have to make sure that the percentiles are unique. Otherwise it will not work.
> Does your original time series still contain NaN values?

Doesn't seem so. `data.isnull().values.any()` returns False for at least some of the timeseries that have the problem.
> Does `from pytesmo.utils import unique_percentiles_interpolate` in `cdf_match` introduce the `nan` values in the percentiles?

Indeed, it does in my tests (with scipy 0.19.1).
> Are you using the validation framework?

Yes - and NaNs in the input data don't seem to be the problem, see above.
Should we add one of the affected ISMN files to the test-data?
> Does `from pytesmo.utils import unique_percentiles_interpolate` in `cdf_match` introduce the `nan` values in the percentiles?
>
> Indeed, it does in my tests (with scipy 0.19.1).
Then we should add a test for this function with the original percentiles that cause this problem. No need for the whole timeseries in the testdata yet.
We have three ways of making sure that the percentiles are unique, so the best start would be to add the problematic percentiles to the tests for these three functions.
OK!
I'll take a look at this after I'm through the latest proposal drive.
In the meantime, so that I don't forget: the percentiles for ARM_ARM_Anthony_sm_0.025000_0.025000_SMP1-B_19780101_20180712.stm are in the attached textfile.
Something strange seems to be going on here. Normally you should only have percentile values for the percentiles `[0, 5, 10, 30, 50, 70, 90, 95, 100]`, so 9 values. Your zip file has many more.
True.
Just to be clear, I'm talking about `cdf_match` (scaling.py:344):

```python
percentiles = np.linspace(0, 100, nbins)
perc_src = np.array(np.percentile(src, percentiles))
```

`nbins` is the default 100, so `len(percentiles)` is 100 and `len(perc_src)` is also 100. What's in the textfile is `perc_src`.
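To illustrate why 100 percentile levels can still contain many duplicate values (synthetic data; the two-decimal quantization is just an assumption about what the sensor readings look like):

```python
import numpy as np

np.random.seed(0)
# heavily quantized values, e.g. readings rounded to two decimals
src = np.round(np.random.uniform(0, 0.3, 5000), 2)
percentiles = np.linspace(0, 100, 100)
perc_src = np.array(np.percentile(src, percentiles))
print(len(perc_src), len(np.unique(perc_src)))  # 100 vs. far fewer unique values
```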
Attaching the offending (reference data) file: data.zip
A unit test that reproduces the problem:

```python
import numpy as np
from os import path
from pytesmo.scaling import cdf_match

def test_cdf_nan():
    testdata_path = '/space/projects/qa4sm/data/testdata'
    data_file = path.join(testdata_path, 'thedata.txt')
    ref_file = path.join(testdata_path, 'theref.txt')
    the_data = np.loadtxt(data_file)
    the_ref = np.loadtxt(ref_file)
    result = cdf_match(the_data, the_ref)
    print(result)
    assert not np.isnan(result).any()
```
The necessary data: unittest.zip
I just tested your unittest with my default environment (scipy-0.16.0) and there weren't any NaNs in `result`. So there seem to be scipy versions where it a) works (scipy-0.16.0), b) returns NaNs (scipy-0.19?) and c) raises an error (scipy-1.0).
But some other module might also be the problem. Could you upload a file that specifies your installed modules, e.g. with `conda env export > environment.txt`? Then I can install your environment and do some debugging.
Attached is my output, but as a warning: it has way more modules installed than necessary for the above test.
Thanks for looking at this, greatly appreciated! :)
Here's my environment environment.txt, which may also contain more than necessary, plus some installs from non-public EODC GitLab repositories...
Perhaps we should identify a minimum environment for running the test and for reproducing the error.
Both new pytesmo and scipy contribute to this problem:

1) pytesmo>=0.6.8 problem:

`scaling.cdf_match()` works like this:

a) get the values of the percentiles, `perc_src`
b) remove duplicate `perc_src` values and then fill in the removed values by interpolation. But here is the problem: if the differences between the `perc_src` values are very small, the interpolated values might again be duplicates. This then breaks the subsequent matching.

IMHO there are two ways to solve this:

a) Go back to how it is done in pytesmo 0.6.7 (don't interpolate the removed percentiles). In very specific cases this might lead to a poor fit though.
b) Use a threshold while determining what is a duplicate value; `np.unique()` is very susceptible to floating point errors. But the threshold would have to be set with care (probably dependent on the relative difference and the largest duplicate streak); see the sketch after this list.

But I'm not the one to make that decision.
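A hedged sketch of option b), assuming a simple relative tolerance (`unique_with_tolerance` is a hypothetical helper, not part of pytesmo; picking the right threshold is exactly the open question above):

```python
import numpy as np

def unique_with_tolerance(values, rtol=1e-9):
    """Indices of values to keep, treating near-equal neighbours as duplicates."""
    values = np.asarray(values, dtype=float)
    keep = [0]
    for i in range(1, values.size):
        # keep a value only if it is clearly larger than the last kept one
        if values[i] - values[keep[-1]] > rtol * max(abs(values[i]), 1.0):
            keep.append(i)
    return np.array(keep)
```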
2) scipy>=1.0 problem:

(occurring for pytesmo 0.6.7; can't test with the current pytesmo because of the above problem)

`scipy.interpolate.InterpolatedUnivariateSpline` breaks if x is not ascending. For some reason `perc_src` is not always ascending, even if there are no duplicate values in it. This should be fixed with a simple `np.sort()` in line 351 (current version) of scaling.py:

```python
return gen_cdf_match(src, np.sort(perc_src), np.sort(perc_ref),
                     min_val=min_val, max_val=max_val,
                     k=5)
```
@awst-baum It seems highly unlikely that real floating point data has that many duplicates - something else seems to have gone wrong in an earlier processing step, and I'd suggest looking into that. But in the meantime you could use an older pytesmo (0.6.7) and scipy (<1.0) version.
@Lmoesinger Thanks for the analysis! I guess I have to discuss this with Tracy to figure out what to do.
@Lmoesinger now that you've finished investigating CDF matching, would you be able to comment further on this issue?
It has been a long time and my memory is a bit fuzzy on this subject, but we need to discuss this point in person: the removal of duplicate percentiles. I'd argue that you can only get those if a) there are very few values given the number of percentile bins or b) your data is "bad": natural data usually does not consist of more than 10% duplicate (floating point) values. As such I'd generally suggest returning an error with a meaningful message telling the user to either a) increase the bin size or b) check the data, rather than failing silently.
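A hedged sketch of what such a check could look like (`check_percentiles` and the 10% cut-off are illustrative assumptions, echoing the duplicate fraction mentioned above):

```python
import numpy as np

def check_percentiles(perc_src, max_duplicate_frac=0.1):
    """Fail loudly instead of letting NaNs propagate downstream."""
    n_unique = np.unique(perc_src).size
    if n_unique < (1 - max_duplicate_frac) * len(perc_src):
        raise ValueError(
            "Too many duplicate percentile values: increase the bin size "
            "(use fewer percentile bins) or check the input data."
        )
```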