patsy
rounding-off comparison with B-spline's bounds
This one's a minor fix that makes the float comparison with the B-spline bounds approximate instead of exact.
Using `isclose` like this makes me nervous – its definition of "close" is pretty arbitrary, which is fine in a test suite, but dangerous when embedded inside a library like this.
Can you give some more details about how you ran into this and why you need it?
It's the classic float-comparison error that prompted me to push this change while using the `dmatrices` API. The error said:
```
y, X = dmatrices(formula, data, return_type='dataframe')
  File "/usr/local/lib/python3.5/site-packages/patsy/highlevel.py", line 310, in dmatrices
    NA_action, return_type)
  File "/usr/local/lib/python3.5/site-packages/patsy/highlevel.py", line 165, in _do_highlevel_design
    NA_action)
  File "/usr/local/lib/python3.5/site-packages/patsy/highlevel.py", line 70, in _try_incr_builders
    NA_action)
  File "/usr/local/lib/python3.5/site-packages/patsy/build.py", line 689, in design_matrix_builders
    factor_states = _factors_memorize(all_factors, data_iter_maker, eval_env)
  File "/usr/local/lib/python3.5/site-packages/patsy/build.py", line 370, in _factors_memorize
    factor.memorize_finish(factor_states[factor], which_pass)
  File "/usr/local/lib/python3.5/site-packages/patsy/eval.py", line 561, in memorize_finish
    state["transforms"][obj_name].memorize_finish()
  File "/usr/local/lib/python3.5/site-packages/patsy/splines.py", line 223, in memorize_finish
    lower_bound))
ValueError: some knot values ([ 60866.115]) fall below lower bound (60866.114999999998)
```
Now clearly `60866.115 > 60866.114999999998`, but you see that error anyway just because of floats.
I see you make a valid point that `isclose` is arbitrary, but the other option I thought of was `np.any(inner_knots - lower_bound < -1 * SMALL_POSITIVE_CONSTANT)`, and that again requires you to define this `SMALL_POSITIVE_CONSTANT`, or to let the user define it as an argument. If we do go by the latter route, `isclose` can take arguments `rtol` and `atol` to that effect. What do you think?
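For illustration, here is a sketch of what the tolerance-based check being discussed could look like. The names `inner_knots` and `lower_bound` follow the traceback above, but the knot value is simulated with `np.nextafter` and the `rtol` is an arbitrary placeholder, not a recommendation:

```python
import numpy as np

lower_bound = 60866.115
# simulate a knot sitting one ulp below the bound, the kind of rounding
# artifact shown in the traceback above
inner_knots = np.array([np.nextafter(lower_bound, -np.inf)])

# the exact comparison rejects the knot...
assert np.any(inner_knots < lower_bound)
# ...while a tolerance-based check (rtol chosen purely for illustration)
# treats it as in-bounds
ok = (inner_knots >= lower_bound) | np.isclose(inner_knots, lower_bound,
                                               rtol=1e-9, atol=0.0)
assert np.all(ok)
```

Whether via `isclose` or a hand-rolled `SMALL_POSITIVE_CONSTANT`, some tolerance has to be picked, which is exactly the reviewer's concern.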
Are you using the default knots and bounds? Are you able to share the data that's triggering this?
Yes, the knots and bounds used are both the defaults. However, `df`, `degree`, and `include_intercept` are passed on to `BS`. Unfortunately I can't share the data. I confirm that with the comparison done in either of the specified ways the error went away – that means the `inner_knots` were just marginally greater than the lower bound.
Looking at the code, it appears that what we're doing is comparing `np.min(data)` to some percentiles of `data`. It's extremely weird that `np.percentile` is returning a value that is smaller than `np.min(data)`, especially since we're using non-zero percentiles.
Can you try adding

```python
print(inner_knots)
print(knot_quantiles)
```

to `BS.memorize_finish` just before the error, and paste what you get?
It's actually not returning a smaller value – you can see that the knot reported in the output above is just greater than the bound. Nonetheless, I will paste these values shortly if I can reproduce it.
Ah, no, I missed that – but it's just an issue with numpy's array printing rounding off the output. Floats do follow rules; you don't just randomly get `a < b` returning the wrong result because `a` is only a little bit bigger than `b` :-).
However, this means that my suggestion of `print`ing the values isn't going to produce any useful output either, because numpy will round them off for display again :-(. So actually, can you try:

```python
np.set_printoptions(precision=20)
print(inner_knots)
print(knot_quantiles)
```

?
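To show why this matters, a quick sketch of the display rounding in question (the value is made up with `np.nextafter` to mimic the error above):

```python
import numpy as np

# a value one ulp below 60866.115 still *displays* as 60866.115
knot = np.array([np.nextafter(60866.115, -np.inf)])

assert "60866.115" in np.array2string(knot)  # default precision rounds it away
assert knot[0] != 60866.115                  # but the stored double really differs
print(np.array2string(knot, precision=20))   # higher precision reveals the true value
```

So the `[ 60866.115]` in the original error message and the `60866.114999999998` bound can easily be different doubles that merely print alike.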
Sorry, I can't seem to get the data back again. Floats returning wrong results won't always happen for you; I think it depends on what values you are dealing with (the binary conversion has to kick in).
```python
In [29]: float(60866.114999999999) > float(60866.114999999998)
Out[29]: False

In [30]: float(1.222222222223) > float(1.222222222222)
Out[30]: True
```
I guess I will wait until I can reproduce the issue again and get back.
> ```
> In [29]: float(60866.114999999999) > float(60866.114999999998)
> Out[29]: False
> ```
That's because these two values got rounded to be `==`. But this error wouldn't happen if `np.percentile(data, x) == np.min(data)`; for the error we have to actually have `np.percentile(data, x) < np.min(data)`.
https://github.com/numpy/numpy/issues/10373 gives an example of a case where this could happen.
I'm thinking the right solution here is to clip the auto-generated knots to fall inside the lower/upper bounds.
I doubt that clipping or any absolute-comparison method would help (otherwise we wouldn't get an incorrect percentile in the first place). I think at the end of the day it boils down to having an approximate comparison somewhere.
We probably get the incorrect percentile because of some rounding error when the `percentile` code tries to make a weighted average between two entries in your data. E.g., if you only have 50 data points, then the smallest is the 0th percentile and the second-smallest is (roughly) the 2nd percentile, and if we do `np.percentile(data, 1)` to ask for the 1st percentile then there is no entry in the data that corresponds to it, so it has to approximate by averaging together the two closest points. This might happen even when those two values are actually the same, as in the numpy bug report linked above.
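To make the interpolation concrete, here is a toy sketch with made-up data, relying on `np.percentile`'s default linear interpolation:

```python
import numpy as np

# 50 sorted data points; for concreteness just 0..49
data = np.arange(50, dtype=float)

# the q-th percentile sits at fractional index q/100 * (n - 1);
# for q = 1 and n = 50 that's index 0.49, between data[0] and data[1],
# so numpy returns the weighted average data[0] + 0.49 * (data[1] - data[0])
p1 = np.percentile(data, 1)
assert np.isclose(p1, data[0] + 0.49 * (data[1] - data[0]))
```

It is presumably this weighted-average arithmetic where rounding can, in rare cases, land just outside the interval spanned by the two data points, as the linked numpy issue describes.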
You seem to be under the impression that floating point math is all about vague clouds of points and impossible-to-predict results. This is a common simplification, but it's not true :-). Each floating point value represents a specific real number. When you perform some operation like multiplication or addition (like when computing a weighted average), there is always an exact mathematical answer which is also a real number... it's just that it might not be one of the real numbers that can be written down as a floating point number, so the operation instead picks some nearby number that is representable. But then that result is represented exactly. Comparisons are performed exactly on the given values. Etc.
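For instance (standard IEEE 754 double-precision behaviour, nothing patsy-specific):

```python
# each operation rounds its exact mathematical result to the nearest
# representable double, deterministically, not randomly
x = 0.1 + 0.2
assert x != 0.3                     # the exact sum is not representable as a double
assert x == 0.30000000000000004     # but the rounded result is always this exact value
assert (0.1 + 0.2) == (0.1 + 0.2)   # and comparisons on doubles are exact
```

The surprise is in the rounding of operations, never in the comparison itself.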
If we used something like `np.clip` to nudge the knots into the range of the lower/upper bounds, then they'll stay there :-)
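A sketch of that clipping idea (the variable names mirror the traceback above; the knot values themselves are invented):

```python
import numpy as np

lower_bound, upper_bound = 60866.115, 61000.0
# one knot has drifted an ulp below the lower bound
inner_knots = np.array([np.nextafter(lower_bound, -np.inf), 60900.0])

# nudge every knot back inside [lower_bound, upper_bound]
inner_knots = np.clip(inner_knots, lower_bound, upper_bound)
assert np.all(inner_knots >= lower_bound)
assert np.all(inner_knots <= upper_bound)
```

Unlike a tolerance, clipping never needs a magic constant: once clipped, the exact bounds comparison holds by construction.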
Just to mention that I have seen the same exception a few times. My guess was that there are a few ulps of difference in the floating-point computations. AFAIR it doesn't happen with "nice" numbers, but it happened with some simulated sin/cos data. Unfortunately I can't find my statsmodels comments for this right now. https://github.com/statsmodels/statsmodels/pull/5296 (which has too many comments, too many issues, and too many moving parts.)
What I did as workaround was to set the boundary points so that they are 1e-15 (or something like that) outside of the data interval defined by min and max.
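Something along these lines – the helper name and the exact margin are my own, but a relative pad of 1e-15 is roughly an order of magnitude above one ulp for doubles, so it absorbs the rounding noise discussed above:

```python
import numpy as np

def widened_bounds(x, rel=1e-15):
    # hypothetical workaround: place the boundary knots just outside
    # the [min, max] interval of the data
    lo, hi = float(np.min(x)), float(np.max(x))
    pad = rel * max(abs(lo), abs(hi), 1.0)
    return lo - pad, hi + pad

data = np.array([60866.115, 60900.0, 61000.0])
lo, hi = widened_bounds(data)
assert lo < data.min() and hi > data.max()
```

This sidesteps the percentile rounding entirely by making the bounds strictly looser than the data.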
fyi: This is not an issue in formulaic, which allows extrapolation of the basis splines beyond the boundary knots when so configured. It's likely that this patch would work fine, but I would have to look more closely at the code to see. I also note that patsy's implementation has some weird behaviour on the boundaries anyway (I haven't looked into whether it is worth fixing).
Let me know if you are still interested in getting this merged into patsy!