
rounding-off comparison with B-spline's bounds

Open arbazkhan002 opened this issue 6 years ago • 14 comments

This one's a minor fix that switches the bounds check to an approximate comparison for floats.

arbazkhan002 avatar Jan 09 '18 01:01 arbazkhan002

Using isclose like this makes me nervous – its definition of "close" is pretty arbitrary, which is fine in a test suite, but dangerous when embedded inside a library like this.
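For reference, np.isclose defaults to rtol=1e-05 and atol=1e-08, so what counts as "close" depends entirely on the scale of the inputs:

import numpy as np

# With the defaults (rtol=1e-05, atol=1e-08), "close" is scale-dependent:
np.isclose(1e10, 1.00001e10)  # True: a gap of 1e5 passes at this magnitude
np.isclose(1e-8, 0.0)         # True: atol alone absorbs the gap
np.isclose(1e-7, 0.0)         # False: the same kind of gap fails near zero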

Can you give some more details about how you ran into this and why you need it?

njsmith avatar Jan 10 '18 22:01 njsmith

It's the classic float-comparison error, hit while using the dmatrices API, that prompted me to push this change. The error said:

y, X = dmatrices(formula, data, return_type='dataframe')
  File "/usr/local/lib/python3.5/site-packages/patsy/highlevel.py", line 310, in dmatrices
    NA_action, return_type)
  File "/usr/local/lib/python3.5/site-packages/patsy/highlevel.py", line 165, in _do_highlevel_design
    NA_action)
  File "/usr/local/lib/python3.5/site-packages/patsy/highlevel.py", line 70, in _try_incr_builders
    NA_action)
  File "/usr/local/lib/python3.5/site-packages/patsy/build.py", line 689, in design_matrix_builders
    factor_states = _factors_memorize(all_factors, data_iter_maker, eval_env)
  File "/usr/local/lib/python3.5/site-packages/patsy/build.py", line 370, in _factors_memorize
    factor.memorize_finish(factor_states[factor], which_pass)
  File "/usr/local/lib/python3.5/site-packages/patsy/eval.py", line 561, in memorize_finish
    state["transforms"][obj_name].memorize_finish()
  File "/usr/local/lib/python3.5/site-packages/patsy/splines.py", line 223, in memorize_finish
    lower_bound))
ValueError: some knot values ([ 60866.115]) fall below lower bound (60866.114999999998)

Now clearly 60866.115 > 60866.114999999998, but you see that error anyway, just because of floats.

I see you make a valid point that isclose is arbitrary. The other option I thought of was np.any(inner_knots - lower_bound < -1 * SMALL_POSITIVE_CONSTANT), but that again requires you to define this SMALL_POSITIVE_CONSTANT, or to let the user define it as an argument. If we do go by the latter route, isclose can take rtol and atol arguments to that effect. What do you think?
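A sketch of the two options with toy values (SMALL_POSITIVE_CONSTANT is a name I'm making up here, not something in patsy):

import numpy as np

lower_bound = 1.0
inner_knots = np.array([1.0 - 1e-13, 1.5])  # first knot a hair below the bound

# Option 1: tolerate a tiny undershoot via isclose (default tolerances shown)
bad = (inner_knots < lower_bound) & ~np.isclose(inner_knots, lower_bound)

# Option 2: explicit absolute slack; the constant has to come from somewhere
SMALL_POSITIVE_CONSTANT = 1e-10
bad_alt = inner_knots - lower_bound < -1 * SMALL_POSITIVE_CONSTANT

assert not bad.any() and not bad_alt.any()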

arbazkhan002 avatar Jan 10 '18 22:01 arbazkhan002

Are you using the default knots and bounds? Are you able to share the data that's triggering this?

njsmith avatar Jan 10 '18 23:01 njsmith

Yes, the knots and bounds used are both defaults; however, df, degree, and include_intercept are passed on to BS. Unfortunately I can't share the data. I can confirm that with the comparison done in either of the ways described above, the error went away; that means the inner_knots were just marginally greater than the lower bound.

arbazkhan002 avatar Jan 11 '18 00:01 arbazkhan002

Looking at the code, it appears that what we're doing is comparing np.min(data) to some percentiles of data. It's extremely weird that np.percentile is returning a value that is smaller than np.min(data), especially since we're using non-zero percentiles.

Can you try adding

print(inner_knots)
print(knot_quantiles)

to BS.memorize_finish just before the error, and paste what you get?

njsmith avatar Jan 11 '18 01:01 njsmith

It's actually not returning a smaller value; you can see that the knot reported in the output above is just greater than the bound. Nonetheless, I will paste these values shortly if I can reproduce it.

arbazkhan002 avatar Jan 11 '18 01:01 arbazkhan002

Ah, no, I missed that, but it's just an issue with the numpy array printing rounding off the output. Floats do follow rules; you don't just randomly get a < b returning the wrong result because a is only a little bit bigger than b :-).

However, this means that my suggestion of printing the value isn't going to produce any useful output either, because numpy will round them off for display again :-(. So actually, can you try:

np.set_printoptions(precision=20)
print(inner_knots)
print(knot_quantiles)

?

njsmith avatar Jan 11 '18 01:01 njsmith

Sorry, I can't seem to get the data back again. Floats returning wrong results won't always happen for you; I think it depends on what values you are dealing with (the binary conversion has to kick in).

In [29]: float(60866.114999999999) > float(60866.114999999998)
Out[29]: False

In [30]: float(1.222222222223)>float(1.222222222222)
Out[30]: True

I guess I will wait till I can reproduce the issue again and get back.

arbazkhan002 avatar Jan 11 '18 02:01 arbazkhan002

In [29]: float(60866.114999999999) > float(60866.114999999998)
Out[29]: False

That's because these two values got rounded to be ==. But this error won't happen if np.percentile(data, x) == np.min(data); we would have to actually have np.percentile(data, x) < np.min(data).
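You can check the exact values the literals become using only the stdlib:

from decimal import Decimal

# Both decimal literals round to the same binary64 double (one ULP at this
# magnitude is about 7e-12, larger than the 1e-12 gap between the literals),
# and Decimal shows the exact value that double represents:
float(60866.114999999999) == float(60866.114999999998)  # True
print(Decimal(60866.114999999998))  # the exact stored value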

njsmith avatar Jan 11 '18 03:01 njsmith

https://github.com/numpy/numpy/issues/10373 gives an example of a case where this could happen.

I'm thinking the right solution here is to clip the auto-generated knots to fall inside the lower/upper bounds.

njsmith avatar Jan 12 '18 01:01 njsmith

I doubt that clipping or any absolute comparison method would help (otherwise we wouldn't get an incorrect percentile in the first place). I think at the end of the day it boils down to having an approximate comparison somewhere.

arbazkhan002 avatar Jan 14 '18 20:01 arbazkhan002

We probably get the incorrect percentile because of some rounding error when the percentile code tries to make a weighted average between two entries in your data. E.g. if you only have 50 data points, then the smallest is the 0th percentile, and the second-smallest is the 2nd percentile, and if we do np.percentile(data, 1) to ask for the 1st percentile then... there is no entry in the data that corresponds to that, so it has to approximate it by averaging together the two closest points. This might happen when those two values are actually the same, as in the numpy bug report linked above.
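As a toy illustration of that failure mode (this is not patsy's or numpy's actual code):

import numpy as np

# The textbook weighted average a*(1-g) + b*g is not guaranteed to stay
# inside [min(a, b), max(a, b)], even when a == b, because each product
# rounds separately. Search for a case where it undershoots:
rng = np.random.default_rng(0)
for _ in range(1_000_000):
    a = rng.uniform(1e4, 1e5)    # one data value, used twice
    g = rng.uniform(0.0, 1.0)    # interpolation weight
    avg = a * (1 - g) + a * g    # "average" of two copies of a
    if avg < a:                  # escaped below the interval [a, a]
        print(a, g, avg)
        break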

You seem to be under the impression that floating point math is all about vague clouds of points and impossible-to-predict results. This is a common simplification, but it's not true :-). Each floating point value represents a specific real number. When you perform some operation like multiplication or addition (like when computing a weighted average), there is always an exact mathematical answer which is also a real number... it's just that it might not be one of the real numbers that can be written down as a floating point number, so the operation instead picks some nearby number that is representable. But then that result is represented exactly. Comparisons are performed exactly on the given values. Etc.
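The classic example makes the point: the rounding happens in the operation, not in the comparison:

# 0.1, 0.2, and 0.3 are each the nearest double to those decimals, and
# 0.1 + 0.2 rounds to a double that is not the one nearest 0.3:
0.1 + 0.2 == 0.3            # False, deterministically
(0.1 + 0.2) == (0.1 + 0.2)  # True: the same operation gives the same double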

If we used something like np.clip to nudge the knots into the range of the lower/upper bound, then they'll stay there :-)
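A sketch of that (illustrative, not the actual patch):

import numpy as np

lower_bound, upper_bound = 0.0, 10.0
inner_knots = np.array([-4.4e-16, 2.5, 5.0])  # tiny undershoot from rounding

# Clipping erases any sub-ULP undershoot; the clipped values then compare
# exactly against the bounds, so the strict check can stay as it is.
inner_knots = np.clip(inner_knots, lower_bound, upper_bound)
assert np.all((inner_knots >= lower_bound) & (inner_knots <= upper_bound))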

njsmith avatar Jan 15 '18 10:01 njsmith

Just to mention that I have seen the same exception a few times. My guess was that there are a few ULPs (units in the last place) of difference in the floating point computations. AFAIR it doesn't happen with "nice" numbers, but it happened with some simulated sin/cos data. Unfortunately I can't find my statsmodels comments for this right now; see https://github.com/statsmodels/statsmodels/pull/5296 (which has too many comments, too many issues, and too many moving parts).

What I did as a workaround was to set the boundary points so that they are 1e-15 (or something like that) outside the data interval defined by the min and max.
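A sketch of that workaround (names illustrative; np.nextafter gives a one-ULP nudge, and a 1e-15 relative offset behaves similarly):

import numpy as np

rng = np.random.default_rng(0)
x = np.sin(rng.uniform(0.0, 2.0 * np.pi, size=100))  # "non-nice" data

# Widen the boundary knots just past the data range so that any percentile
# rounding error still lands inside [lower, upper]:
lower = np.nextafter(x.min(), -np.inf)
upper = np.nextafter(x.max(), np.inf)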

josef-pkt avatar Oct 31 '18 14:10 josef-pkt

FYI: this is not an issue in formulaic, which allows extrapolation of the basis splines beyond the boundary knots when so configured. It's likely that this patch would work fine, but I would have to look more closely at the code to see. I also note that patsy's implementation has some weird behaviour on the boundaries anyway (I haven't looked into whether it is worth fixing).

Let me know if you are still interested in getting this merged into patsy!

matthewwardrop avatar Oct 20 '21 03:10 matthewwardrop