Bug: it's possible to define weights that cause a float roundoff and result in an unhandled IndexError
I managed to cause an unhandled error in weightedcalcs by feeding it a large dataframe with large weights, which triggers a float roundoff in numpy.cumsum.
Reproducible Example:
import numpy as np
import pandas as pd
import weightedcalcs
N = 130_000_000
big_df = pd.DataFrame(dict(
    values=np.linspace(0, 1, num=N) / 10,
    weights=np.linspace(0, 1, num=N) * 0.1 + 50,
))
big_df['weights'] = big_df['weights'].astype('float32')
w_calc = weightedcalcs.Calculator('weights')
w_calc.quantile(big_df, 'values', q=0.9)  # breaks, but should produce about 0.09 as an answer
Root Cause
Here: https://github.com/jsvine/weightedcalcs/blob/cbd2818e6f7ad82c29714f842228bfd4f65c008f/weightedcalcs/core.py#L87
cumsum reaches a float roundoff threshold and plateaus. As a result, df["cumul_prop"] never reaches 1; in the example above, `df["cumul_prop"].max()` is about 0.16.
Then https://github.com/jsvine/weightedcalcs/blob/cbd2818e6f7ad82c29714f842228bfd4f65c008f/weightedcalcs/core.py#L88 yields an empty dataframe, and the next line fails with an IndexError.
Using float64 fixes the example above, but the same problem will inevitably reappear with larger weights or larger dataframes.
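For reference, here is a minimal sketch of the plateau in isolation. The array is hypothetical and much smaller than the dataframe in the report; the point is only that numpy.sum (pairwise summation) and numpy.cumsum (a plain running sum in the array's own dtype) diverge in float32:

import numpy as np

# hypothetical array, sized so the running float32 sum exceeds 2**24
w = np.ones(30_000_000, dtype=np.float32)

print(w.sum())           # pairwise summation keeps this accurate (~3.0e7)
print(np.cumsum(w)[-1])  # the running sum stops growing once it hits 2**24 = 16777216.0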
Hotfix
The issue can be mitigated by splitting the normalization into two steps:
df["weights"] = df["weights"] / df["weights"].sum()
df["cumul_prop"] = df["weights"].cumsum()
Upstream
The NumPy devs have an explanation for this behavior and don't seem to consider it a bug: https://github.com/numpy/numpy/issues/24443
Next Steps
I can open a PR with the hotfix and an extra check for this behavior, so that such cases are at least caught instead of causing an IndexError.
Found a related numpy issue: https://github.com/numpy/numpy/issues/24443
Seems like a "won't fix". But it also creates an opportunity to add a custom check for this condition on the weightedcalcs side. Again, I am happy to submit a PR.
p.s. never mind, the normalization only solves a fraction of cases. Beyond a certain dataframe size it is simply necessary to use float64. Still, we should check whether this is happening.
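For the reproducible example at the top, casting the weights up front is enough (a quick sketch reusing the names from that example):

# with the example dataframe from above, float64 avoids the plateau entirely
big_df['weights'] = big_df['weights'].astype('float64')
w_calc.quantile(big_df, 'values', q=0.9)  # now returns a quantile instead of raising an IndexError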
Thanks for flagging this, @Demetrio92 — what sort of check would you propose?
@jsvine hey, sorry it took me a while to get back to you.
As discussed in that numpy issue, this type of behavior is difficult to predict without making assumptions. However, I expect weight arrays to rarely contain exact zeros, and the values to be somewhat balanced; this means the cumsum will eventually plateau consistently, and we can check for that. Also, even if users understand that they are working with very large or small numbers and could get numerical imprecision, losing 10% or so of your weights sounds undesirable, so I would raise if the cumsum deviates from the sum by more than 10%. So, here are the two checks I would implement:
# sum sanity check
if abs(1 - df["weights"].cumsum().iloc[-1] / df["weights"].sum()) > 0.1:
    raise FloatingPointError(
        f"Can't normalize weights to probabilities as {self.weight_var}.cumsum() is inconsistent with "
        f"{self.weight_var}.sum(), most likely due to a float round-off error. Try using a higher float "
        f"precision (e.g. in pandas use `astype('float64')`). Also check for extremely large or small "
        f"values and treat those manually."
    )

# plateau check
min_len_to_check_plateau = 1000
check_last_n_plateau = 100
if len(df) > min_len_to_check_plateau:
    last_n_are_plateau = (np.diff(df["weights"].cumsum().iloc[-check_last_n_plateau:]) == 0).all()
    last_n_are_non_zero = (df["weights"].iloc[-check_last_n_plateau:] != 0).any()
    if last_n_are_plateau and last_n_are_non_zero:
        raise FloatingPointError(
            f"Normalizing weights to probabilities seems to fail due to {self.weight_var}.cumsum() plateauing, "
            f"most likely due to a float round-off error. Try using a higher float precision (e.g. in pandas "
            f"use `astype('float64')`). Also check for extremely large or small values and treat those manually."
        )
I'd pack those checks into a function and add a toggle to the quantile function, with the checks enabled by default but with the option to disable them if users are sure that's what they want. I generally wouldn't expect either check to produce false positives often. That said, the second one is definitely more controversial, and maybe the first one alone would suffice; in that case I might keep the second check but have it warn rather than raise.
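Roughly what I have in mind (a sketch only; the helper name and the check_weights parameter are placeholders, not existing weightedcalcs API):

import numpy as np
import pandas as pd

def check_cumulative_weights(weights: pd.Series, weight_var: str, rel_tol: float = 0.1) -> None:
    """Raise if the cumulative sum of the weights looks corrupted by float round-off."""
    cumsum = weights.cumsum()
    # sum sanity check: cumsum should end up roughly where sum() ends up
    if abs(1 - cumsum.iloc[-1] / weights.sum()) > rel_tol:
        raise FloatingPointError(
            f"{weight_var}.cumsum() is inconsistent with {weight_var}.sum(); "
            "try a higher float precision, e.g. `astype('float64')`."
        )
    # plateau check: the tail of the cumsum should keep growing if the tail weights are non-zero
    n = 100
    if len(weights) > 1000:
        if (np.diff(cumsum.iloc[-n:]) == 0).all() and (weights.iloc[-n:] != 0).any():
            raise FloatingPointError(
                f"{weight_var}.cumsum() plateaus although the trailing weights are non-zero; "
                "try a higher float precision, e.g. `astype('float64')`."
            )

# hypothetical toggle on the quantile method, enabled by default:
# def quantile(self, df, value_var, q, check_weights=True):
#     if check_weights:
#         check_cumulative_weights(df[self.weight_var], self.weight_var)
#     ...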
p.s. this is not the first weighted quantile package that is affected. See this numpy issue:
https://github.com/numpy/numpy/issues/9329
Thanks for providing that explanation. The first check seems straightforward, the second a little less so. A couple more questions:
- In what scenario would the first check pass but the second would not?
- 10% as the threshold for the first check seems generous, at least as an initial reaction. I'd expect `df["weights"].cumsum().iloc[-1]` to be virtually identical to `df["weights"].sum()` under "safe" conditions. But am I missing something / misunderstanding?