optbinning
Summary statistics could be incorrect when using
I created a simple dataframe with age, salary, and num_obs:
import pandas as pd
from optbinning import ContinuousOptimalBinning
df = pd.DataFrame({'age': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'salary': {0: 0.7739560485559633,
1: 0.4388784397520523,
2: 0.8585979199113825,
3: 0.6973680290593639,
4: 0.09417734788764953,
5: 0.9756223516367559,
6: 0.761139701990353,
7: 0.7860643052769538,
8: 0.12811363267554587,
9: 0.45038593789556713},
'num_obs': {0: 5, 1: 4, 2: 3, 3: 7, 4: 6, 5: 6, 6: 5, 7: 7, 8: 5, 9: 5}})
Better displayed as:
age | salary | num_obs |
---|---|---|
1 | 0.773956 | 5 |
2 | 0.438878 | 4 |
3 | 0.858598 | 3 |
4 | 0.697368 | 7 |
5 | 0.0941773 | 6 |
6 | 0.975622 | 6 |
7 | 0.76114 | 5 |
8 | 0.786064 | 7 |
9 | 0.128114 | 5 |
10 | 0.450386 | 5 |
I then generated optimal bins using num_obs as sample weights:
optb = ContinuousOptimalBinning(dtype="numerical")
optb.fit(df['age'], df['salary'], sample_weight=df['num_obs'])
binning_table = optb.binning_table
binning_table.build()
Which results in:
| | Bin | Count | Count (%) | Sum | Std | Mean | Min | Max | Zeros count | WoE | IV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-inf, 1.50) | 5 | 0.0943396 | 3.86978 | 1.5479120971119267 | 0.773956 | 3.86978 | 3.86978 | 0 | 0.175803 | 0.0165852 |
| 1 | [1.50, 4.50) | 14 | 0.264151 | 9.21288 | 1.401113568517211 | 0.658063 | 1.75551 | 4.88158 | 0 | 0.0599101 | 0.0158253 |
| 2 | [4.50, 8.50) | 24 | 0.45283 | 15.7269 | 1.6960750859575562 | 0.655289 | 0.565064 | 5.85373 | 0 | 0.0571365 | 0.0258731 |
| 3 | [8.50, 9.50) | 5 | 0.0943396 | 0.640568 | 0.25622726535109175 | 0.128114 | 0.640568 | 0.640568 | 0 | -0.470039 | 0.0443433 |
| 4 | [9.50, inf) | 5 | 0.0943396 | 2.25193 | 0.9007718757911344 | 0.450386 | 2.25193 | 2.25193 | 0 | -0.147767 | 0.0139403 |
| 5 | Special | 0 | 0 | 0 | nan | 0 | nan | nan | 0 | -0.598153 | 0 |
| 6 | Missing | 0 | 0 | 0 | nan | 0 | nan | nan | 0 | -0.598153 | 0 |
| Totals | | 53 | 1 | 31.7021 | | 0.598153 | 0.565064 | 5.85373 | 0 | 2.10696 | 0.116567 |
Notice that row 3 (bin [8.50, 9.50)) has a Std different from 0. Since the only age that falls in that bin is 9, I don't understand how the std could be anything other than 0. The other statistics also look odd: for example, Min and Max in that row both equal 0.640568, which is the Sum, rather than the single salary value 0.128114.
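For comparison, here is a minimal sketch of the statistics I would expect. It assumes `num_obs` acts as frequency weights (each row counted `num_obs` times) and reuses the split points shown in the table above. The `weighted_stats` helper is hypothetical and not part of optbinning:

```python
import numpy as np

age = np.arange(1, 11)
salary = np.array([
    0.7739560485559633, 0.4388784397520523, 0.8585979199113825,
    0.6973680290593639, 0.09417734788764953, 0.9756223516367559,
    0.761139701990353, 0.7860643052769538, 0.12811363267554587,
    0.45038593789556713,
])
num_obs = np.array([5, 4, 3, 7, 6, 6, 5, 7, 5, 5])

# Split points copied from the binning table above.
splits = [1.5, 4.5, 8.5, 9.5]
bin_idx = np.digitize(age, splits)  # 0 -> (-inf, 1.50), 1 -> [1.50, 4.50), ...


def weighted_stats(x, w):
    """Per-bin statistics, treating w as frequency weights (an assumption)."""
    mean = np.average(x, weights=w)
    var = np.average((x - mean) ** 2, weights=w)
    return {
        "count": int(w.sum()),
        "sum": float((w * x).sum()),
        "mean": float(mean),
        "std": float(np.sqrt(var)),
        "min": float(x.min()),  # min/max of the raw values, not of w * x
        "max": float(x.max()),
    }


expected = {
    int(b): weighted_stats(salary[bin_idx == b], num_obs[bin_idx == b])
    for b in np.unique(bin_idx)
}

for b, stats in expected.items():
    print(b, stats)
```

Under this interpretation, bin 3 contains a single distinct salary, so its std is 0 (up to floating-point rounding) and its min and max both equal 0.128114. Count, Sum, and Mean agree with the table above, while Std, Min, and Max do not.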
Please let me know whether there is an issue when using weights or whether I'm misunderstanding the results.
Thanks!
PS: this might be related to this issue: https://github.com/guillermo-navas-palencia/optbinning/issues/323