
Adding auto threshold to DropHighPSIFeatures

Open glevv opened this issue 2 years ago • 5 comments

Is your feature request related to a problem? Please describe.

The standard thresholds of 0.1 and 0.25 are empirical, but the threshold can instead be calculated from the data and the parameters of the transformation.

Describe the solution you'd like

This dissertation describes two formulas for calculating the PSI threshold from the number of bins and the number of data points in the base and test datasets. In my experiments, Chi2(0.999, bins) works best. We could keep 0.25 as the default, but also provide an 'auto' option.
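For illustration, here is a minimal sketch of what a chi-square-based threshold could look like. It is not taken from the dissertation or from feature_engine: the helper name `psi_auto_threshold` is hypothetical, and the sketch assumes that PSI under "no shift" behaves roughly like a chi-square variable scaled by `1/n_base + 1/n_test`, with `bins - 1` degrees of freedom (the issue text only writes `Chi2(0.999, bins)`, so the exact degrees of freedom are an assumption).

```python
from scipy.stats import chi2


def psi_auto_threshold(n_base: int, n_test: int, bins: int, q: float = 0.999) -> float:
    """Hypothetical helper: data-driven PSI threshold.

    Assumption (not confirmed by the issue): under no distribution shift,
    PSI is approximately (1/n_base + 1/n_test) times a chi-square variable.
    """
    if bins < 2:
        raise ValueError("bins must be at least 2")
    if n_base <= 0 or n_test <= 0:
        raise ValueError("dataset sizes must be positive")
    # Assumption: bins - 1 degrees of freedom, as in a goodness-of-fit test.
    dof = bins - 1
    # q-th quantile of the chi-square distribution, scaled by the sample sizes.
    return float(chi2.ppf(q, dof) * (1.0 / n_base + 1.0 / n_test))


# The threshold shrinks as the datasets grow and rises with the number of bins.
for n in (1_000, 10_000, 100_000):
    print(n, round(psi_auto_threshold(n, n, bins=10), 5))
```

With an option like this, `threshold='auto'` would replace the fixed 0.25 with a value computed from the split sizes and the bin count, while a numeric threshold would keep working as before.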

Aug 06 '22 07:08 glevv

Thanks @GLevV for the suggestion.

@gverbock what do you think about this?

Aug 06 '22 13:08 solegalli

I like the idea of using the number of points to set a common threshold for all features. On the other hand, the number of bins is arbitrary, so there will still be an arbitrary factor anyway (I still need to read the dissertation on that topic). @GLevV what is your view on this?

Aug 08 '22 09:08 gverbock

@gverbock yes, the threshold will depend on the size of the dataset (and on split_frac) and on the number of bins. That makes sense, since a larger number of bins is more sensitive to changes in distribution, and the formulas reflect that. In practice they give lower thresholds than the empirical 0.1 or 0.25 in most cases, but we could keep the defaults as they are.
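To make the dependence on dataset size and bin count concrete, here is a small simulation sketch using only the standard library. It draws two samples from the same distribution, so any PSI it measures is pure sampling noise. The helper names `psi` and `mean_null_psi`, the equal-frequency binning, and the `1e-6` floor for empty bins are all assumptions made for this illustration, not the feature_engine implementation.

```python
import bisect
import math
import random
import statistics


def psi(base, test, bins):
    """PSI between two samples, using equal-frequency bins cut on the base sample."""
    ordered = sorted(base)
    # Interior cut points at the 1/bins, 2/bins, ... quantiles of the base sample.
    cuts = [ordered[len(ordered) * i // bins] for i in range(1, bins)]

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            counts[bisect.bisect_right(cuts, x)] += 1
        # Floor the fractions so an empty bin does not produce log(0).
        return [max(c / len(sample), 1e-6) for c in counts]

    p, q = fractions(base), fractions(test)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def mean_null_psi(n, bins, repeats=100, seed=0):
    """Average PSI over repeated draws of two same-distribution samples of size n."""
    rng = random.Random(seed)
    values = []
    for _ in range(repeats):
        base = [rng.gauss(0, 1) for _ in range(n)]
        test = [rng.gauss(0, 1) for _ in range(n)]
        values.append(psi(base, test, bins))
    return statistics.mean(values)


# Same distribution in both samples: PSI noise grows with bins, shrinks with n.
for n in (500, 5_000):
    for bins in (5, 10, 20):
        print(n, bins, round(mean_null_psi(n, bins), 4))
```

Because the noise level changes with both settings, a fixed cut-off such as 0.25 is not equally strict across configurations, which is the motivation for a threshold computed from the data.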

Should also be pretty straightforward to implement, I think

Aug 08 '22 10:08 glevv

@GLevV would you like to give it a go?

Aug 08 '22 10:08 solegalli

@solegalli ye, why not

Aug 08 '22 12:08 glevv

closed with #498

Aug 25 '22 05:08 glevv