feature_engine
Adding auto threshold to DropHighPSIFeatures
Is your feature request related to a problem? Please describe. The standard thresholds of 0.1 and 0.25 are empirical, but there are alternatives that calculate the threshold from the data and the parameters of the transformation.
Describe the solution you'd like. There are two formulas described in this dissertation for calculating the PSI threshold from the number of bins and the number of data points in the base and test datasets. In my experiments, Chi2(0.999, bins) works best. We could keep the default at 0.25 but also provide an 'auto' option.
Thanks @GLevV for the suggestion.
@gverbock what do you think about this?
I like the idea of using the number of points to set a common threshold for all features. On the other hand, the number of bins is arbitrary, so there will be an arbitrary factor anyway (I still need to read the dissertation on that topic). @GLevV what is your view on this?
@gverbock yes, the threshold will depend on the size of the dataset (and on split_frac) and the number of bins, which makes sense: a larger number of bins is more sensitive to changes in distribution, and the formulas reflect that. In practice they give lower thresholds than the empirical 0.1 or 0.25 in most cases, but we could keep the defaults as they are.
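To make the dependence on dataset size and bin count concrete, here is a minimal sketch of how such an 'auto' threshold could be computed. It is not the implementation that was merged. It assumes the chi-square approximation PSI ≈ (1/N + 1/M) · χ² with bins − 1 degrees of freedom, where N and M are the sizes of the base and test sets; the degrees-of-freedom choice and the function names are assumptions made for illustration. To stay dependency-free it uses the Wilson-Hilferty approximation for the chi-square quantile; `scipy.stats.chi2.ppf(q, df)` would give the exact value.

```python
from math import sqrt
from statistics import NormalDist


def chi2_quantile_approx(q: float, df: int) -> float:
    """Approximate chi-square quantile via the Wilson-Hilferty transform.

    Swapped in to avoid a SciPy dependency; scipy.stats.chi2.ppf(q, df)
    returns the exact quantile.
    """
    z = NormalDist().inv_cdf(q)  # standard normal quantile
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * sqrt(c)) ** 3


def auto_psi_threshold(n_base: int, n_test: int, bins: int, q: float = 0.999) -> float:
    """Hypothetical helper: PSI threshold from sample sizes and bin count.

    Assumption: PSI is approximately (1/N + 1/M) times a chi-square
    variable with (bins - 1) degrees of freedom.
    """
    return chi2_quantile_approx(q, bins - 1) * (1.0 / n_base + 1.0 / n_test)


if __name__ == "__main__":
    # More bins raise the threshold; more data lowers it.
    for bins in (5, 10, 20):
        print(bins, round(auto_psi_threshold(1000, 1000, bins), 4))
    print(10, round(auto_psi_threshold(10_000, 10_000, 10), 4))
```

Under these assumptions, 1,000 rows in each set and 10 bins give a threshold below the empirical 0.1, in line with the observation above.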
Should also be pretty straightforward to implement, I think
@GLevV would you like to give it a go?
@solegalli ye, why not
closed with #498