Metrics and scale
Hi, very interesting work! I was wondering how you guarantee the same scale. Maybe I missed it in the code for RF, but it seems to be mainly feature permutation importance (the regular feature importance is biased anyway for most tree-based models). How can you compare one importance against another on the same scale? Do you need to calibrate the model first (potentially another step) if you want to use importances/predictions and compare between them?
You mention correlation, but the metric is definitely not symmetric, and I'm not sure it actually satisfies the triangle inequality, so how can it be used in actual workflows to compare apples to apples? Would love it if you have a draft of the arXiv paper where these points are touched upon, or maybe there are details in the code I missed. Thanks!
@feribg I don't think you've missed anything :smile: The arXiv draft is almost ready for early circulation for feedback, so I'd be happy to add you to the list if you'd like to share an email address. You can write me at the email on my profile.
For your specific questions here:
- Scaling: As you noted, most of the "underlying" inference methods and "importance" calculations do not produce scaled quantities. There are a number of choices to be made about this scaling, so the quantities available in the library today are probably better viewed as analogous to MI, not correlation (or even covariance).
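To make the MI analogy concrete, here is a hedged sketch (plain scikit-learn on toy data, not rfcorr's internals; the sum-to-one normalization is just one arbitrary choice) of why raw permutation importances are not on a correlation-like scale:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
raw = permutation_importance(rf, X, y, n_repeats=5, random_state=0).importances_mean

# Raw permutation importances live on the scale of the score drop, not in
# [-1, 1] like a correlation; a per-fit normalization makes them sum to 1,
# which is analogous to a discrete distribution or MI shares, not to rho.
normalized = raw / raw.sum()
print(normalized)
```

The normalization makes values comparable within one fit, but (as discussed above) says nothing about comparability across fits or datasets.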
- Importance: You might notice that both mean impurity and permutation methods are currently implemented for most inference methods. There are also even simpler methods, like the proportion of trees in which feature X_i occurs (possibly normalized in some cases by the proportion of features selected in subspace sampling). It's not clear which of these is most useful or what trade-offs might manifest in this context.
NB: This is all relative :wink: Feature importance calculations all have their weaknesses, but so do the various corr and MI formulations.
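As a rough illustration of how differently these notions behave, a toy comparison (scikit-learn only; the tree-occurrence count is a naive rendering of the proportion idea above, not rfcorr's implementation):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=300, n_features=4, n_informative=2, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Three importance notions, each on its own scale:
impurity = rf.feature_importances_                      # mean impurity decrease, sums to 1
perm = permutation_importance(rf, X, y, n_repeats=5,
                              random_state=0).importances_mean  # score drop, unbounded
occurrence = np.array([                                 # share of trees that split on feature i
    np.mean([i in t.tree_.feature for t in rf.estimators_])
    for i in range(X.shape[1])
])
print(impurity, perm, occurrence)
```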
- Calibration: Yes, you are headed in the same direction I am currently investigating. Also of note: some bootstrapping or calibration will additionally allow for the calculation of confidence intervals or other uncertainty estimates.
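A minimal sketch of the bootstrap idea (toy data, not the library's calibration; the number of resamples and the 95% interval are arbitrary choices):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] + rng.normal(scale=0.2, size=200)

# Resample the data, refit, and collect importances to get a crude
# uncertainty estimate for each feature's importance.
scores = []
for b in range(20):
    idx = rng.integers(0, len(y), size=len(y))
    rf = RandomForestRegressor(n_estimators=30, random_state=b).fit(X[idx], y[idx])
    scores.append(permutation_importance(rf, X[idx], y[idx], n_repeats=3,
                                         random_state=b).importances_mean)
scores = np.array(scores)
lo, hi = np.percentile(scores, [2.5, 97.5], axis=0)  # per-feature 95% interval
print(lo, hi)
```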
- Symmetry: Since my motivation is based on issues I have with axiomatic cov/corr in practice, the lack of symmetry is more of a feature than a bug :wink: In principle, this possibly reveals that we are closer to measuring Y~X and X~Y than \rho(X,Y), but from a cognitive perspective, humans don't seem to disentangle the two concepts anyway...
I have experimented with symmetrization via the expected routes: copying the upper/lower triangular forms, and (R + R^T)/2.
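Both symmetrization routes are one-liners in NumPy (the matrix R here is a toy example; the rows-as-target, columns-as-predictor reading is just for illustration):

```python
import numpy as np

# Toy asymmetric "relatedness" matrix R (row i: target, column j: predictor).
R = np.array([[1.0, 0.9, 0.2],
              [0.4, 1.0, 0.1],
              [0.3, 0.2, 1.0]])

# The two symmetrization routes mentioned above:
R_avg = (R + R.T) / 2                    # average R with its transpose
R_upper = np.triu(R) + np.triu(R, 1).T   # copy the upper triangle onto the lower
print(R_avg)
print(R_upper)
```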
- Application: In practice, many forms of correlation, MI, etc. are already not comparable, either across each other or even with mixed variable types. For example, if I have categorical or discrete data mixed with real-valued data, like a log-return and an ordinal/categorical expert opinion, most correlation methods are already useless.
Mixing categorical features and non-categorical features together is also a key motivator behind this experiment.
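A toy illustration of the mixed-type case (entirely hypothetical data: an ordinal "opinion" that depends on the magnitude of a return, which Pearson correlation misses but a simple model fit picks up):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Continuous "log-return" feature and an ordinal expert opinion (0/1/2)
# that depends on its magnitude, not its sign.
returns = rng.normal(size=(500, 1))
opinion = np.digitize(np.abs(returns[:, 0]), [0.5, 1.5])  # 3 ordinal buckets

# Pearson correlation on the raw pair is near zero (the relation is even in
# the return), while a forest fit of opinion ~ return captures it easily.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(returns, opinion)
acc = rf.score(returns, opinion)
corr = np.corrcoef(returns[:, 0], opinion)[0, 1]
print(acc, corr)
```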
Thank you again for sharing your questions! Happy to answer any others you may have.
Thanks for the quick response! I will ping you via email to take a read through first before asking/contributing more. Again, great idea, and the fact that it's simple-ish makes it even better! Just to add a few words to what you already mentioned (I'm also coming from a finance background, so this is probably most relevant to your work):
Scaling: I get that point, and it does resemble MI/VOI the most. The part where I'm not certain is whether the measure is consistent even within the same experiment, i.e., a feature importance score of 0.8 for one data sample and 0.7 for another might not actually mean that 0.7 is less informative (maybe the bootstrapping ideas you suggest improve that to some extent, but I'm still curious whether there's a theoretical way to prove it).
Importance: yep, noted. I agree that making this pluggable is great; it can be any of permutation, N-1 importance, SHAP, whatever. To the previous point, though, those likely behave differently, and consequently the comparison of their output values might need different adjustments (that's a wild guess on my part). To your point, this is likely very open to research, but specifically in the finance context, N-1 (removing a feature and retraining a new model, rather than permuting it) might be more stable because of the noise and non-IID data.
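For concreteness, a minimal sketch of the N-1 / drop-column idea (toy data, cross-validated R^2 as the score; not a claim about how rfcorr should implement it):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=300)

def cv_score(X_):
    # Cross-validated R^2 so the comparison is out-of-sample.
    return cross_val_score(RandomForestRegressor(n_estimators=30, random_state=0),
                           X_, y, cv=3).mean()

base = cv_score(X)
# N-1 / drop-column importance: retrain without each feature and record the
# score drop; slower than permutation, but no model is ever evaluated on
# permuted inputs it never saw in training.
drop = [base - cv_score(np.delete(X, j, axis=1)) for j in range(X.shape[1])]
print(drop)
```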
Calibration: https://github.com/scikit-learn-contrib/boruta_py does a reasonably good job of trying to be more robust, with resampling etc., and could serve as an inspiration.
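The core Boruta trick can be sketched in a few lines without the package itself (shuffled "shadow" copies of the features serve as an importance threshold; this is a toy one-pass version, whereas boruta_py adds iteration and statistical testing):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] + rng.normal(scale=0.2, size=300)

# Append column-wise shuffled "shadow" copies of each feature; shuffling
# breaks any relation to y, so shadow importances estimate the noise floor.
shadows = rng.permuted(X, axis=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(
    np.hstack([X, shadows]), y)
imp = rf.feature_importances_

threshold = imp[X.shape[1]:].max()              # best shadow importance
selected = np.where(imp[:X.shape[1]] > threshold)[0]  # features beating it
print(selected)
```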
Symmetry: agreed. I guess the name rfcorr kind of throws me off into thinking of it as corr, but it's more of a similarity measure. It might, however, need some sanity check: if x predicts y really well, but not vice versa, then maybe the underlying "model" or similarity estimator is somewhat biased or misspecified. Curious what your thoughts are? I can't think of many cases where that would happen in an RF context.
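The x-predicts-y-but-not-vice-versa case does arise with simple nonlinearities rather than model misspecification; e.g. y = x^2 with x symmetric about 0, where x is unrecoverable from y up to sign (toy sketch, hypothetical data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
y = x[:, 0] ** 2 + rng.normal(scale=0.05, size=500)

# y ~ x is easy; x ~ y is ambiguous (+sqrt(y) vs -sqrt(y) are indistinguishable),
# so the cross-validated R^2 collapses in that direction.
r2_y_given_x = cross_val_score(RandomForestRegressor(n_estimators=50, random_state=0),
                               x, y, cv=3).mean()
r2_x_given_y = cross_val_score(RandomForestRegressor(n_estimators=50, random_state=0),
                               y.reshape(-1, 1), x[:, 0], cv=3).mean()
print(r2_y_given_x, r2_x_given_y)
```

Here the asymmetry reflects the data-generating process itself, which supports reading the matrix as directional Y~X / X~Y entries rather than as evidence of a biased estimator.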