"Coleman" approach to feature_importances_
The basic idea to get a feature_importances distribution map from the Coleman approach is:
1. train one forest on X
2. train another forest on permuted X
3. compute the per-tree feature_importances array, shape (n_trees, n_features_in), for both forests
4. do the "Coleman" idea and resample trees from both forests M times
You then have a feature-importance null distribution that you can compare against the feature_importances_ array of the first forest from step 1.
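For concreteness, a minimal sketch of these steps, assuming sklearn-style forests (each tree exposes `feature_importances_`) and numpy arrays `X`, `y` already in scope; the resampling here is one plausible reading of step 4, not the sktree implementation linked below:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def tree_importances(forest):
    # stack per-tree importances into a (n_trees, n_features_in) array
    return np.vstack([tree.feature_importances_ for tree in forest.estimators_])

# step 1: forest on the observed data
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# step 2: forest on permuted X (rows shuffled independently of y,
# which breaks the X-y relationship under the null)
X_perm = X[rng.permutation(len(X))]
rf_null = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_perm, y)

# step 3: per-tree importances for both forests
imp = tree_importances(rf)            # (n_trees, n_features_in)
imp_null = tree_importances(rf_null)

# step 4: pool the 2 * n_trees importance rows and resample M times;
# each resample splits the pool into two halves and records the
# difference of their mean importances, giving a null distribution
pooled = np.vstack([imp, imp_null])
M, B = 1000, imp.shape[0]
null_dist = np.empty((M, imp.shape[1]))
for m in range(M):
    idx = rng.permutation(len(pooled))
    null_dist[m] = pooled[idx[:B]].mean(axis=0) - pooled[idx[B:]].mean(axis=0)
```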
Code that builds a Coleman forest approach for doing multivariate hypothesis testing: https://github.com/neurodata/scikit-tree/blob/95e2597d5948c77bea565fc91b82e1a00b43cac8/sktree/stats/forestht.py#L1222
cc: @jovo @jdey4
for step 4, i think we just compute the distribution of feature importance under the null, and then, we can compute a p-value for the importance of each feature under the alternative, right?
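(For reference, given a null distribution like `null_dist` from the sketch above, the per-feature p-value could be computed along these lines; illustrative, not the sktree code:)

```python
# observed statistic: mean importance difference between the two forests
obs = imp.mean(axis=0) - imp_null.mean(axis=0)          # (n_features_in,)
# one-sided empirical p-value per feature, with the +1 correction
pvals = (1 + (null_dist >= obs).sum(axis=0)) / (M + 1)
```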
Oh I guess the permuted forest technically gives that(?), but I was assuming you wanted like M forests each with a slightly different feature_importances map constructed from a different collection of trees?
oh, I thought just 1 null forest. We compute feature_importance for all the features.
We need M forests for the p-value computation for two-sample testing, but I don't think we need more than 1 forest for this?
@jovo from the above steps as mentioned by @adam2392, I thought we wanted the distribution of the feature_importance score. But if I understood correctly today, you want the rank, right? I get the rank from the permuted forest, then the rank from the non-permuted forest, and count the number of times each feature ranks higher in the non-permuted forest than in the permuted one? Should I repeat the process several times, or do you want to subsample after training a random forest with a huge number of trees? I repeated the experiment for several reps since the feature dimension is 1.5 million and there is higher variance in a forest with 100 trees.
@jdey4 write pseudocode so we are super clear, then i can quibble with anything i don't like.
Steps:
1. Consider $n$ iid samples $D_n = (x_i, y_i)_{i=1}^n$ and the label-permuted data $\tilde{D}_n$.
2. Train two random forests ($B$ trees each) on $D_n$ and $\tilde{D}_n$: $RF$ and $RF_0$, respectively.
3. Consider a specific feature $F_j$. Calculate its rank from $RF$: $r$.
4. Calculate the rank of $F_j$ from $RF_0$: $r_0$.
5. Calculate $r_0 - r$.
6. Now randomly sample $B$ trees from $\{RF, RF_0\}$: $RF^*$, and denote the rest of the trees as $RF^*_0$.
7. Calculate $r^*_0 - r^*$ as in step 5.
8. Repeat steps 6 and 7 $N_0$ times.
9. Calculate $p = \frac{1}{N_0+1}\left[1 + \sum \mathbb{I}\big((r_0 - r) \leq (r^*_0 - r^*)\big)\right]$.

Thoughts @jovo @adam2392? Should I make a PR somewhere in sktree or first make sure it works in the CMI repo?
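A runnable sketch of these steps for a single feature $F_j$, assuming the rank is derived from the mean per-tree `feature_importances_` of a tree collection; the `rank_of` helper and in-scope `X`, `y` are illustrative assumptions, not sktree API:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def rank_of(trees, j):
    # rank of feature j (1 = most important) under the mean
    # feature importance of the given collection of trees
    imp = np.mean([t.feature_importances_ for t in trees], axis=0)
    order = np.argsort(-imp)              # descending importance
    return int(np.where(order == j)[0][0]) + 1

B, N0, j = 100, 1000, 0

# steps 1-2: D_n and its label-permuted copy, one forest each
rf = RandomForestClassifier(n_estimators=B, random_state=0).fit(X, y)
y_tilde = y[rng.permutation(len(y))]
rf0 = RandomForestClassifier(n_estimators=B, random_state=1).fit(X, y_tilde)

# steps 3-5: observed rank difference
r, r0 = rank_of(rf.estimators_, j), rank_of(rf0.estimators_, j)
observed = r0 - r

# steps 6-8: resample B trees from the pooled 2B trees, N0 times
pooled = rf.estimators_ + rf0.estimators_
null_diffs = np.empty(N0)
for m in range(N0):
    idx = rng.permutation(2 * B)
    rf_star = [pooled[i] for i in idx[:B]]
    rf0_star = [pooled[i] for i in idx[B:]]
    null_diffs[m] = rank_of(rf0_star, j) - rank_of(rf_star, j)

# step 9: p-value with the +1 correction
p = (1 + np.sum(observed <= null_diffs)) / (N0 + 1)
```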