
[FEA] Add support for computing feature_importances in RF

Open teju85 opened this issue 3 years ago • 9 comments

Is your feature request related to a problem? Please describe. The RF implementation should support computing the feature_importances_ property, just as it is exposed in sklearn.
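For reference, this is the sklearn behavior being requested (a minimal sklearn snippet, not cuML code):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_features=4, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.feature_importances_)  # normalized: sums to 1.0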

Describe the solution you'd like

  1. By default, we should compute normalized feature_importances_ (i.e., all the importances across the features sum to 1.0).
  2. The implementation done in sklearn is here. We have all of this information in our Node. While building the tree, we just need to keep accumulating each feature's importance as we add more nodes (see the sketch after this list).
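A minimal sketch of that accumulation for a single tree, assuming hypothetical node objects with n_samples, impurity, feature, left, and right attributes; this mirrors sklearn's weighted impurity decrease and is not an actual cuML API:

import numpy as np

def tree_feature_importances(split_nodes, n_features, n_samples_total):
    # Accumulate each split's weighted impurity decrease into its feature.
    importances = np.zeros(n_features)
    for node in split_nodes:
        left, right = node.left, node.right
        decrease = (node.n_samples / n_samples_total) * (
            node.impurity
            - (left.n_samples / node.n_samples) * left.impurity
            - (right.n_samples / node.n_samples) * right.impurity
        )
        importances[node.feature] += decrease
    # Normalize so the importances sum to 1.0.
    return importances / importances.sum()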

teju85 avatar Feb 19 '21 09:02 teju85

Definitely agreed. Not sure we'll have enough bandwidth to get this into 0.19 (given the work going into the new backend), but it should be prioritized highly after that.

JohnZed avatar Feb 19 '21 18:02 JohnZed

Here's one use case that requires this attribute to be present: https://github.com/willb/fraud-notebooks/blob/develop/03-model-random-forest.ipynb

teju85 avatar Feb 25 '21 05:02 teju85

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Mar 27 '21 06:03 github-actions[bot]

We are interested in using this feature for our use case too.

sooryaa-thiruloga avatar Jul 14 '21 20:07 sooryaa-thiruloga

This would also be useful for tools like Boruta, a popular feature-selection library that is part of scikit-learn-contrib. There is a Boruta issue asking for support for cuML estimators.

beckernick avatar Mar 18 '22 16:03 beckernick

Tagging @vinaydes and @venkywonka to see if we can have Venkat start on this?

teju85 avatar Mar 18 '22 16:03 teju85

This is probably not the most efficient implementation, but in case anyone else needs it:

import numpy as np


def calculate_importances(nodes, n_features):
    # `nodes` holds one root node per tree; each node is a dict with
    # "gain", "instance_count", "split_feature", and "children" keys
    # (leaf nodes have no "gain").
    importances = np.zeros((len(nodes), n_features))

    def accumulate_gains(node, feature_gains):
        # Leaf nodes carry no split, so they contribute nothing.
        if "gain" not in node:
            return

        # Weight the split's gain by the number of samples it covers.
        samples = node["instance_count"]
        gain = node["gain"]
        feature = node["split_feature"]
        feature_gains[feature] += gain * samples

        for child in node["children"]:
            accumulate_gains(child, feature_gains)

    for i, root in enumerate(nodes):
        # Reset the accumulator per tree so gains don't leak across trees.
        feature_gains = np.zeros(n_features)
        accumulate_gains(root, feature_gains)
        importances[i] = feature_gains / feature_gains.sum()

    # Average the per-tree importances across the forest.
    return np.mean(importances, axis=0)

You can see the logic behind it here: https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3
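A hedged usage sketch, assuming the node keys above match the JSON dump returned by cuML's RF get_json() (verify the schema against your cuML version):

import json
import numpy as np
from cuml.ensemble import RandomForestClassifier

X = np.random.rand(200, 4).astype(np.float32)
y = (X[:, 0] > 0.5).astype(np.int32)

model = RandomForestClassifier(n_estimators=10).fit(X, y)
roots = json.loads(model.get_json())  # assumed: one root dict per tree
print(calculate_importances(roots, n_features=4))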

hafarooki avatar Apr 16 '22 09:04 hafarooki

Cross-linking an issue that asks for this feature and OOB support: https://github.com/rapidsai/cuml/issues/3361

beckernick avatar Jun 29 '22 18:06 beckernick

This is an important issue worth a look.

Wulin-Tan avatar Aug 28 '22 16:08 Wulin-Tan