Segmentation fault when trying to get feature importance of a multi-label binary classifier
- Operating System: linux
- Python Version: 3.10.14
- XGBoost Version: 2.1.0
I am experiencing a segmentation fault with XGBoost 2.1.0 when accessing feature importances in a multi-label binary classification model. The model trains and predicts as expected; however, when I attempt to retrieve feature importances via either xgb_model.feature_importances_ or xgb_model.get_score(importance_type='weight'), the process crashes. In a Jupyter notebook this kills the kernel, and when run from the terminal it prints "Segmentation fault". Only the importance retrieval is affected; fitting and predicting work fine.
Thank you for sharing! Will try to reproduce it.
Hi @shreyaspuducheri23 , could you please share a reproducible example? I tried the following toy example and did not observe a segfault:
from sklearn.datasets import make_multilabel_classification
import xgboost as xgb

# Toy multi-label dataset with the default multi_strategy.
X, y = make_multilabel_classification()
clf = xgb.XGBClassifier()
clf.fit(X, y)
clf.feature_importances_
clf.get_booster().get_score(importance_type='weight')
Hi @trivialfis, the issue arises when using the vector-leaf option:
from sklearn.datasets import make_multilabel_classification
import xgboost as xgb

X, y = make_multilabel_classification(n_classes=2, n_labels=2,
                                      allow_unlabeled=False,
                                      random_state=1)
# multi_output_tree builds trees with vector leaves (one output per label).
clf = xgb.XGBClassifier(multi_strategy='multi_output_tree')
clf.fit(X, y)
clf.feature_importances_  # segfaults here
clf.get_booster().get_score(importance_type='weight')  # also segfaults
Ah, the parameter is still a work in progress. Will implement feature importance after sorting out some current work.
I see, thank you! Do you have an estimated time frame (weeks, months, etc.)? Just wondering whether it's in my best interest to wait for the feature or switch to one-output-per-tree for my current project.
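In the meantime, here is the fallback I'm considering: a minimal sketch using the default one_output_per_tree strategy, which avoids the vector-leaf code path, so both importance accessors work (as in your earlier toy example):

from sklearn.datasets import make_multilabel_classification
import xgboost as xgb

X, y = make_multilabel_classification(n_classes=2, n_labels=2,
                                      allow_unlabeled=False,
                                      random_state=1)
# one_output_per_tree is the default: one tree per target per boosting
# round, with scalar leaves, so the importance code path is unaffected.
clf = xgb.XGBClassifier(multi_strategy='one_output_per_tree')
clf.fit(X, y)
print(clf.feature_importances_)
print(clf.get_booster().get_score(importance_type='weight'))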
Opened a PR to add support for the weight importance type: https://github.com/dmlc/xgboost/pull/10700 . Other types may take some time; I don't have an ETA yet.
If the PR is approved, you can use the nightly build for testing.
@trivialfis, I'm here because of the same issue @shreyaspuducheri23 has. I can see that your last change (#10700) is approved and merged, but I still can't access feature importances properly when I set multi_strategy to multi_output_tree: every feature comes back with an importance of 0.0.
On a separate note, when I set multi_strategy to one_output_per_tree, I get a single 1D array of feature importances even though I have 3 labels. What's going on under the hood? I was expecting feature importances for each label, since three independent models are built.
I would like to work on this
> I was expecting to get feature importance for each label since three different independent models are built.
They were combined to represent the whole model instead of individual models.
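If you need a per-label breakdown in the meantime, a rough sketch is to dump the trees and group split counts by target. Note the assumption here: that one_output_per_tree lays trees out round-robin across targets (tree i belongs to target i % n_labels), mirroring the multi-class layout.

from sklearn.datasets import make_multilabel_classification
import xgboost as xgb

n_labels = 3
X, y = make_multilabel_classification(n_classes=n_labels, n_labels=2,
                                      allow_unlabeled=False, random_state=1)
clf = xgb.XGBClassifier(multi_strategy='one_output_per_tree')
clf.fit(X, y)

# trees_to_dataframe() requires pandas; leaf rows have Feature == 'Leaf'.
df = clf.get_booster().trees_to_dataframe()
splits = df[df['Feature'] != 'Leaf'].copy()

# Assumption: round-robin tree layout across targets.
splits['target'] = splits['Tree'] % n_labels

# Per-label 'weight' importance: number of splits on each feature.
per_label_weight = (splits.groupby(['target', 'Feature'])
                          .size()
                          .unstack(fill_value=0))
print(per_label_weight)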
> I would like to work on this
Thank you for volunteering! Maybe https://github.com/dmlc/xgboost/pull/10700 can be a good start for looking into where it's calculated?
Thanks @trivialfis for your response. When you say they were combined, what combination method is used? Is it the average of the feature importances across all the models, for each feature?
Either the total or the average, depending on the importance type you specified (e.g. total_gain vs. gain).
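For example, with the public get_score API ('gain' is the average gain of a feature's splits, 'total_gain' is the sum over all of its splits), a quick sanity check (tolerance kept loose to allow for float32 rounding):

from sklearn.datasets import make_multilabel_classification
import xgboost as xgb

X, y = make_multilabel_classification(random_state=0)
booster = xgb.XGBClassifier().fit(X, y).get_booster()

avg_gain = booster.get_score(importance_type='gain')          # average gain per split
total_gain = booster.get_score(importance_type='total_gain')  # summed gain
weight = booster.get_score(importance_type='weight')          # split count

# total_gain should equal the average gain times the split count.
for feat, total in total_gain.items():
    assert abs(total - avg_gain[feat] * weight[feat]) <= 1e-4 * max(1.0, total)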
Closing as the original issue was resolved. Further feature coverage will be tracked at https://github.com/dmlc/xgboost/issues/9043 .