Segmentation fault when trying to get feature importance of a multi-label binary classifier
- Operating System: linux
- Python Version: 3.10.14
- XGBoost Version: 2.1.0
I am experiencing a segmentation fault with XGBoost 2.1.0 when accessing feature importances in a multi-label binary classification model. The model trains and predicts as expected; however, when I attempt to retrieve feature importances via either xgb_model.feature_importances_ or xgb_model.get_score(importance_type='weight'), the process crashes. In a Jupyter notebook this kills the kernel, and when run from the terminal it prints "Segmentation fault". Only the importance retrieval is affected; fitting and predicting work fine.
Thank you for sharing! Will try to reproduce it.
Hi @shreyaspuducheri23 , could you please share a reproducible example? I tried the following toy example and did not observe a segfault:
from sklearn.datasets import make_multilabel_classification
import xgboost as xgb

# Toy multi-label dataset with the default multi_strategy.
X, y = make_multilabel_classification()
clf = xgb.XGBClassifier()
clf.fit(X, y)
clf.feature_importances_
clf.get_booster().get_score(importance_type='weight')
Hi @trivialfis, the issue arises when using the vector-leaf option:
from sklearn.datasets import make_multilabel_classification
import xgboost as xgb

X, y = make_multilabel_classification(n_classes=2, n_labels=2,
                                      allow_unlabeled=False,
                                      random_state=1)
# multi_output_tree builds trees with vector leaves (one output per label).
clf = xgb.XGBClassifier(multi_strategy='multi_output_tree')
clf.fit(X, y)
clf.feature_importances_  # segfaults here
clf.get_booster().get_score(importance_type='weight')  # also segfaults
Ah, the parameter is still a work in progress. Will implement feature importance after sorting out some current work.
I see, thank you! Do you have an estimated time frame (weeks, months, etc.)? Just wondering whether it's in my best interest to wait for the feature or switch to one-output-per-tree for my current project.
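In the meantime, here is the fallback I'm considering: a minimal sketch using the default one_output_per_tree strategy, which avoids the vector-leaf code path, so both importance accessors work (as in your earlier toy example):

from sklearn.datasets import make_multilabel_classification
import xgboost as xgb

X, y = make_multilabel_classification(n_classes=2, n_labels=2,
                                      allow_unlabeled=False,
                                      random_state=1)
# one_output_per_tree is the default: one tree per target per boosting
# round, with scalar leaves, so the importance code path is unaffected.
clf = xgb.XGBClassifier(multi_strategy='one_output_per_tree')
clf.fit(X, y)
print(clf.feature_importances_)
print(clf.get_booster().get_score(importance_type='weight'))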
Opened a PR to add support for the weight importance type: https://github.com/dmlc/xgboost/pull/10700 . Other types may take some time; I don't have an ETA yet.
If the PR is approved, you can use the nightly build for testing.
@trivialfis, I'm here because of the same issue @shreyaspuducheri23 has. I can see that your last change (#10700) is approved and merged, but I still can't access feature importances properly when I set multi_strategy to multi_output_tree: every feature comes back with an importance of 0.0.
On a separate note, when I set multi_strategy to one_output_per_tree, I get a single 1D array of feature importances even though I have 3 labels. What's going on under the hood? I was expecting feature importances for each label, since three independent models are built.
I would like to work on this
> I was expecting to get feature importance for each label since three different independent models are built.
They were combined to represent the whole model instead of individual models.
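If you need a per-label breakdown in the meantime, a rough sketch is to dump the trees and group split counts by target. Note the assumption here: that one_output_per_tree lays trees out round-robin across targets (tree i belongs to target i % n_labels), mirroring the multi-class layout.

from sklearn.datasets import make_multilabel_classification
import xgboost as xgb

n_labels = 3
X, y = make_multilabel_classification(n_classes=n_labels, n_labels=2,
                                      allow_unlabeled=False, random_state=1)
clf = xgb.XGBClassifier(multi_strategy='one_output_per_tree')
clf.fit(X, y)

# trees_to_dataframe() requires pandas; leaf rows have Feature == 'Leaf'.
df = clf.get_booster().trees_to_dataframe()
splits = df[df['Feature'] != 'Leaf'].copy()

# Assumption: round-robin tree layout across targets.
splits['target'] = splits['Tree'] % n_labels

# Per-label 'weight' importance: number of splits on each feature.
per_label_weight = (splits.groupby(['target', 'Feature'])
                          .size()
                          .unstack(fill_value=0))
print(per_label_weight)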
> I would like to work on this
Thank you for volunteering! Maybe https://github.com/dmlc/xgboost/pull/10700 can be a good start for looking into where it's calculated?
Thanks @trivialfis for your response. When you say they were combined, what combination method is used? Is it the average of the feature importances across all the models, for each feature?
Either the total or the average, depending on the importance type you specified (e.g. total_gain vs. gain).
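For example, with the public get_score API ('gain' is the average gain of a feature's splits, 'total_gain' is the sum over all of its splits), a quick sanity check (tolerance kept loose to allow for float32 rounding):

from sklearn.datasets import make_multilabel_classification
import xgboost as xgb

X, y = make_multilabel_classification(random_state=0)
booster = xgb.XGBClassifier().fit(X, y).get_booster()

avg_gain = booster.get_score(importance_type='gain')          # average gain per split
total_gain = booster.get_score(importance_type='total_gain')  # summed gain
weight = booster.get_score(importance_type='weight')          # split count

# total_gain should equal the average gain times the split count.
for feat, total in total_gain.items():
    assert abs(total - avg_gain[feat] * weight[feat]) <= 1e-4 * max(1.0, total)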
Closing as the original issue was resolved. Further feature coverage will be tracked at https://github.com/dmlc/xgboost/issues/9043 .