IndexError: index 22093 is out of bounds for axis 0 with size 22093

Open ghost opened this issue 3 years ago • 15 comments

Hi dtreeviz, I'm trying to visualise a decision tree of an XGBoost regressor model using ShadowXGBDTree(). I'm loading a pre-trained XGBoost model. I can visualise using graphviz and pydotplus, and thought ShadowXGBDTree() could be a nice way to access nodes, edges and build graphs about what the xgboost model is doing.

Can you advise what about my data is causing the shadow tree to fail? I can successfully run the titanic and boston datasets using dtreeviz, so I don't think this is a missing library problem; it is probably to do with my data. Do say if I should be posting this somewhere like Stack Overflow.

My code:

    model = XGBRegressor()
    model.load_model(model_file)

    df_X = df.drop("smcl", axis = 1)
    df_y = df["smcl"]
    xgb_shadow_reg = ShadowXGBDTree(booster = model,
                                    tree_index = 1,
                                    x_data = df_X,
                                    y_data = df_y,
                                    feature_names = df_X.columns,
                                    target_name = "smcl")

The error stack is fairly long, this is the last bit:


    IndexError                                Traceback (most recent call last)
    c:...\notebook.ipynb Cell 38' in
          [4] df_X = df.drop("smcl", axis = 1)
          [5] df_y = df["smcl"]
    ----> [7] xgb_shadow_reg = ShadowXGBDTree(booster = model,
          [8]                                 tree_index = 1,
          [9]                                 x_data = df_X,
         [10]                                 y_data = df_y,
         [11]                                 feature_names = df_X.columns,
         [12]                                 target_name = "smcl")

    File ~\anaconda3\envs\project\lib\site-packages\dtreeviz\models\shadow_decision_tree.py:433, in ShadowDecTree._get_tree_nodes.<locals>.walk(node_id, level)
        431 else:  # decision node
        432     left = walk(children_left[node_id], level + 1)
    --> 433     right = walk(children_right[node_id], level + 1)
        434     t = ShadowDecTreeNode(self, node_id, left, right, level)
        435     internal.append(t)

    [... skipping similar frames: ShadowDecTree._get_tree_nodes.<locals>.walk at line 432 (2 times)]

    File ~\anaconda3\envs\project\lib\site-packages\dtreeviz\models\shadow_decision_tree.py:432, in ShadowDecTree._get_tree_nodes.<locals>.walk(node_id, level)
        430     return t
        431 else:  # decision node
    --> 432     left = walk(children_left[node_id], level + 1)
        433     right = walk(children_right[node_id], level + 1)
        434     t = ShadowDecTreeNode(self, node_id, left, right, level)

    File ~\anaconda3\envs\project\lib\site-packages\dtreeviz\models\shadow_decision_tree.py:427, in ShadowDecTree._get_tree_nodes.<locals>.walk(node_id, level)
        426 def walk(node_id, level):
    --> 427     if children_left[node_id] == -1 and children_right[node_id] == -1:  # leaf
        428         t = ShadowDecTreeNode(self, node_id, level=level)
        429         leaves.append(t)

    IndexError: index 22093 is out of bounds for axis 0 with size 22093

Below are some screengrabs to show what my data holds: [screenshots: df_X, df_y]

Thanks for reading!

ghost avatar Feb 17 '22 17:02 ghost

Interesting. This is the second report we've seen with XGBoost. @tlapusan maybe this one helps us track it down? We think it could be related to pruning.

parrt avatar Feb 17 '22 18:02 parrt

Thanks for the reply and the label. I've recently started using xgboost and it's sure weird compared to random forests.

ghost avatar Feb 17 '22 22:02 ghost

Hi @MartyMcPayne,

We had a similar issue reported a few days ago, and we didn't succeed in reproducing it using the titanic dataset.

Could you share your model/dataset with us? That way we could reproduce the bug and try to fix it.

Thanks.

tlapusan avatar Feb 18 '22 06:02 tlapusan

@MartyMcPayne I succeeded in reproducing the issue on my own dataset. It was caused by pruning.

Will try to fix it in the next few days.

tlapusan avatar Feb 18 '22 07:02 tlapusan
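
A minimal sketch of how pruning could produce this exact kind of error. It assumes (the thread does not confirm this) that pruning leaves gaps in the node IDs, so the largest ID can be equal to or greater than the number of nodes. The `nodes` dict and both helper functions are hypothetical and are not dtreeviz's code; only the leaf test and the left/right recursion mirror the `walk` frames in the traceback.

```python
def build_child_arrays(nodes, size):
    """Build children_left / children_right lists of the given size.

    nodes maps node_id -> (left_id, right_id) for a decision node,
    or node_id -> None for a leaf.
    """
    children_left = [-1] * size
    children_right = [-1] * size
    for node_id, children in nodes.items():
        if children is not None:
            children_left[node_id], children_right[node_id] = children
    return children_left, children_right


def collect_nodes(children_left, children_right, root=0):
    """Walk the tree from the root, returning (leaf IDs, decision node IDs)."""
    leaves, internal = [], []

    def walk(node_id):
        if children_left[node_id] == -1 and children_right[node_id] == -1:  # leaf
            leaves.append(node_id)
        else:  # decision node
            walk(children_left[node_id])
            walk(children_right[node_id])
            internal.append(node_id)

    walk(root)
    return leaves, internal


# Hypothetical pruned tree: IDs 3 and 4 are gone, so there are 5 nodes
# but the largest node ID is 6.
pruned = {0: (1, 2), 1: None, 2: (5, 6), 5: None, 6: None}

# Sizing the arrays by node count fails as soon as the walk reaches ID 5.
try:
    collect_nodes(*build_child_arrays(pruned, size=len(pruned)))
    naive_error = None
except IndexError as error:
    naive_error = error

# Sizing by the largest ID + 1 leaves room for the gaps.
leaves, internal = collect_nodes(*build_child_arrays(pruned, size=max(pruned) + 1))
```

Under that assumption, the reported message (index 22093 against size 22093) would be the first node ID that falls just past the end of an array sized by node count.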

@tlapusan that's great that you found the source of the bug! Fantastic.

ghost avatar Feb 18 '22 08:02 ghost

@MartyMcPayne it should work now. Please follow this PR https://github.com/parrt/dtreeviz/pull/176

tlapusan avatar Feb 21 '22 19:02 tlapusan

@tlapusan fantastic. Will check this out tomorrow (gmt) and let you know.

ghost avatar Feb 21 '22 20:02 ghost

@tlapusan @parrt works fine now! Successfully created shadow decision tree from xgboost. Thanks! 👍

ghost avatar Feb 22 '22 11:02 ghost

@MartyMcPayne if you could double-check its structure, that would be awesome.

I tested it and it seems to be fine, but who knows... Thanks.

tlapusan avatar Feb 22 '22 16:02 tlapusan

@tlapusan @parrt Thanks for your work earlier in the week! Using the master dtreeviz library, I got ShadowXGBDTree() to successfully finish, but I get an error when calling trees.dtreeviz() on the ShadowTree object. The function can't successfully return indices for the feature names.

The error stack:

    AttributeError                            Traceback (most recent call last)
    ----> 1 viz = trees.dtreeviz(tree_model = shadowTree, orientation = "LR", depth_range_to_display = (0, 3), precision = 2, colors = {"scatter_marker": "red"}, scale = 2)

    File ~\anaconda3\envs\project\lib\site-packages\dtreeviz\trees.py:817, in dtreeviz(tree_model, x_data, y_data, feature_names, target_name, class_names, tree_index, precision, orientation, instance_orientation, show_root_edge_labels, show_node_labels, show_just_path, fancy, histtype, highlight_path, X, max_X_features_LR, max_X_features_TD, depth_range_to_display, label_fontsize, ticks_fontsize, fontname, title, title_fontsize, colors, scale)
        805     class_split_viz(node, X_data, y_data,
        806                     filename=f"{tmp}/node{node.id}{os.getpid()}.svg",
        807                     precision=precision,
       (...)
        814                     fontname=fontname,
        815                     highlight_node=node.id in highlight_path)
        816 else:
    --> 817     regr_split_viz(node, X_data, y_data,
        818                    filename=f"{tmp}/node{node.id}{os.getpid()}.svg",
        819                    target_name=shadow_tree.target_name,
        820                    y_range=y_range,
        821                    precision=precision,
        822                    X=X,
        823                    ticks_fontsize=ticks_fontsize,
        824                    label_fontsize=label_fontsize,
        825                    fontname=fontname,
        826                    highlight_node=node.id in highlight_path,
        827                    colors=colors)
        829 nname = node_name(node)
        830 if not node.is_categorical_split():

    File ~\anaconda3\envs\project\lib\site-packages\dtreeviz\trees.py:1125, in regr_split_viz(node, X_train, y_train, target_name, filename, y_range, ticks_fontsize, label_fontsize, fontname, precision, X, highlight_node, colors)
       1122 fig, ax = plt.subplots(1, 1, figsize=figsize)
       1123 ax.tick_params(colors=colors['tick_label'])
    -> 1125 feature_name = node.feature_name()
       1127 ax.set_xlabel(f"{feature_name}", fontsize=label_fontsize, fontname=fontname, color=colors['axis_label'])
       1129 ax.set_ylim(y_range)

    File ~\anaconda3\envs\project\lib\site-packages\dtreeviz\models\shadow_decision_tree.py:514, in ShadowDecTreeNode.feature_name(self)
        511 """Returns the feature name used at this node"""
        513 if self.shadow_tree.feature_names is not None:
    --> 514     return self.shadow_tree.feature_names[self.feature()]
        515 return None

    File ~\anaconda3\envs\project_130\lib\site-packages\dtreeviz\models\shadow_decision_tree.py:508, in ShadowDecTreeNode.feature(self)
        505 def feature(self) -> int:
        506     """Returns feature index used at this node"""
    --> 508     return self.shadow_tree.get_node_feature(self.id)

    File ~\anaconda3\envs\project_130\lib\site-packages\dtreeviz\models\xgb_decision_tree.py:78, in ShadowXGBDTree.get_node_feature(self, id)
         76 feature_name = self._get_nodes_values("Feature")[id]
         77 try:
    ---> 78     return self.feature_names.index(feature_name)
         79 except ValueError as error:
         80     return self.__class__.NO_FEATURE

    AttributeError: 'Index' object has no attribute 'index'

Do let me know if this is an error on my side - I'm working with an xgboost model made with python xgboost 1.3.0, back when xgboost did not incorporate named features, so the model's feature names are f0, f1, f2...

Do also let me know if I need to set this as a separate issue and whether you need anything further from me.

Error stack file below error_stack.txt

ghost avatar Feb 27 '22 16:02 ghost
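
The last frame calls `self.feature_names.index(...)`, which is a list method, while the original call passed `feature_names = df_X.columns`, which is a pandas `Index`. A minimal sketch of a possible workaround, assuming the container type is the only problem: convert the names to a plain list before handing them over. `as_name_list` and `lookup_feature` are hypothetical helpers, and the `-2` sentinel is a placeholder standing in for `NO_FEATURE`, not a value taken from dtreeviz.

```python
NO_FEATURE = -2  # placeholder sentinel, assumed for illustration only


def as_name_list(feature_names):
    """Turn any iterable of names (pandas Index, numpy array, tuple) into a list of str."""
    return [str(name) for name in feature_names]


def lookup_feature(feature_names, feature_name):
    """Return the position of feature_name, or NO_FEATURE if it is not listed."""
    try:
        return feature_names.index(feature_name)
    except ValueError:
        return NO_FEATURE


# A tuple stands in for df_X.columns so the example runs without pandas.
columns = ("age", "fare", "pclass")
names = as_name_list(columns)

fare_position = lookup_feature(names, "fare")
missing_position = lookup_feature(names, "f0")  # default-style name not in the list
```

In the code from the first post this would amount to passing `feature_names = list(df_X.columns)`. The `"f0"` lookup also shows what happens when the model's names and the supplied names disagree: the lookup quietly falls back to the sentinel.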

Hi @MartyMcPayne,

I assume the xgboost version is the issue. Could you upgrade the model to a newer version to check? The error occurs because dtreeviz cannot find the feature name in the specified list of features.

Tudor

tlapusan avatar Feb 27 '22 17:02 tlapusan

@tlapusan Hey Tudor, The problem persists with latest xgboost 1.5.2, I checked after I posted. I assigned feature names and feature types to the model json, but still get the error.

I'll check tomorrow with xgboost regressor on Boston dataset, see if that succeeds.

ghost avatar Feb 27 '22 19:02 ghost

Are you using an already saved model? I had the same use case, with features f0, f1, etc., when I saved the model with an older xgboost version and loaded it with a newer one.

tlapusan avatar Feb 27 '22 19:02 tlapusan

Yeah it's an already saved model made with xgboost 1.3.0. I updated to 1.5.2, loaded the old .xgb model, saved as .json, edited the json and pasted the feature_names and feature_type in. Read the json back in but still no success with trees.dtreeviz()

Manually pasting feature_names into the json was a shoddy solution, but I couldn't work out how to assign feature_names to the model in the updated xgboost 1.5.2. I find the xgboost python documentation isn't very helpful.

ghost avatar Feb 27 '22 19:02 ghost
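
One way to avoid editing the json by hand, sketched under the assumption that the model's default names follow the `f<position>` pattern described above and that positions match the column order of the training data: translate the names outside the model. `resolve_default_name` is a hypothetical helper, not part of xgboost or dtreeviz.

```python
import re


def resolve_default_name(feature_name, columns):
    """Map a default name such as 'f3' to columns[3]; return other names unchanged.

    Caveat: a real column that happens to be called 'f3' would also be remapped.
    """
    match = re.fullmatch(r"f(\d+)", feature_name)
    if match is None:
        return feature_name
    position = int(match.group(1))
    if position >= len(columns):
        raise ValueError(f"{feature_name} has no matching column (only {len(columns)} columns)")
    return columns[position]


# Hypothetical column order; in the thread this would come from df_X.
columns = ["age", "fare", "pclass"]

resolved = [resolve_default_name(name, columns) for name in ["f0", "f2", "fare"]]
```

This only helps if the saved model was trained on columns in the same order as the data frame used for visualisation, which the thread does not establish.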

@tlapusan Yeah it's definitely my model, or my data. I used xgboost 1.3.0 with the titanic dataset and successfully visualised the tree.

ghost avatar Feb 28 '22 15:02 ghost

@parrt I think this issue can be closed.

mepland avatar Jan 02 '23 18:01 mepland