
feature_index in plot_decision_regions defaults to (0,1) even if those are filler columns

Open MaxPowerWasTaken opened this issue 7 years ago • 3 comments

If no feature_index is given, it defaults to (0, 1), even when columns 0 and 1 are listed in filler_feature_values. Maybe feature_index should instead default to the non-filler features in this case? (Or vice versa?)

I know the error thrown in this case is admirably helpful and specific, so maybe this isn't worth bringing up...

Reproducible code example of the issue (if it's considered a real 'issue'):

from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
import xgboost as xgb

# Get example data (iris.data has 4 columns)
iris = load_iris()
X = iris.data
y = iris.target

# Fit classifier
clf = xgb.XGBClassifier()
clf.fit(X, y)

# Plot decision boundary region, first two col indices should be ignored
# w/ 'filler' values
arb = 5
filler_feature_values = {0:arb, 1:arb}

fig, ax = plt.subplots()
plot_decision_regions(X, y, clf=clf, filler_feature_values=filler_feature_values, ax=ax)

Result: ValueError: Column(s) [2 3] need to be accounted for in either feature_index or filler_feature_values

MaxPowerWasTaken, Mar 20 '18 23:03

Good point. I don't have a strong preference here, but I think that auto-assigning the remaining columns to feature_index when filler_feature_values is set would be a nice convenience, as would the reverse: auto-assigning filler_feature_values when feature_index is specified.
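A minimal sketch of what that auto-assignment could look like, assuming feature_index is simply every column that has no filler value. The helper name infer_feature_index is hypothetical and is not part of mlxtend:

```python
def infer_feature_index(n_features, filler_feature_values):
    """Hypothetical helper: return the columns that have no filler value.

    These are the columns that would be plotted on the axes.
    """
    remaining = tuple(
        i for i in range(n_features) if i not in filler_feature_values
    )
    # A decision-region plot can only show one or two features at a time
    if len(remaining) not in (1, 2):
        raise ValueError(
            f"Expected 1 or 2 non-filler columns, got {len(remaining)}: {remaining}"
        )
    return remaining


# Iris has 4 columns; columns 0 and 1 are given filler values,
# so columns 2 and 3 are the ones left to plot.
feature_index = infer_feature_index(4, {0: 5, 1: 5})
print(feature_index)
```

Until something like this exists in the library, the same result could be passed by hand through the existing feature_index argument, e.g. feature_index=(2, 3) for the example in the original report.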

rasbt, Mar 21 '18 03:03

I'm not really sure what the issue is here, but plotting an SVM text classifier is a real problem with this library (or with the other libraries I have been exploring). By default, when there are more than 2 or 3 features, however many there are, the remaining values should be filled in automatically with some sort of warning, rather than raising a blocking error.

CognitiveClouds-Prasad, Jul 19 '19 10:07

Thanks for the note. I think the problem is coming up with a good filler value, and I don't think it should be hard-coded. For instance, if someone plots the decision regions on a standardized dataset, 0 as the filler value may make sense. However, if the dataset is an unscaled version of Iris, then 0 would be absolute nonsense.

So maybe we want to use something like the feature "mean" or "median" as the default filler value, I guess. Median may be the safer bet in case the feature is categorical.
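As a rough illustration of the median idea, here is a small sketch that computes a filler value for every column not being plotted. The helper name median_filler_values is hypothetical (it is not an mlxtend function), and it only uses the standard library so it works on nested lists as well as NumPy arrays:

```python
from statistics import median


def median_filler_values(X, feature_index):
    """Hypothetical helper: map each non-plotted column to its median."""
    n_features = len(X[0])
    return {
        j: median(row[j] for row in X)
        for j in range(n_features)
        if j not in feature_index
    }


# Three Iris-like rows; columns 0 and 1 are plotted, 2 and 3 get fillers
X = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [6.3, 3.3, 6.0, 2.5],
]
fillers = median_filler_values(X, feature_index=(0, 1))
print(fillers)  # {2: 1.4, 3: 0.2}
```

The resulting dict has the same shape as the filler_feature_values argument used in the original report. For integer-coded categorical features, statistics.median_low might be a better fit, since the plain median of an even number of samples can fall between two categories.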

rasbt, Jul 19 '19 11:07