StackingCVClassifier fails on pandas DataFrames
I am attempting to use StackingCVClassifier where the base models are sklearn pipelines. The pipelines use sklearn.compose.ColumnTransformer and mlxtend.feature_selection.ColumnSelector. As such, when I call fit(...) I pass in a pandas DataFrame as my X (ColumnTransformer and ColumnSelector allow for named columns).
Calling fit fails with:
/opt/conda/lib/python3.6/site-packages/mlxtend/classifier/stacking_cv_classification.py in fit(self, X, y, groups, sample_weight)
239 except KeyError as e:
240
--> 241 raise KeyError(str(e) + '\nPlease check that X and y'
242 ' are NumPy arrays. If X and y are pandas'
243 ' DataFrames,\ntry passing them as'
KeyError: "'[ 2 4 5 ... 31737 31738 31739] not in index'\nPlease check that X and y are NumPy arrays. If X and y are pandas DataFrames,\ntry passing them as X.values and y.values."
Can you advise on a workaround to be able to call StackingCVClassifier in this case?
It seems that both ColumnSelector and ColumnTransformer allow one to pass in column indices. Thus, instead of
select = ColumnSelector(cols=lgb_cols)
you can do
select = ColumnSelector(cols=[train.columns.get_loc(c) for c in lgb_cols])
You can then pass train.values to StackingCVClassifier, which resolves the issue. One odd thing, though: I find the sklearn pipelines are much slower when run this way.
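For anyone who wants to reuse the name-to-position conversion, here is a minimal sketch in plain Python. The helper name columns_to_positions and the example column names are hypothetical, made up for illustration; only ColumnSelector, lgb_cols, and train come from my setup above. It does the same thing as the get_loc list comprehension, but reports every missing column name at once.

```python
def columns_to_positions(all_columns, selected):
    """Map column names to integer positions, keeping the order of `selected`."""
    # Build a name -> position lookup once instead of searching per column.
    position = {name: i for i, name in enumerate(all_columns)}
    missing = [name for name in selected if name not in position]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return [position[name] for name in selected]


# Hypothetical stand-ins for list(train.columns) and lgb_cols.
train_columns = ["age", "income", "city", "clicks"]
lgb_cols = ["clicks", "age"]

lgb_idx = columns_to_positions(train_columns, lgb_cols)
print(lgb_idx)
```

With a real DataFrame you would pass list(train.columns) as the first argument, build the selector as ColumnSelector(cols=lgb_idx), and fit on train.values.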
I am not completely sure whether this is related, but the recent change made in response to #605 may fix this, since the problem could have been caused by the input checking. If you are not using the latest version from the master branch, maybe try it to see whether the workaround you described is still required.
You can install the latest version from the master branch via
pip install git+https://github.com/rasbt/mlxtend.git