Bug when training EBM during cross-validation
Hello,
First off, thank you so much for this awesome project.
I have a bug to report: I've simply modified the provided example script to turn it into a cross-validation scenario:
import pandas as pd
from sklearn.model_selection import KFold
from interpret.glassbox import ExplainableBoostingClassifier


def bug():
    # Load the UCI Adult census income dataset
    df = pd.read_csv(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
        header=None)
    columns = [
        "Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
        "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
        "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"
    ]
    df.columns = columns
    train_cols = df.columns[0:-1]
    label = df.columns[-1]
    X = df[train_cols].to_numpy()
    y = df[label].apply(lambda x: 0 if x == " <=50K" else 1).to_numpy()
    seed = 1
    kf = KFold(n_splits=5)
    # A single EBM instance is reused across all five folds
    ebm = ExplainableBoostingClassifier(random_state=seed, n_jobs=-1)
    for train_idx, test_idx in kf.split(X):
        x_train = X[train_idx]
        y_train = y[train_idx]
        ebm.fit(x_train, y_train)
        print('Train Finished')
The code above trains the EBM successfully during the first fold. However, the subsequent folds fail to train:
Traceback (most recent call last):
File "ebm_experiment.py", line 156, in <module>
bug()
File "ebm_experiment.py", line 151, in bug
ebm.fit(x_train, y_train)
File "/home/andrew/venv_bertrade/lib/python3.8/site-packages/interpret/glassbox/ebm/ebm.py", line 823, in fit
self.preprocessor_.fit(X)
File "/home/andrew/venv_bertrade/lib/python3.8/site-packages/interpret/glassbox/ebm/ebm.py", line 186, in fit
schema = autogen_schema(
File "/home/andrew/venv_bertrade/lib/python3.8/site-packages/interpret/utils/all.py", line 374, in autogen_schema
X = pd.DataFrame(X, columns=feature_names)
File "/home/andrew/venv_bertrade/lib/python3.8/site-packages/pandas/core/frame.py", line 440, in __init__
mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
File "/home/andrew/venv_bertrade/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 213, in init_ndarray
return create_block_manager_from_blocks(block_values, [columns, index])
File "/home/andrew/venv_bertrade/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1681, in create_block_manager_from_blocks
mgr = BlockManager(blocks, axes)
File "/home/andrew/venv_bertrade/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 143, in __init__
self._verify_integrity()
File "/home/andrew/venv_bertrade/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 347, in _verify_integrity
raise AssertionError(
AssertionError: Number of manager items must equal union of block items
# manager items: 24, # tot_items: 14
However, if I simply move the line
ebm = ExplainableBoostingClassifier(random_state=seed, n_jobs=-1)
into the for loop and instantiate a fresh EBM each time, the problem goes away and I am able to train all 5 folds.
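For reference, this is the version of the loop that works (the same code as above, with only the constructor moved inside the loop):

kf = KFold(n_splits=5)
for train_idx, test_idx in kf.split(X):
    # Workaround: instantiate a fresh EBM for every fold
    ebm = ExplainableBoostingClassifier(random_state=seed, n_jobs=-1)
    ebm.fit(X[train_idx], y[train_idx])
    print('Train Finished')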
I am using Python 3.8.5 and pandas 0.25.3.
I'm happy to provide any additional information that may be useful!
*** edit ***
I've narrowed down the problem.
At the point in fit() where the input data is unified, self.feature_names has length 14 during the first fold, as expected:
(Pdb) self.feature_names
['Age', 'WorkClass', 'fnlwgt', 'Education', 'EducationNum', 'MaritalStatus', 'Occupation', 'Relationship', 'Race', 'Gender', 'CapitalGain', 'CapitalLoss', 'HoursPerWeek', 'NativeCountry']
(Pdb) len(self.feature_names)
14
However, during the second fold, self.feature_names has length 24, with additional features carried over from the prior run, which appear to be the pairwise interaction terms (note that 24 and 14 are exactly the counts reported in the AssertionError above):
(Pdb) len(self.feature_names)
24
(Pdb) self.feature_names
['Age', 'WorkClass', 'fnlwgt', 'Education', 'EducationNum', 'MaritalStatus', 'Occupation', 'Relationship', 'Race', 'Gender', 'CapitalGain', 'CapitalLoss', 'HoursPerWeek', 'NativeCountry', 'Relationship x HoursPerWeek', 'Age x Relationship', 'MaritalStatus x HoursPerWeek', 'Occupation x Relationship', 'Relationship x CapitalLoss', 'fnlwgt x Occupation', 'EducationNum x Occupation', 'Age x CapitalLoss', 'Occupation x HoursPerWeek', 'fnlwgt x Education']
My guess is that self.feature_names is set during the first fold and then passed as the input to unify_data() during the second fold, though I have not verified this. An in-line comment in the source also seems related to what I've described.
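A quick way to see this without a debugger (a minimal sketch using the same x_train/y_train as above, assuming the fitted attribute is exposed as ebm.feature_names, as in the pdb session):

ebm = ExplainableBoostingClassifier(random_state=seed, n_jobs=-1)
ebm.fit(x_train, y_train)
# After the first fit, the 10 discovered interaction terms have been
# appended to the 14 input feature names
print(len(ebm.feature_names))  # 24

# The second fit then sees 24 feature names for 14 data columns,
# which triggers the pandas AssertionError shown above
ebm.fit(x_train, y_train)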
I hope this is helpful!
Hi @andrewjylee,
Thanks for bringing this up and looking into it so carefully! The main issue is that we estimate and include the interaction terms for EBMs during the fit phase of the algorithm, and modify structures like self.feature_names and self.feature_types based on the calculated results. As you've noticed, this makes it difficult to support "re-fitting" of EBMs that include interaction terms with the scikit-learn API.
The main workaround right now is to do exactly what you did, and re-initialize a fresh EBM on each iteration of your loop. Initialization should be fairly cheap, so hopefully this can work for your use case without too much of a performance hit. We're exploring some options to change this behavior in the future (like potentially splitting feature_names into feature_names_in and feature_names_out to account for the dynamic nature of interaction terms), but it might take some time for us to figure out the right API design for this.
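For example, if you want to keep the hyperparameters defined in one place, scikit-learn's clone helper can build the fresh copy for you on each fold (a minimal sketch, assuming the estimator's get_params()/set_params() round-trip works as it does for standard scikit-learn-compatible estimators):

from sklearn.base import clone

base_ebm = ExplainableBoostingClassifier(random_state=seed, n_jobs=-1)
for train_idx, test_idx in kf.split(X):
    # clone() returns a new, unfitted estimator with the same
    # hyperparameters, so every fold starts from a clean state
    ebm = clone(base_ebm)
    ebm.fit(X[train_idx], y[train_idx])

This per-fold cloning is also what scikit-learn's own cross-validation utilities (e.g. cross_val_score) do internally, which is why they avoid this issue.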
We'll leave this issue open to track progress on this and keep the discussion open. Thanks!
-InterpretML Team
This should be resolved in our latest v0.3.0 release. Please re-open this issue if the behavior persists.