
Merge ebm with different subset of variables

Open sadsquirrel369 opened this issue 1 year ago • 11 comments

I am fitting a model with a subset of variables with no interaction present. I now want to fit interactions with a larger subset of variables and merge it with the original model.

The merge ebm method does not allow for this in its current form. Is there not a smart way to build a new model with the components of the two underlying models into a new clean instance?

sadsquirrel369 avatar Jul 24 '24 16:07 sadsquirrel369

Hi @sadsquirrel369 -- This is supported, but is currently a bit more complicated than it should be. In the future we want to support scikit-learn's `warm_start` functionality, which will make this simpler. Today, you need to do the following:

  1. Make a dataframe or numpy array containing a superset of all the features you'll need for both mains and interactions.
  2. Set `interactions=0` and use `exclude` to drop any individual features that you don't want considered in the mains.
  3. Fit the mains model.
  4. Use `exclude` to exclude all mains, and optionally any additional pairs you don't want considered. Set `interactions` either to a number for automatic detection or to a list of the specific interactions. Call `fit` with the `init_score` parameter set to the mains model so that it boosts the pairs on top of the mains.
  5. Call `merge_ebms` on the two EBMs. There are more details to this, which are covered in our docs here: https://interpret.ml/docs/python/examples/custom-interactions.html

paulbkoch avatar Jul 24 '24 18:07 paulbkoch

@paulbkoch Thanks for the prompt reply. So if I exclude variables (via the `exclude` parameter) when fitting the "mains" model, will all of the feature names still appear in `model.feature_names_in_`, irrespective of whether they were excluded?

sadsquirrel369 avatar Jul 24 '24 19:07 sadsquirrel369

Hi @sadsquirrel369 -- Features that are excluded are recorded in the `model.feature_names_in_` attribute, but they are not used for prediction. Anything that is used for prediction is called a "term" in EBMs; if you print `model.term_names_` you'll see a list of everything used for prediction. For some datatypes, like numpy arrays, there are no column names and features are identified by index, so in those cases it's important that the features used in mains and the features used in pairs all come from the same dataset, even if some of them are not used in the model.

paulbkoch avatar Jul 24 '24 20:07 paulbkoch

Thanks for the help!

sadsquirrel369 avatar Jul 25 '24 11:07 sadsquirrel369

Hi @paulbkoch,

When trying to merge the mains model with an interaction model I get this issue:

Inconsistent bin types within a model:

```
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
/var/folders/3b/lp8_hqx917138jd8rxttmzjc0000gn/T/ipykernel_35823/3985698511.py in <module>
----> 1 merge_ebms([loaded_model, loaded_int_model])

/opt/homebrew/Caskroom/miniforge/base/lib/python3.9/site-packages/interpret/glassbox/ebm/merge_ebms.py in merge_ebms(models)
    392 for model in models:
    393     if any(len(set(map(type, bin_levels))) != 1 for bin_levels in model.bins):
--> 394         raise Exception("Inconsistent bin types within a model.")
    395
    396     feature_bounds = getattr(model, "feature_bounds", None)

Exception: Inconsistent bin types within a model.
```

It appears the issue stems from some variables used in the interactions not having bin values in the mains model, because they were excluded there. Merging works correctly when the variables are present in both the mains and interaction models. However, some variables are only beneficial in an interaction and not on their own (for example, in vehicle classification, the combination of weight and power can help identify different vehicle types).

sadsquirrel369 avatar Jul 31 '24 06:07 sadsquirrel369

Hi @sadsquirrel369 -- This is really interesting. It appears you have a model where one of the feature mains is considered a categorical or continuous, but a pair using the same feature is considered to be the opposite. Are you doing any re-merging where you first merge a set of models and then merge that result again with some other models, or is it happening on the first merge when the main and interaction models are combined?

You can probably avoid this error by explicitly setting the feature_types parameter on all calls to the ExplainableBoostingClassifier constructor, thereby ensuring they are identical in all models being merged. This is something we could handle better though within merge_ebms. We can convert a feature from categorical into continuous during merges, but perhaps this isn't completely robust to more complicated scenarios involving pairs.

paulbkoch avatar Aug 02 '24 22:08 paulbkoch

I'm currently encountering this same error when trying to merge two EBMs. I have ~10 features and I'm wondering if there's a streamlined way to specify all of their feature types? I'm getting the inconsistent-bins error on the second merge (basically I'm trying to batch-train an EBM, since my data is larger than what fits in memory). I specify the types via the `feature_types` parameter using the snippet below:

```python
dtypes = [
    'continuous' if d == 'float64'
    else None if d == 'int64'
    else 'ordinal' if col in ordinal_types
    else 'nominal'
    for d, col in zip(X_trn.dtypes, X_trn.columns)
]
```
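For reference, the same mapping can be written as a small helper, which may be easier to audit (pandas only; `ordinal_types` is a set of column names assumed to carry an ordered encoding):

```python
import pandas as pd

def ebm_feature_types(X: pd.DataFrame, ordinal_types: set) -> list:
    """Map pandas dtypes to interpret feature_types, mirroring the
    one-liner above: float64 -> continuous, int64 -> auto (None),
    named ordinal columns -> 'ordinal', everything else -> 'nominal'."""
    types = []
    for col in X.columns:
        d = str(X[col].dtype)
        if d == 'float64':
            types.append('continuous')
        elif d == 'int64':
            types.append(None)          # let interpret auto-detect
        elif col in ordinal_types:
            types.append('ordinal')
        else:
            types.append('nominal')
    return types

X = pd.DataFrame({'a': [1.0, 2.0], 'b': [1, 2], 'c': ['x', 'y']})
print(ebm_feature_types(X, ordinal_types=set()))
# -> ['continuous', None, 'nominal']
```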

And when I try to implement the workaround suggested in issue #576, I still get the same error:

```python
# Copy constructor params from clf1 onto clf (workaround from #576).
for attr, val in clf1.get_params(deep=False).items():
    if not hasattr(clf, attr):
        setattr(clf, attr, val)
```

ANNIKADAHLMANN-8451 avatar Oct 10 '24 15:10 ANNIKADAHLMANN-8451

Hi @ANNIKADAHLMANN-8451 -- Can you verify that the dtypes in the two pandas dataframes match? Pandas auto-infers dtypes too: if one of the datasets has a single sample with a string that can't be represented as a float, the pandas dtypes would mismatch, or alternatively one dataset could get int64 and the other float64. Either condition would cause a mismatch in the `dtypes` variable above, and then in the models.

If that doesn't present a solution, can you please output the value of `ebm.feature_types_in_` for both models and post that here.

Side note 1: From a previous issue on Azure Synapse EBMs, I'm aware that 8451 uses Azure in some capacity. I should probably put on my Microsoft sales hat and mention that Azure does have VMs with 4TB of memory available. It might allow you to avoid batch processing.

Side note 2, which isn't related to this issue: I think you should replace the `'ordinal'` string above with a list that contains the ordered category strings. For ordinals like `["low", "medium", "high"]` the order cannot be inferred, so most of the time you need to specify it. The default for `'ordinal'` is to sort the values alphabetically, but that's rarely what you want. I've recently removed `'ordinal'` from the documentation as an option, and plan to deprecate it at some point.
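Concretely, the suggestion is to pass the ordered category list in place of the `'ordinal'` string (a hypothetical three-column schema for illustration):

```python
# feature_types entry-by-entry; the middle column is ordinal, so its
# categories are listed explicitly in their meaningful order rather
# than relying on alphabetical sorting ('high' < 'low' < 'medium').
feature_types = [
    'continuous',                  # e.g. a price column
    ['low', 'medium', 'high'],     # ordinal: explicit category order
    'nominal',                     # e.g. an unordered color column
]
```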

paulbkoch avatar Oct 13 '24 18:10 paulbkoch

Just to verify, I'm thinking about the conversions correctly (pandas types -> EBM types):

  • float64 -> continuous
  • int64 -> None
  • ordinal -> list of str (in the categories' rank order)
  • object (where order doesn't matter) -> nominal

I don't think I would be able to run a 4TB VM for cost purposes, but I definitely should look into optimizing my cluster! And I just changed those ordinal values accordingly thank you for the side note :)

ANNIKADAHLMANN-8451 avatar Oct 14 '24 16:10 ANNIKADAHLMANN-8451

Your mapping makes sense to me, although interpret is flexible enough to treat floats and ints as nominal/ordinal if you specify that. If you are asking what interpret uses when `'auto'` is specified, the default behavior for EBMs is that float64 and int64 are continuous. For objects or strings, if all the feature values can be converted to floats, they are treated as continuous too. Anything with non-float-representable content is `'nominal'`.
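A rough sketch of that `'auto'` rule (an approximation for illustration, not interpret's actual implementation):

```python
import pandas as pd

def infer_type(col: pd.Series) -> str:
    """Approximate the described 'auto' behavior: float64/int64 are
    continuous; object columns whose values all parse as floats are
    continuous too; anything else is nominal."""
    if str(col.dtype) in ('float64', 'int64'):
        return 'continuous'
    try:
        col.astype(float)
        return 'continuous'
    except (ValueError, TypeError):
        return 'nominal'

df = pd.DataFrame({'a': [1.5, 2.5], 'b': [1, 2],
                   'c': ['3', '4'], 'd': ['red', 'blue']})
print([infer_type(df[c]) for c in df.columns])
# -> ['continuous', 'continuous', 'continuous', 'nominal']
```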

paulbkoch avatar Oct 15 '24 04:10 paulbkoch

That makes sense! Thank you for the details and the timely response. We figured out my bug for now: I was calling merge_ebms() in a for loop, which was causing the error. The fix was to append the EBMs to a list and call merge_ebms() once at the end.

ANNIKADAHLMANN-8451 avatar Oct 15 '24 14:10 ANNIKADAHLMANN-8451