
Interactions

Open mayer79 opened this issue 2 years ago • 6 comments

Fantastic project.

I would love to see the possibility to add interactions on the fly, just like in H2O. There, you can provide a list of interaction pairs or, alternatively, a list of columns to interact pairwise.

This would be especially useful because scikit-learn preprocessing does not let you create dummy encodings for a categorical feature and then multiply them with another feature. (At least not with neat code.)
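
For illustration, the manual workaround currently looks something like this in plain pandas (the column names are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "region": pd.Categorical(["north", "south", "north"]),
    "age": [25.0, 40.0, 31.0],
})

# Manual interaction: dummy-encode the categorical, then multiply each dummy
# column by the numeric feature row-wise.
dummies = pd.get_dummies(df["region"], prefix="region")
interaction = dummies.mul(df["age"], axis=0).add_suffix(":age")
X = pd.concat([df[["age"]], dummies, interaction], axis=1)
```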

mayer79 · Nov 15 '22

@tbenthompson @lbittarello @jtilly Is there any official statement concerning this feature?

From my perspective, the lack of a way to specify interaction terms is the largest blind spot of production-grade GLMs in Python.

lorentzenchr · Jul 28 '23

@MartinStancsicsQC is looking into it in the context of this PR in tabmat. :)

lbittarello · Jul 28 '23

Hey @mayer79, @lorentzenchr, I'd be very interested to hear whether the formula interface proposed in #670 would fit your use cases for specifying interactions. You can also find some info in this tutorial instead of in the PR itself.
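
To make it concrete, usage would look roughly like this (a sketch; details may still change before the PR is merged):

```python
import pandas as pd
from glum import GeneralizedLinearRegressor

df = pd.DataFrame({
    "y": [1.2, 0.7, 3.1, 2.4, 1.8, 2.9],
    "x": [0.5, 1.5, 2.5, 3.5, 1.0, 3.0],
    "group": pd.Categorical(["a", "b", "a", "b", "a", "b"]),
})

# Wilkinson-style formula: "x * group" expands to x + group + x:group,
# i.e. both main effects plus their interaction.
model = GeneralizedLinearRegressor(formula="y ~ x * group")
model.fit(df)
```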

MartinStancsicsQC · Aug 02 '23

I 👍 this. The questions are: is it efficient? (Interactions with dummies generate many zeros.) And: is it safe to load a serialized model and use it to predict on unseen data?

mayer79 · Aug 02 '23

Good points. It should be efficient. For example, in the case of categorical-categorical interactions, it never actually expands them to dummies. The new (categorical) variable representing the interaction is created directly from the category codes.1
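
To illustrate the idea (this is just the concept, not the actual tabmat code):

```python
import numpy as np
import pandas as pd

a = pd.Categorical(["x", "y", "x", "z"])
b = pd.Categorical(["u", "u", "v", "v"])

# One integer code per (level_a, level_b) pair, computed directly from the
# existing codes -- no dummy columns are ever materialized.
combined_codes = a.codes.astype(np.int64) * len(b.categories) + b.codes
interaction = pd.Categorical.from_codes(
    combined_codes,
    categories=[f"{la}:{lb}" for la in a.categories for lb in b.categories],
)
print(interaction)  # ['x:u', 'y:u', 'x:v', 'z:v']
```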

And yes, the model remains pickleable (there is a test for this on the tabmat side), and also keeps track of categorical levels2 so it can still predict correctly if there are missing/unseen levels in the new data.
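
Concretely, I would expect something along these lines to work once the PR lands (a sketch, not tested against the final API):

```python
import pickle
import pandas as pd
from glum import GeneralizedLinearRegressor

train = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0, 2.5, 3.5],
    "g": pd.Categorical(["a", "b", "a", "b", "c", "c"]),
})
model = GeneralizedLinearRegressor(formula="y ~ g").fit(train)

# Round-trip through pickle, then predict on new data in which level "c"
# never occurs: the stored formula remembers the training levels.
model = pickle.loads(pickle.dumps(model))
new = pd.DataFrame({"g": pd.Categorical(["a", "b"])})
print(model.predict(new))
```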


1: More generally, we are not doing a pandas.DataFrame $\xrightarrow[formula]{formulaic.model\_matrix}$ pandas.DataFrame $\xrightarrow[]{tabmat.from\_pandas}$ tabmat.MatrixBase type of multi-step process, but instead use a custom formulaic materializer subclass to perform pandas.DataFrame $\xrightarrow[formula]{tabmat.TabmatMaterializer}$ tabmat.MatrixBase directly, utilizing tabmat's strengths.
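
In code terms, the contrast is roughly the following (the one-step entry point name is my guess at what the PR exposes and may well differ):

```python
import pandas as pd
import tabmat
from formulaic import model_matrix

df = pd.DataFrame({"x": [0.1, 0.2, 0.3], "g": pd.Categorical(["a", "b", "a"])})

# Multi-step: formulaic expands the categorical into dense dummy columns in a
# pandas DataFrame, and only afterwards does tabmat get to pick a representation.
dense = pd.DataFrame(model_matrix("x * g", df))
X_two_step = tabmat.from_pandas(dense)

# Direct: the custom materializer hands each term to tabmat as it is built, so
# categoricals (and their interactions) stay in tabmat's categorical format.
X_direct = tabmat.from_formula("x * g", df)  # hypothetical entry point name
```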

2: This feature is also a bit more general and works with a number of stateful transformations. E.g., if you use the scale function in a formula to normalize your predictors and then predict on new data, the new data will be normalized based on the mean and variance of the training data.
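
Using formulaic directly, the mechanism looks like this (a minimal sketch):

```python
import pandas as pd
from formulaic import model_matrix

train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
new = pd.DataFrame({"x": [10.0]})

mm_train = model_matrix("scale(x)", train)
# The model spec remembers the mean and variance of x from the training data,
# so new data is standardized with the *training* statistics, not its own.
mm_new = mm_train.model_spec.get_model_matrix(new)
print(mm_new)
```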

MartinStancsicsQC · Aug 03 '23

Wow, thanks a lot for the explanations. Really looking forward to this!

mayer79 · Aug 03 '23