glum
Interactions
Fantastic project.
I would love to see the option to add interactions on the fly, just like in H2O. There, you can provide a list of interaction pairs or, alternatively, a list of columns with pairwise interactions.
This would be especially useful because scikit-learn preprocessing does not make it possible to create dummy encodings for a categorical X and then calculate their product with another feature (at least not with neat code).
@tbenthompson @lbittarello @jtilly Is there any official statement concerning this feature?
From my perspective, the inability to specify interaction terms is the largest blind spot of production-grade GLMs in Python.
@MartinStancsicsQC is looking into it in the context of this PR in tabmat. :)
Hey @mayer79, @lorentzenchr, I'd be very interested to hear whether the formula interface proposed in #670 would fit your use cases for specifying interactions. You can also find some info in this tutorial instead of in the PR itself.
I 👍 this. The question is: is it efficient? (Interactions with dummies generate many zeros.) And: is it safe to load a serialized model and use it to predict on unseen data?
Good points. It should be efficient. For example, in the case of categorical-categorical interactions, it never actually expands them to dummies. The new (categorical) variable representing the interaction is created directly from the category codes.¹
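To illustrate the idea (a toy sketch only, not tabmat's actual implementation): an interaction of two categoricals can be represented as a single new categorical whose codes are computed arithmetically from the originals, so dummy columns are never materialized.

```python
import numpy as np
import pandas as pd

a = pd.Categorical(["x", "y", "x", "y"])
b = pd.Categorical(["u", "u", "v", "v"])

# One distinct code per (a, b) pair, derived directly from the original codes.
combined_codes = a.codes.astype(np.int64) * len(b.categories) + b.codes
combined_levels = [f"{la}:{lb}" for la in a.categories for lb in b.categories]
interaction = pd.Categorical.from_codes(combined_codes, categories=combined_levels)
print(list(interaction))  # ['x:u', 'y:u', 'x:v', 'y:v']
```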
And yes, the model remains pickleable (there is a test for this on the tabmat side), and it also keeps track of categorical levels,² so it can still predict correctly if there are missing or unseen levels in the new data.
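As a hedged sketch of what that workflow could look like, assuming the formula interface from #670 lands roughly as proposed (the formula and column names here are purely illustrative):

```python
import pickle
import pandas as pd
from glum import GeneralizedLinearRegressor

train = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0],
    "cat": pd.Categorical(["a", "b", "a", "b"]),
    "x": [0.1, 0.2, 0.3, 0.4],
})
# Two-sided formula: the response is taken from the data frame.
model = GeneralizedLinearRegressor(formula="y ~ cat * x").fit(train)

# Round-trip through pickle, then predict on data with an unseen level "c".
model = pickle.loads(pickle.dumps(model))
new = pd.DataFrame({"cat": pd.Categorical(["a", "c"]), "x": [0.5, 0.6]})
model.predict(new)  # unseen levels are handled via the stored training levels
```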
¹: More generally, we are not doing a multi-step process of the form

$$\texttt{pandas.DataFrame} \xrightarrow[\text{formula}]{\texttt{formulaic.model\_matrix}} \texttt{pandas.DataFrame} \xrightarrow{\texttt{tabmat.from\_pandas}} \texttt{tabmat.MatrixBase},$$

but instead use a custom formulaic materializer subclass to perform

$$\texttt{pandas.DataFrame} \xrightarrow[\text{formula}]{\texttt{tabmat.TabmatMaterializer}} \texttt{tabmat.MatrixBase}$$

directly, utilizing tabmat's strengths.
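In code, the direct route might look like the following. This is an assumption about how the tabmat PR exposes the materializer; I'm guessing at an entry point along the lines of `tabmat.from_formula`:

```python
import pandas as pd
import tabmat

df = pd.DataFrame({
    "cat1": pd.Categorical(["a", "b", "a"]),
    "cat2": pd.Categorical(["u", "u", "v"]),
    "x": [1.0, 2.0, 3.0],
})
# Materialize the formula straight into a tabmat matrix, with no
# intermediate dense pandas model matrix.
X = tabmat.from_formula("cat1 * cat2 + x", df)
print(type(X))  # a tabmat.MatrixBase subclass, e.g. SplitMatrix
```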
²: This feature is also a bit more general and works with a number of stateful transformations. E.g., if you use the scale function in a formula to normalize your predictors and then predict on new data, the latter will be normalized based on the mean and variance of the training data.
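The stateful behavior itself can be demonstrated with plain formulaic, which is where these transforms come from (the column name here is illustrative):

```python
import pandas as pd
from formulaic import model_matrix

train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
# scale() records the training mean and standard deviation in the model spec.
mm_train = model_matrix("scale(x)", train)

new = pd.DataFrame({"x": [10.0]})
# Reusing the stored spec normalizes new data with the *training* statistics.
mm_new = mm_train.model_spec.get_model_matrix(new)
print(mm_new)
```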
Wow, thanks a lot for the explanations. Really looking forward to this!