Expose `OnEachColumn` and `OnSubFrame`

Open Vincent-Maladiere opened this issue 6 months ago • 2 comments

This Stack Overflow question sparked a discussion among the skrub developers about the easiest way to solve it with skrub.

Besides skrub expressions, one solution that came to mind is to use the currently private `skrub._on_each_column.OnEachColumn`:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import skrub  # your friend here
from skrub import selectors as s
from skrub._on_each_column import OnEachColumn

# Two continuous columns followed by two binary (0/1) columns.
X = np.hstack((
    np.random.random((1000, 2)),
    np.random.randint(2, size=(1000, 2)),
))
X = pd.DataFrame(X, columns=list("abcd"))
y = np.random.random(1000)

# Create a column selector with arbitrary logic.
not_binary = s.filter(lambda col: col.nunique() > 2)
preprocessor = OnEachColumn(StandardScaler(), cols=not_binary)
print(preprocessor.fit_transform(X))

# Bring it into a sklearn pipeline
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge

pipe = make_pipeline(preprocessor, Ridge()).fit(X, y)

  1. Should we publicly expose `OnEachColumn` and `OnSubFrame`? Their main use case would be to apply a transformer to columns chosen with selectors.
  2. Should we rename these classes to highlight how they relate to `skb.apply`? For example `Apply` and `ApplyCols`?

Vincent-Maladiere avatar Jun 05 '25 16:06 Vincent-Maladiere

As you know, I'm in favor of exposing it, with the name "ApplyOnCols".

GaelVaroquaux avatar Jun 05 '25 17:06 GaelVaroquaux

skrub meeting suggestion: ApplyToCols or ApplyToColumns

rcap107 avatar Jun 16 '25 08:06 rcap107

We can now answer the Stack Overflow post :)

Vincent-Maladiere avatar Jul 11 '25 13:07 Vincent-Maladiere