DOC - User guide section on dealing with sparse outputs

Open rcap107 opened this issue 3 months ago • 3 comments

Even though skrub's tagline is "machine learning with dataframes", users may need to deal with sparse data depending on the transformers they use: see #1513

It's still possible to use DataOps in this situation by setting how="full_frame" in .skb.apply; however, this is not well documented in either the docstrings or the user guide.

One side effect that the user guide should mention when covering this shows up in a script like the following:

import skrub
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import HashingVectorizer
from skrub.datasets import toy_orders

X = toy_orders().X
X = skrub.X(X)
# how="full_frame" applies the transformer directly to X, without wrapping it
X_csr = X.skb.apply(HashingVectorizer(), how="full_frame")
X_pca = X_csr.skb.apply(PCA())

The problem is that concatenating X_pca with a dataframe then fails:

X_pca.skb.concat([X])

It then becomes necessary to call X_pca.skb.apply_func(pd.DataFrame) to wrap the result back into a dataframe that can be concatenated.
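
To see why the wrapping is needed, here is a minimal sketch of the same sequence in plain scikit-learn and pandas, without DataOps. The toy dataframe, its column names and the pca_0/pca_1 labels are made up for illustration, and the hashed matrix is densified before PCA only so that the sketch does not depend on the installed scikit-learn version accepting sparse input.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical stand-in for the toy dataset used above.
df = pd.DataFrame(
    {"product": ["pen", "cup", "pen", "spoon"], "quantity": [2, 3, 1, 5]}
)

# HashingVectorizer returns a scipy sparse matrix, not a dataframe.
X_csr = HashingVectorizer(n_features=8).fit_transform(df["product"])
assert sparse.issparse(X_csr)

# PCA returns a plain numpy array: no column names, no index.
X_pca = PCA(n_components=2).fit_transform(X_csr.toarray())
assert isinstance(X_pca, np.ndarray)

# Wrap the array back into a dataframe before concatenating.
X_pca_df = pd.DataFrame(X_pca, index=df.index, columns=["pca_0", "pca_1"])
combined = pd.concat([df, X_pca_df], axis=1)
```

In the DataOps version, X_pca.skb.apply_func(pd.DataFrame) plays the role of the explicit pd.DataFrame(...) call here.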

The user guide needs to be updated accordingly (afterwards, we can link to that part of the guide in error messages).

We might also want to change from "full_frame" to a different name ("no_wrap"?).

rcap107 Sep 17 '25 14:09

We might also want to change from "full_frame" to a different name ("no_wrap"?).

and we can make the other options more similar to the corresponding class names:

X.skb.apply(transformer, how='cols') -> wrap transformer in ApplyToCols
X.skb.apply(transformer, how='frame') -> wrap transformer in ApplyToFrame
X.skb.apply(transformer, how='no_wrap') -> do not wrap transformer, apply directly to X
X.skb.apply(transformer, how='auto') -> decide by inspecting transformer (the default)

so the behavior would be the same, but relative to the current names the renaming would be:

columnwise -> cols
subframe -> frame
full_frame -> no_wrap
auto -> auto
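
If the rename goes ahead, the old spellings could keep working for a while through a small compatibility shim. The sketch below is purely illustrative: normalize_how is a hypothetical helper, not an existing skrub function, and the mapping simply mirrors the list above.

```python
import warnings

# Old -> new names as proposed in this thread (an assumption, not skrub's
# actual behavior).
_RENAMED_HOW = {
    "columnwise": "cols",
    "subframe": "frame",
    "full_frame": "no_wrap",
}
_VALID_HOW = {"auto", "cols", "frame", "no_wrap"}


def normalize_how(how):
    """Hypothetical helper: map a deprecated `how` value to its new name."""
    if how in _RENAMED_HOW:
        new_name = _RENAMED_HOW[how]
        warnings.warn(
            f"how={how!r} is deprecated, use how={new_name!r} instead",
            FutureWarning,
            stacklevel=2,
        )
        return new_name
    if how not in _VALID_HOW:
        raise ValueError(
            f"how must be one of {sorted(_VALID_HOW)}, got {how!r}"
        )
    return how
```

With such a shim, existing calls that pass how="full_frame" would keep running and only emit a FutureWarning pointing to "no_wrap".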

WDYT?

jeromedockes Sep 17 '25 20:09

I like the idea, I think it's clearer this way

Though I still think that "frame" (and ApplyToFrame) does not convey the intended use case very well 🤔

rcap107 Sep 18 '25 07:09

Though I still think that "frame" (and ApplyToFrame) does not convey the intended use case very well 🤔

I think "subframe" was a bit clearer in that respect, but since the transformer has been renamed, the apply() argument might as well follow suit.

jeromedockes Sep 22 '25 20:09