DOC - User guide section on dealing with sparse outputs
Even though skrub's tagline is "machine learning with dataframes", users may need to deal with sparse data depending on the transformers they use: see #1513
It's still possible to use DataOps in this situation by setting how="full_frame" in .skb.apply. However, this is not well documented in the docstrings or in the user guide.
One side effect that should be mentioned in the user guide when covering this shows up in a script like the following:
import skrub
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import HashingVectorizer
from skrub.datasets import toy_orders
X = toy_orders().X
X = skrub.X(X)
# how="full_frame" applies the transformer directly to X, so the output stays sparse
X_csr = X.skb.apply(HashingVectorizer(), how="full_frame")
X_pca = X_csr.skb.apply(PCA())
The problem is that concatenating X_pca with a dataframe then fails:
X_pca.skb.concat([X])
It becomes necessary to call X_pca.skb.apply_func(pd.DataFrame) to wrap the result back into something that can be concatenated.
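For the guide, the same pattern can be illustrated without DataOps. The following is only a minimal sketch in plain scikit-learn and pandas, not skrub's API. The toy table and the column names "svd_0" and "svd_1" are made up for illustration, and TruncatedSVD stands in for PCA so the sketch does not rely on PCA accepting sparse input.

```python
import pandas as pd
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical toy table standing in for real data.
df = pd.DataFrame(
    {
        "product": ["red pen", "blue pen", "green cup", "red cup"],
        "quantity": [1, 3, 2, 5],
    }
)

# HashingVectorizer returns a scipy sparse matrix, not a dataframe.
X_csr = HashingVectorizer(n_features=16).fit_transform(df["product"])
assert sparse.issparse(X_csr)

# TruncatedSVD accepts sparse input and returns a dense numpy array.
X_svd = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_csr)

# Wrap the array in a dataframe, reusing the original index so rows line up.
X_svd_df = pd.DataFrame(X_svd, index=df.index, columns=["svd_0", "svd_1"])

# Now the reduced features can be concatenated with the original columns.
combined = pd.concat([df, X_svd_df], axis=1)
```

Passing the bare array to pd.concat instead of X_svd_df raises a TypeError, which is the plain-pandas counterpart of the failure described above.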
The user guide needs to be updated accordingly (afterwards, we can link to that part of the guide in error messages).
We might also want to change from "full_frame" to a different name ("no_wrap"?).
We could also make the other options more similar to the corresponding class names:
X.skb.apply(transformer, how='cols') -> wrap transformer in ApplyToCols
X.skb.apply(transformer, how='frame') -> wrap transformer in ApplyToFrame
X.skb.apply(transformer, how='no_wrap') -> do not wrap transformer, apply directly to X
X.skb.apply(transformer, how='auto') -> decide by inspecting transformer (the default)
So the behavior would stay the same, but with respect to the current names the renaming would be:
columnwise -> cols
subframe -> frame
full_frame -> no_wrap
auto -> auto
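If the rename goes ahead, the old names could keep working for a while. The following is just a sketch of one possible transition, not something decided in this thread: normalize_how is a hypothetical helper, and the choice of FutureWarning is an assumption.

```python
import warnings

# Proposed renaming from the list above: current name -> proposed name.
_RENAMED_HOW = {
    "columnwise": "cols",
    "subframe": "frame",
    "full_frame": "no_wrap",
}
_VALID_HOW = {"auto", "cols", "frame", "no_wrap"}


def normalize_how(how):
    """Hypothetical helper: map a current `how` value to its proposed name."""
    if how in _RENAMED_HOW:
        new = _RENAMED_HOW[how]
        warnings.warn(
            f"how={how!r} is deprecated, use how={new!r} instead.",
            FutureWarning,
            stacklevel=2,
        )
        return new
    if how not in _VALID_HOW:
        raise ValueError(f"how must be one of {sorted(_VALID_HOW)}, got {how!r}")
    return how
```

With this, normalize_how("full_frame") warns and returns "no_wrap", while "auto" passes through unchanged.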
WDYT?
I like the idea, I think it's clearer this way
Though I still think that "frame" (and ApplyToFrame) does not convey the intended use case very well 🤔
I think "subframe" was a bit clearer in that respect, but since the transformer has been renamed, the apply() argument might as well follow.