skb.apply does not support some scikit-learn estimators
Describe the bug
`skb.apply`, and thus expressions (data plan, data ops), do not allow plugging in standard scikit-learn vectorizers such as CountVectorizer, HashingVectorizer, TfidfVectorizer, etc., because they output a sparse matrix while expressions expect a dataframe.
There is currently no way to use `set_output(transform="pandas")` on these transformers, and I don't think it will be possible anytime soon in scikit-learn (https://github.com/scikit-learn/scikit-learn/discussions/22377).
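For illustration, calling it anyway fails at transform time on recent scikit-learn versions (the exact error message may vary across versions):

from sklearn.feature_extraction.text import CountVectorizer

enc = CountVectorizer().set_output(transform="pandas")
# Raises ValueError because the output is a scipy sparse matrix,
# which pandas output does not support.
enc.fit_transform(["a b", "b c"])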
Steps/Code to Reproduce
import skrub
df = skrub.toy_orders().X
x = skrub.var("x", df)
# Works:
x.skb.apply(skrub.MinHashEncoder(), cols=["product"])
# Works (numeric transformer)
from sklearn.preprocessing import StandardScaler
x.skb.apply(StandardScaler(), cols=["quantity"])
# Works (called directly, outside expressions)
from sklearn.feature_extraction.text import HashingVectorizer
enc = HashingVectorizer()
enc.fit_transform(df["product"])
# Doesn't work (inside expressions)
x.skb.apply(HashingVectorizer(), cols=["product"])
Expected Results
Support for CountVectorizer, HashingVectorizer, TfidfTransformer so we can use them in expressions.
Actual Results
TypeError: HashingVectorizer.fit_transform returned a result of type csr_matrix, but a pandas DataFrame was expected. If HashingVectorizer is a custom transformer class, please make sure that the output is a pandas container when the input is a pandas container. One way of enabling a transformer to output pandas DataFrames is inheriting from the sklearn.base.TransformerMixin class and defining the 'get_feature_names_out' method. See https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_set_output.html for details.
Versions
Latest version (0.6.dev0)
Good point.
We could do something about this with code (somewhere, I don't know where yet) that converts the output of these transformers into output acceptable for our pipeline (ideally pandas dataframes, using `get_feature_names_out` for the column names).
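For the record, a minimal sketch of what such a converter could look like (DenseOutputWrapper is a name invented here, not skrub API, and it assumes the output fits in memory once densified):

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DenseOutputWrapper(BaseEstimator, TransformerMixin):
    # Hypothetical wrapper: densify a sparse output into a dataframe.
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def transform(self, X):
        out = self.estimator.transform(X)
        # Use get_feature_names_out for column names when available
        # (HashingVectorizer, for instance, has no vocabulary to name them from).
        names = None
        if hasattr(self.estimator, "get_feature_names_out"):
            names = self.estimator.get_feature_names_out()
        return pd.DataFrame(out.toarray(), columns=names,
                            index=getattr(X, "index", None))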
The drawback is that converting a sparse matrix to a dense one can lead to an absolute explosion of memory usage, so maybe it is a bad idea after all.
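Back-of-the-envelope, with the default HashingVectorizer width of 2**20 features:

n_rows, n_cols = 100_000, 2**20  # default HashingVectorizer n_features
dense_bytes = n_rows * n_cols * 8  # float64
print(f"{dense_bytes / 1e9:.0f} GB")  # ~839 GB for the dense copy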
What do people think?
I think we should allow sparse output and input in the ApplyToFrame, without dense conversion; if the underlying estimator or logic fails, so be it.
My point of view is that all features available in regular sklearn pipelines should be available in data ops as well.
Curious to hear the thoughts of the other maintainers.
If you change the last line to this:
vectorized = x.skb.apply(HashingVectorizer(), how="full_frame")
it works and returns a sparse matrix. The name "full_frame" should probably be changed to something clearer, but it means "not managed by skrub, not wrapped in ApplyToFrame nor ApplyToCols": in this mode the input is passed as-is to the estimator and the estimator's output is returned without transformation. It is the default when the input is not a dataframe. It does not allow setting `cols`. A fuller example:
>>> import skrub
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> df = skrub.toy_orders().X
>>> x = skrub.var("x", df)
As the input is a dataframe we need to specify `how`; otherwise the default is ApplyToFrame:
>>> vectorized = x.skb.apply(HashingVectorizer(), how="full_frame")
>>> vectorized
<Apply HashingVectorizer>
Result:
―――――――
<Compressed Sparse Row sparse matrix of dtype 'float64'
with 4 stored elements and shape (4, 1048576)>
Here, as the input is a sparse matrix (not a dataframe), there is no need to specify anything; it will automatically not be wrapped in ApplyToFrame nor ApplyToCols:
>>> standardized = vectorized.skb.apply(StandardScaler(with_mean=False))
>>> standardized
<Apply StandardScaler>
Result:
―――――――
<Compressed Sparse Row sparse matrix of dtype 'float64'
with 4 stored elements and shape (4, 1048576)>
Nice hidden feature! I'm kinda afraid that even advanced users will struggle to find it when they need it, though.
Do you think we could hack this a little bit by trying to detect a few data container types we know are problematic? If we detect that the input data X is a sparse array or sparse matrix, we use the "full_frame" logic. This means `apply` would default to something like how="auto", and then dispatch in _wrap_estimator, for example. WDYT?
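Roughly something like this (a sketch of the idea only; _wrap_estimator's real signature and internals in skrub differ):

from scipy import sparse

def _wrap_estimator(estimator, X, how="auto"):
    # Sketch: dispatch on the input container type.
    if how == "auto":
        # Sparse input: skip the dataframe wrappers entirely.
        how = "full_frame" if sparse.issparse(X) else "columnwise"
    ...  # existing wrapping logic (ApplyToFrame / ApplyToCols / pass-through)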
As you can see with the standardizer above, when X is not a dataframe it already goes to the "full_frame" mode. The problem is when the input X is a normal dataframe but the output of the estimator is sparse; at that point it blows up in ApplyToCols when it tries to convert to a dataframe.
What we could do is, in ApplyToCols, when the output is a sparse matrix and `cols` is all the columns in the dataframe, produce a warning and return the sparse matrix without converting to a dataframe.
It might make some other situations, such as forgetting to request dense output (e.g. `sparse_output=False`) on a one-hot encoder, a bit harder to debug though (when we want the next step to get a dataframe).
Or ApplyToCols raises a specific exception like SparseOutputError that the data op can recognize, and the data op shows a tailored error message saying to use the `how` parameter.
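Sketching both options side by side (SparseOutputError and the surrounding logic are hypothetical, not current skrub code):

import warnings
from scipy import sparse

class SparseOutputError(TypeError):
    # Hypothetical exception the data op could catch to show a
    # tailored message suggesting how="full_frame".
    pass

def _check_sparse_output(output, cols, all_columns):
    # Hypothetical hook inside ApplyToCols, after the inner estimator ran.
    if not sparse.issparse(output):
        return output
    if set(cols) == set(all_columns):
        # Option 1: warn and pass the sparse matrix through unconverted.
        warnings.warn("Estimator returned a sparse matrix; "
                      "skipping dataframe conversion.")
        return output
    # Option 2: raise a recognizable error for the data op to rephrase.
    raise SparseOutputError('Sparse output on a column subset; try how="full_frame".')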
Thanks for the details. I'd rather raise a warning than an error.