skrub Support Polars dataframes across the library

Currently, we only partially support Polars dataframes. Where they do work, it is mostly because skrub._utils.check_input converts dataframes into numpy arrays via sklearn.utils.validation.check_array.

Moreover, https://github.com/skrub-data/skrub/pull/733 introduced separate Pandas and Polars implementations of operations like aggregation and join. Note that this duplicated logic will be replaced in the medium term by the dataframe consortium standard, as discussed in https://github.com/skrub-data/skrub/discussions/719

The following methods need to be fixed to enable Polars dataframes:

[x] TableVectorizer.get_feature_names_out()
[ ] fuzzy_join()

The following test files need to cover Polars dataframe inputs, at a minimum:

[x] test_deduplicate.py
[x] test_fuzzy_join.py
[x] test_minhash_encoder.py
[x] test_gap_encoder.py
[x] test_similarity_encoder.py
[x] test_table_vectorizer.py
[x] test_datetime_encoder.py
[x] test_fast_hash.py
[x] test_joiner.py

We also need to enable Polars output in TableVectorizer, so that the following works:

tv = TableVectorizer()
tv.set_output(transform="polars")
# X and X_transformed are Polars dataframes
X_transformed = tv.fit_transform(X)

Having Polars output in ColumnTransformer is currently under discussion at https://github.com/scikit-learn/scikit-learn/issues/25896. When made available in ColumnTransformer, this feature will also be available in TableVectorizer directly.

In the meantime, we could create a minimalistic workaround to enable Polars outputs.

This will require:

TableVectorizer.get_feature_names_out() (mentioned above) to be fixed
[x] https://github.com/skrub-data/skrub/pull/761 to be merged

To accomplish this, I suggest that we:

Override set_output in TableVectorizer (it is originally defined in _SetOutputMixin, a parent class of TransformerMixin):
- For Pandas output, nothing changes, we only call super().set_output(transform="pandas")
- For Polars output, we only set a private flag.
During fit, if the flag is set, we call self.column_transformer.set_output(transform="pandas"), then check the flag again after self.column_transformer.fit_transform(X) to convert the output to a Polars dataframe.
We also check the flag in transform and apply the same logic.

Sep 29 '23 15:09 Vincent-Maladiere

I'm working on testing Polars inputs in:

- test_deduplicate.py
- test_fuzzy_join.py
- test_minhash_encoder.py
- test_gap_encoder.py
- test_similarity_encoder.py
- test_table_vectorizer.py
- test_datetime_encoder.py
- test_fast_hash.py
- test_joiner.py

Oct 12 '23 08:10 TheooJ

I wonder if instead of creating separate tests to compare polars to pandas, we should parametrize the existing tests to run them once on pandas dataframes and once on polars dataframes?

Oct 13 '23 11:10 jeromedockes

As is done in this test for the agg joiner, for example.

Oct 13 '23 11:10 jeromedockes

> I wonder if instead of creating separate tests to compare polars to pandas, we should parametrize the existing tests to run them once on pandas dataframes and once on polars dataframes?

Fine with me. Whatever makes the code more natural and readable.

Oct 13 '23 13:10 GaelVaroquaux

All done; the last item was completed in #945.

Jul 25 '24 09:07 TheooJ

Congratulations, this is great!

Maybe add a line to CHANGES.rst saying that Polars support is now complete?

Jul 25 '24 14:07 GaelVaroquaux