Make pandas an optional dependency
At the moment, pandas is a required dependency of skrub because some specific features rely on it (datetime parsing, for one).
However, this also means that pyarrow is required to handle parquet files, whereas polars users need neither pandas nor pyarrow.
We should try to make pandas an optional requirement and let the users pick the backend they prefer.
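As a rough illustration of what "optional" could mean in practice, here is a minimal sketch that detects which dataframe libraries are installed without importing them. The helper names (`is_installed`, `available_backends`, `pick_backend`) are hypothetical and not part of skrub; only the standard-library `importlib.util.find_spec` is used.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def available_backends(candidates=("pandas", "polars")) -> list[str]:
    """List the candidate dataframe libraries present in this environment."""
    return [name for name in candidates if is_installed(name)]


def pick_backend(preferred=None, candidates=("pandas", "polars")) -> str:
    """Return the preferred backend if installed, else the first one found.

    Hypothetical helper: raises ImportError when nothing usable is installed.
    """
    found = available_backends(candidates)
    if preferred is not None:
        if preferred not in found:
            raise ImportError(f"{preferred!r} was requested but is not installed")
        return preferred
    if not found:
        raise ImportError(f"none of {candidates} is installed")
    return found[0]
```

With something like this, code paths that need pandas could be guarded behind `is_installed("pandas")` instead of importing it unconditionally at module level.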
Of course, this would be more of a long-term project, since we'll need to re-implement all features in such a way that they no longer depend on pandas, which includes all the tests of the TableVectorizer (#1441).
- [ ] Rework the function we use to compute the Pearson correlation when using polars (at the moment it converts back to pandas)
- [ ] Refactor TableVectorizer tests so that they are not pandas-specific #1441
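For the first item, one backend-neutral option is to pull each numeric column out as a plain sequence (for instance with `Series.to_list()` in polars or `Series.tolist()` in pandas) and compute the correlation without going through a pandas dataframe. The sketch below is pure Python and only illustrates the idea; `pearson` and `correlation_matrix` are hypothetical names, missing values are not handled, and a real implementation would more likely rely on numpy or native polars expressions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Returns NaN when the correlation is undefined (fewer than two
    values, or one of the sequences is constant).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        return math.nan
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centered products; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    if denom == 0:
        return math.nan
    return cov / denom


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a mapping of name -> sequence."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}
```

Because the input is just a mapping of column names to sequences, the same function works regardless of which dataframe library produced the columns.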
> Of course, this would be more of a long-term project, since we'll need to re-implement all features in such a way that they no longer depend on pandas, which includes all the tests of the TableVectorizer.
Yes, this is definitely a long-term project, not something that I see high up the priority list.
An important intermediate step would be to avoid direct conversions of full dataframes between pandas and polars, because those require pyarrow. In particular, since the addition of the Pearson correlation, we cannot run a TableReport on a polars dataframe if pyarrow is not installed (and since it is an optional dependency, for many users it will not be).
This is also a problem when using the DatetimeEncoder on a polars dataframe.