MAINT - Ensure consistency of `CleanCategories` and `ToStr` with Pandas 3.0

Open · rcap107 opened this issue 1 month ago · 1 comment

Pandas 3.0 changes how string data is stored: columns that previously defaulted to `object` dtype now use a new dedicated string dtype (see here).

Some transformers (`CleanCategories`, `ToStr`) convert series to `object` dtype to ensure compatibility with scikit-learn. Specifically, `pd.NA` is problematic for scikit-learn transformers, so it makes sense for the skrub transformers to convert to `object`.

The problem is that this behavior changes with pandas 3.0. On one hand, scikit-learn transformers should have fewer issues with pandas datatypes. On the other hand, we need to decide how to handle the fact that behavior now differs depending on which versions of the packages are installed. `CleanCategories` and `ToStr` proactively convert series to `object`, but the current implementation does not account for some of the changes in pandas 3.0, so both need to be updated.
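To make the inconsistency concrete, here is a minimal sketch of the kind of normalization being discussed. It is not the skrub implementation: `as_object_with_nan` is a hypothetical helper, and it assumes the goal is an `object` column whose missing values are `np.nan` rather than `pd.NA`, whichever dtype pandas inferred.

```python
import numpy as np
import pandas as pd

# The inferred dtype of a string column depends on the installed pandas:
# `object` before 3.0, the new string dtype from 3.0 onwards.
inferred = pd.Series(["a", "b", None])
print(pd.__version__, inferred.dtype)


def as_object_with_nan(series: pd.Series) -> pd.Series:
    """Hypothetical helper (not part of skrub).

    Cast to object dtype and replace every missing value, including
    pd.NA, with np.nan so that downstream estimators never see pd.NA.
    """
    out = series.astype(object)
    return out.where(out.notna(), np.nan)


# An explicitly nullable string column, which uses pd.NA for missing values.
nullable = pd.Series(["a", pd.NA], dtype="string")
converted = as_object_with_nan(nullable)
print(converted.dtype)
```

Whether the transformers should keep forcing `object` on pandas 3.0, or pass the new string dtype through, is exactly the decision this issue leaves open.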

I noticed this while working on #1768, because some of the doctests are failing. I haven't yet checked whether the problem is limited to doctests, but in any case we need to account for it.

rcap107 · Nov 24 '25 15:11

For the moment, the plan is to skip the failing tests in #1768 so that the other fixes in that PR can be merged, and to address this issue in a separate PR, which should be merged before the next release.

Tests are skipped in commit e0f0e3e

rcap107 · Nov 24 '25 15:11