DISCUSSION - Automated parsing of strings that contain only numbers

Open rcap107 opened this issue 2 months ago • 8 comments

I'm opening this issue to discuss the current behavior of the Cleaner and the TableVectorizer when they encounter string columns that contain only values that can be parsed as numbers. By default, the Cleaner leaves the column alone, but when numerical features are converted to float32 (by either the Cleaner or TableVectorizer), string columns are converted silently to numbers.

import pandas as pd
from skrub import TableVectorizer, Cleaner
data = {"a": ["1", "2", "3"]}
df = pd.DataFrame(data)
cleaned = Cleaner().fit_transform(df)
cleaned.dtypes
a    object
dtype: object

Setting numeric_dtype="float32":

cleaned = Cleaner(numeric_dtype="float32").fit_transform(df)
cleaned.dtypes

gives

a    float32
dtype: object

By default, the TableVectorizer converts everything to float32

vec = TableVectorizer().fit_transform(df)
vec.dtypes
a    float32
dtype: object

This can be confusing to the user, especially if they have deliberately converted a numeric column to strings so that it is treated as a categorical column (without actually converting it to a categorical dtype).

I think the automated conversion in its current form is often redundant with the pandas/polars parsers, and if it's done on a dataframe prepared by the user it might be misleading, as IDs may be treated as numeric when they should instead be considered categorical.

My suggestion is that we should expose a parameter that lets the user choose whether to turn on the automatic casting, and that this parameter should be off by default. In this way, it's still possible to access the transformer, but it becomes a conscious choice by the user to do so. The TableVectorizer should still convert to float32 by default, but the automatic parsing of strings that contain numbers should not be done in all cases.

In any case, we should make this behavior very clear in the docstrings of the Cleaner and the TableVectorizer.

rcap107 avatar Oct 16 '25 08:10 rcap107

I agree with you that it should be possible to turn off this automatic string-to-float conversion for some columns in the dataframe. The Cleaner and TableVectorizer apply to the whole df, but I should be able to control which columns this change is applied to.

MarieSacksick avatar Oct 22 '25 16:10 MarieSacksick

> My suggestion is that we should expose a parameter that lets the user choose whether to turn on the automatic casting,

+1

GaelVaroquaux avatar Oct 22 '25 16:10 GaelVaroquaux

> a parameter that lets the user choose whether to turn on the automatic casting

IIRC that is already what the numeric_dtype parameter does. Despite its name, in practice it functions as a boolean: if None there is no casting and if "float32" there is casting, and those are the only options. Perhaps some confusion comes from this parameter, and I think we can simplify by always parsing strings and never converting numbers in the Cleaner.

> use a numeric column converted to string to consider it categorical column

Instead of converting those numbers to string, we can convert to Categorical -- it is simpler than converting to string + setting a Cleaner parameter, it better expresses the intent of the conversion / semantics of the column, and it will be handled as expected by skrub. In the example above change the string conversion to pd.Series([1, 2, 3], dtype="category")
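
A minimal sketch of that conversion, using only pandas. It starts from the string column of the example above rather than from integers, which is an assumption on my side, and it does not run skrub, so how the Cleaner treats the result is not demonstrated here.

```python
import pandas as pd

# ID-like values stored as strings that all happen to look like numbers.
df = pd.DataFrame({"a": ["1", "2", "3"]})

# Declare the intent explicitly: the column is categorical, not numeric.
df["a"] = df["a"].astype("category")

is_categorical = isinstance(df["a"].dtype, pd.CategoricalDtype)
categories = list(df["a"].cat.categories)
print(df.dtypes)
```

The dtype now carries the "this is a category" information, so nothing has to be inferred from the values themselves.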

> it's still possible to access the transformer, but at the same time it should be a conscious choice by the user to do that

Note that much of the value of the Cleaner and TableVectorizer resides in their very simple, one-size-fits-all usage. The automatic casting has been there since the early days of the "supervectorizer", so there is probably some need for it. It happens after replacing strings that look like nulls (e.g. "N/A"), so it can find some numeric columns that may have been missed when loading a file. If we add more parameters, especially ones with complex interactions like numeric_dtype and the one we are considering adding would have, and the default values do not turn on the full functionality, the benefit of the Cleaner over manual cleaning diminishes.
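
To illustrate the ordering described above (null-like strings first, numeric parsing second), here is a rough pandas-only sketch. It is not skrub's implementation: the `NULL_LIKE` list and the `parse_numeric_strings` helper are hypothetical names made up for this example.

```python
import pandas as pd

# Hypothetical list of null-like markers; the list skrub uses may differ.
NULL_LIKE = ["N/A", "NA", "null", ""]


def parse_numeric_strings(column: pd.Series) -> pd.Series:
    """Return a float32 column if all non-null entries parse as numbers."""
    # Step 1: turn null-like strings into real missing values.
    cleaned = column.mask(column.isin(NULL_LIKE))
    # Step 2: try to parse; entries that fail to parse become NaN.
    parsed = pd.to_numeric(cleaned, errors="coerce")
    # Step 3: accept the result only if parsing created no new missing values.
    if parsed.isna().sum() == cleaned.isna().sum():
        return parsed.astype("float32")
    return column


numbers = parse_numeric_strings(pd.Series(["1.5", "N/A", "3"]))
labels = parse_numeric_strings(pd.Series(["id-1", "2"]))
```

Without step 1, the "N/A" entry would count as a parsing failure and the first column would stay a string column, which is the "missed when loading a file" situation.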

> turn off this automatic transformation of string to float for some columns

As you noted, the Cleaner and TableVectorizer are meant to apply to the whole dataframe in a quite automatic way. Adding a parameter for the casting would turn it on or off for all columns, not only some of them. What we could do is add a passthrough_cols parameter to tell the Cleaner to leave some columns alone if we need more custom cleaning (or no cleaning) for those columns. This would be similar to the specific_transformers parameter of the TableVectorizer, which allows overriding the encoding of some specific columns. If we don't add this parameter, the same can be achieved with ApplyToFrame to apply the Cleaner to only part of the dataframe.

jeromedockes avatar Oct 27 '25 07:10 jeromedockes

I think the behavior around handling numbers and numeric strings could be sanitized a bit like this:

  • Columns that are already numeric are always left alone. They are numbers so already "clean", and this way we don't lose information by converting ints to floats. Note that at some point we want to convert everything to float32 to avoid wasting time & memory with float64, but in the Cleaner it is too early to do that: there are still some string and categorical columns, which may produce float64 once encoded. So this conversion is not part of "cleaning" and comes later (and indeed there is a step that does that after encoding in the TableVectorizer). It will become easy to insert that conversion anywhere in a dataop or scikit-learn Pipeline thanks to the public ToFloat32 transformer after https://github.com/skrub-data/skrub/pull/1687
  • We always attempt to parse strings as numbers, in the same way that we also always attempt to parse them as dates or datetimes. If we have a column ["1", "2", "3"] that we don't want converted to numbers, we can either convert it to Categorical, or exclude it from the columns processed by the Cleaner (rather than adding a parameter that would turn casting on or off for all columns). The docstring "Examples" section shows both options.
  • To make it easier to exclude our ["1", "2", "3"] column from cleaning, we could add a parameter for columns to leave unchanged (passthrough_cols or similar) -- similar to the TableVectorizer's specific_transformers -- or show in the docstring examples how to do it with ApplyToFrame (or the exclude_cols parameter of DataOp.skb.apply() if we happen to be using the Cleaner in a DataOp already).
  • The numeric_dtype parameter is deprecated. If we really want to keep it, it is only applied to numeric columns at the end of the cleaner (and is disentangled from whether strings are parsed or not).

jeromedockes avatar Oct 27 '25 07:10 jeromedockes

After some more discussion, I think I agree with @jeromedockes on the approach.

With ToFloat32 becoming public, we could make it more evident that the Cleaner is not going to cast to float32 (and deprecate the numeric_dtype parameter while we're at it).

The current behavior of automatically parsing strings should be maintained, and we should emphasize in both the docs and the user guide that the proper way to treat columns as categorical is to convert them to a categorical dtype rather than to strings.

I am not sure we need a "passthrough_cols" (or whatever the name) parameter when we can use ApplyToFrame instead.

Small side comment: implementing something like #1265 would be very helpful to highlight columns that have been parsed to string.

rcap107 avatar Oct 27 '25 12:10 rcap107

As a follow-up to #1687 (and for clarity in this issue), do we want to remove the numeric_dtype parameter from the Cleaner, and state that converting to float32 should be done explicitly with a scikit-learn pipeline such as make_pipeline(Cleaner(), ToFloat())?

This is unrelated to the automated parsing of floats from strings, and the TableVectorizer would not be touched by this change.

rcap107 avatar Nov 06 '25 10:11 rcap107

This was discussed briefly today, and we didn't reach a decision.

Part of the problem is that the current UI is unclear (something that was already touched on in the issue), but the current behavior of the Cleaner (parsing numerical features, and optionally converting to float32) should be kept.

It should however be possible to turn off the automatic parsing in some way, and this would be done for all columns, so no passthrough (passthrough can be achieved with ApplyToCols anyway). We should use the numeric_dtype parameter to specify the behavior, and rename the values accepted by the argument.

I don't have a very good idea for the parameter names, maybe:

  • "ignore" => numbers with string dtype are ignored and kept as strings, numerical columns keep their dtype
  • "parse" => numbers with string dtype are converted to float32 (current behavior), numerical columns keep their dtype
  • "convert" => same as "parse", but convert numerical columns to float32 (the current behavior when numeric_dtype is set to "float32")
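
To make the three proposed values concrete, here is a rough pandas-only sketch of what each one would do to a dataframe. Nothing in it is skrub API: `apply_numeric_mode` is a hypothetical helper, and the rule used to decide whether a string column "contains only numbers" is an assumption.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype


def apply_numeric_mode(df: pd.DataFrame, mode: str) -> pd.DataFrame:
    """Hypothetical helper mimicking the "ignore"/"parse"/"convert" values."""
    if mode not in ("ignore", "parse", "convert"):
        raise ValueError(f"unknown mode: {mode!r}")
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if is_numeric_dtype(col):
            # Only "convert" touches columns that are already numeric.
            if mode == "convert":
                out[name] = col.astype("float32")
        elif mode != "ignore" and is_string_dtype(col):
            as_numbers = pd.to_numeric(col, errors="coerce")
            # Parse only when every non-null string is a valid number.
            if as_numbers.isna().sum() == col.isna().sum():
                out[name] = as_numbers.astype("float32")
    return out


df = pd.DataFrame({"ids": ["1", "2", "3"], "n": [1, 2, 3], "txt": ["a", "b", "c"]})
ignored_df = apply_numeric_mode(df, "ignore")
parsed_df = apply_numeric_mode(df, "parse")
converted_df = apply_numeric_mode(df, "convert")
```

With "ignore" the ids column stays a string column, with "parse" it becomes float32 while n keeps its integer dtype, and with "convert" both end up as float32.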

In any case, this needs some more discussion before working on the problem.

rcap107 avatar Nov 24 '25 10:11 rcap107

A case in point for adding the option of disabling casting:

If someone is trying to pre-process many tables automatically, and these tables contain numerical IDs that are represented as strings, the Cleaner and TableVectorizer would convert them to numbers automatically, but this is not the desired behavior because then they would become numerical features and throw a wrench in the actual model training.

rcap107 avatar Dec 12 '25 12:12 rcap107