skrub icon indicating copy to clipboard operation
skrub copied to clipboard

FEAT - Adding `decimal` as parameter for `ToFloat32`

Open gabrielapgomezji opened this issue 1 month ago • 1 comments

This issue addresses #1728

The ToFloat transformer now includes a decimal parameter that lets the user specify the decimal separator to use for the given column. Then, all the possible thousands separators are removed, and the decimal separator is converted to a . before the column is passed to to_float32.

gabrielapgomezji avatar Nov 24 '25 16:11 gabrielapgomezji

After some discussion, I think this PR needs some more time before it can be merged, and unfortunately won't be part of the next release.

The current implementation is removing all thousands separators other than what is specified as the "decimal" separator, which is quite risky and may leads to problems. It's better to follow what pandas is doing, i.e., have both decimal and thousands as separators. By default, the thousands separator should be None (so no replacement).

If there is some kind of weird string like 1,2.3,4, it should not be parsed as a number. I am not sure how far we should do to parse something like 1,2.34 with decimal . and thousands ,: it's not a format I recognize, but it would still be recognized as 12.34 rather than being rejected.

Another check that may be considered is counting the number of decimal separators, and reject any case where there is more than one.

Some additional comments:

  • While it's impossible to test all possible scenarios, tests should also include as many weird edge cases as we can come up with to see what could be the result.
  • The ToFloat docstring needs some more work to explain in more detail the behavior when decimal and thousands are set.

I'll convert this back to draft and keep an eye on this for the next PR.

rcap107 avatar Dec 02 '25 13:12 rcap107

When talking about the tests, it was mentioned to include 3 tests:

  • A test for Good inputs
  • A test for Bad Inputs
  • A test for bad parameters I merged the last two tests including also bada parameters in the test. If it's better to have the 3 tests individually instead of the 2, I will modify it.

gabrielapgomezji avatar Dec 17 '25 11:12 gabrielapgomezji

Thanks a lot for the PR @gabrielapgomezji! This will be very useful for parsing data that is not in the usual locale.

My comments are mostly about improving clarity in the documentation and adding comments in the code. I think the actual content of the PR is in a good shape, it's just a matter of polishing at this point.

rcap107 avatar Dec 17 '25 14:12 rcap107