skrub icon indicating copy to clipboard operation
skrub copied to clipboard

ToFloat fails when trying to parse numbers with "," decimal separators

Open gabrielapgomezji opened this issue 1 month ago • 6 comments

Describe the bug

I am trying to convert to float a string number with comma (",") decimal separators. I am trying to use ToFloat but it is failing to parse.

Steps/Code to Reproduce

>>> import skrub
>>> import pandas as pd
>>> tf = skrub.ToFloat()
>>> s = pd.Series(["1,1", "2,2"])
>>> s
0    1,1
1    2,2
dtype: object
>>> tf.fit_transform(s)

Expected Results

>>> tf.fit_transform(s)
0    1.1
1    2.2

Actual Results

Traceback (most recent call last):
  File "pandas/_libs/lib.pyx", line 2407, in pandas._libs.lib.maybe_convert_numeric
ValueError: Unable to parse string "1,1"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/Shared/work/skrub/skrub/_to_float.py", line 194, in fit_transform
    numeric = sbd.to_float32(column, strict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/work/.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/functools.py", line 909, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Shared/work/skrub/skrub/_dataframe/_common.py", line 784, in _to_float32_pandas
    col = pd.to_numeric(col, errors="raise" if strict else "coerce")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/work/venvs/skrub-dev/lib/python3.12/site-packages/pandas/core/tools/numeric.py", line 235, in to_numeric
    values, new_mask = lib.maybe_convert_numeric(  # type: ignore[call-overload]
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/lib.pyx", line 2449, in pandas._libs.lib.maybe_convert_numeric
ValueError: Unable to parse string "1,1" at position 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Shared/work/skrub/skrub/_apply_to_cols.py", line 176, in fit_transform
    return f(self, X, y=y, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Shared/work/skrub/skrub/_to_float.py", line 197, in fit_transform
    raise RejectColumn(
skrub._apply_to_cols.RejectColumn: Could not convert column None to numbers.

Versions

System:
    python: 3.12.8 (main, Dec  6 2024, 19:42:06) [Clang 18.1.8 ]
executable: /Users/work/venvs/skrub-dev/bin/python
   machine: macOS-26.0.1-arm64-arm-64bit

Python dependencies:
      sklearn: 1.7.1
          pip: None
   setuptools: 80.9.0
        numpy: 2.3.2
        scipy: 1.16.1
       Cython: None
       pandas: 2.3.2
   matplotlib: 3.10.5
       joblib: 1.5.1
threadpoolctl: 3.6.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/work/venvs/skrub-dev/lib/python3.12/site-packages/sklearn/.dylibs/libomp.dylib
        version: None
0.7.dev0

gabrielapgomezji avatar Nov 03 '25 16:11 gabrielapgomezji

I am not sure how to deal with this problem. It may be a problem depending on the locale that is being used by the user, or when using ToFloat on the result of string operations (e.g., if the series is the result of splitting strings in another series #1726).

A possible solution could just be replacing , with .. Another idea could be asking the user to provide the locale they're using. Or, we just let the user deal with the problem themselves by replacing ,.

I am also unsure how to add a more informative error message, given that if we are able to understand the error, we might be able to address it in the first place.

rcap107 avatar Nov 03 '25 17:11 rcap107

Another example: France's locale would write 4 567,8 rather than 4,567.8, and usually the , that separates thousands is omitted

rcap107 avatar Nov 03 '25 17:11 rcap107

It may be a problem depending on the locale that is being used by the user,

The locale will definitely be a factor

Another idea could be asking the user to provide the locale they're using.

Might be an option, but I figure that many of our users don't know what a locale is (Gabriela, do you know what a locale is?)

Or, we just let the user deal with the problem themselves by replacing ,.

How many variants are there of writing numbers? We can probably design a user-understandable API that enables to specify them

GaelVaroquaux avatar Nov 03 '25 22:11 GaelVaroquaux

Some additional considerations (courtesy of LeChat):

1. Decimal and Thousand Separators

  • Dot as decimal, comma as thousand separator: 1,234.56 (US, UK, and many others)
  • Comma as decimal, dot or space as thousand separator: 1.234,56 (much of Europe, Latin America) 1 234,56 (some European countries, e.g., France, Russia)
  • No thousand separator: 1234.56 or 1234,56

2. Grouping of Digits

  • Thousands grouped by comma: 1,234,567.89 (US, UK)
  • Thousands grouped by dot: 1.234.567,89 (Germany, Spain, Italy)
  • Thousands grouped by space: 1 234 567,89 (France, Russia, some Scandinavian countries)
  • Thousands grouped by apostrophe: 1'234'567.89 (Switzerland)

3. Scientific Notation

  • Dot as decimal: 1.23e+4 or 1.23E+4
  • Comma as decimal: 1,23e+4 or 1,23E+4

4. Currency Formats

  • Prefix/suffix symbols: $1,234.56, 1.234,56 €, CHF 1'234.56
  • No space between symbol and number: €1.234,56

5. Negative Numbers

  • Minus sign before the number: -1,234.56 or -1.234,56
  • Parentheses for negative numbers: (1,234.56) or (1.234,56)

6. Percentages

  • Dot as decimal: 50.5%
  • Comma as decimal: 50,5%

7. Special Cases

  • Indian numbering system: Lakhs and crores: 1,23,456.78 (1 lakh = 100,000)
  • Chinese/Japanese numbering: 12,3456.78 (grouped by 4 digits in some contexts)

8. Leading/Trailing Zeros

  • Dot as decimal: 0.56 or .56
  • Comma as decimal: 0,56 or ,56

Recommendations for the Interface

  • Accept both . and , as decimal separators.
  • Accept ,, ., , and ' as thousand separators.
  • Strip all non-numeric characters except . and , before parsing.
  • Use a library (e.g., locale, globalize, or intl) to parse numbers according to the user's locale.
  • Provide clear error messages if parsing fails.

My take:

  • Let's ignore any unit (no %, currency etc.) (see #1726)
  • We could remove all non-numeric characters except for ,, ., +-, e
  • We might need to design some kind of heuristic after that

Both pandas and polars in their respective read_csv function require the user to specify the decimal separator, and polars does not accept a thousands separator as far as I can tell, so it does not look like we can rely on those libraries for the parsing.

I think that if the user can provide the decimal separator we already get somewhere. Trying to guess the format would require heuristics and is going to be far more complicated.

rcap107 avatar Nov 03 '25 23:11 rcap107

Maybe allowing to specify the thousands separator & the decimal separator would be a good starting point indeed. It can be handled efficiently because the thousands separator can just be removed and the dataframe library can deal with the decimal separator itself (and otherwise we could replace it with a dot)

We could remove all non-numeric characters except for ,, ., +-, e

I would not remove any characters (besides the thousand separator) because this would result in misinterpreting some text as numbers (but as you say there is a separate issue for reading units)

Later we may try to come up with some heuristic (eg if we see several ',' and only one '.', we are probably facing the english/american notation), but some fairly realistic cases will remain ambiguous (for example we have ints between 0 and < 1M using ',' in english notation -- hard to tell if ',' is thousands or decimal separator), in which case we can fall back to assuming the US locale

regarding lechat's output: relying on the user's locale will not work, because what matters is what locale was used to generate the file, not to read it -- and often the one used to write the file will be different and hard to guess even for humans who do know about locales

jeromedockes avatar Nov 04 '25 07:11 jeromedockes

regarding lechat's output: relying on the user's locale will not work, because what matters is what locale was used to generate the file, not to read it -- and often the one used to write the file will be different and hard to guess even for humans who do know about locales

Yes, I agree that using the locale is not going to work in most cases.

rcap107 avatar Nov 04 '25 08:11 rcap107