ToFloat fails when trying to parse numbers with "," decimal separators
Describe the bug
I am trying to convert to float a string column that uses a comma (",") as the decimal separator. ToFloat fails to parse it.
Steps/Code to Reproduce
```python
>>> import skrub
>>> import pandas as pd
>>> tf = skrub.ToFloat()
>>> s = pd.Series(["1,1", "2,2"])
>>> s
0    1,1
1    2,2
dtype: object
>>> tf.fit_transform(s)
```
Expected Results
```python
>>> tf.fit_transform(s)
0    1.1
1    2.2
```
Actual Results
```
Traceback (most recent call last):
  File "pandas/_libs/lib.pyx", line 2407, in pandas._libs.lib.maybe_convert_numeric
ValueError: Unable to parse string "1,1"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/Shared/work/skrub/skrub/_to_float.py", line 194, in fit_transform
    numeric = sbd.to_float32(column, strict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/work/.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/functools.py", line 909, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Shared/work/skrub/skrub/_dataframe/_common.py", line 784, in _to_float32_pandas
    col = pd.to_numeric(col, errors="raise" if strict else "coerce")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/work/venvs/skrub-dev/lib/python3.12/site-packages/pandas/core/tools/numeric.py", line 235, in to_numeric
    values, new_mask = lib.maybe_convert_numeric(  # type: ignore[call-overload]
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/lib.pyx", line 2449, in pandas._libs.lib.maybe_convert_numeric
ValueError: Unable to parse string "1,1" at position 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Shared/work/skrub/skrub/_apply_to_cols.py", line 176, in fit_transform
    return f(self, X, y=y, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Shared/work/skrub/skrub/_to_float.py", line 197, in fit_transform
    raise RejectColumn(
skrub._apply_to_cols.RejectColumn: Could not convert column None to numbers.
```
Versions
```
System:
    python: 3.12.8 (main, Dec 6 2024, 19:42:06) [Clang 18.1.8 ]
executable: /Users/work/venvs/skrub-dev/bin/python
   machine: macOS-26.0.1-arm64-arm-64bit

Python dependencies:
      sklearn: 1.7.1
          pip: None
   setuptools: 80.9.0
        numpy: 2.3.2
        scipy: 1.16.1
       Cython: None
       pandas: 2.3.2
   matplotlib: 3.10.5
       joblib: 1.5.1
threadpoolctl: 3.6.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/work/venvs/skrub-dev/lib/python3.12/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

0.7.dev0
```
I am not sure how to deal with this problem. It may depend on the locale that was used to produce the data, or it may arise when ToFloat is applied to the result of string operations (e.g., when the series comes from splitting strings in another series, #1726).
A possible solution could be to just replace "," with ".". Another idea could be asking the user to provide the locale they're using. Or, we just let the user deal with the problem themselves by replacing ",".
I am also unsure how to add a more informative error message, given that if we are able to understand the error, we might be able to address it in the first place.
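For reference, the first idea (replacing "," with ".") can already be done on the user side with plain pandas before conversion. In this sketch, `pd.to_numeric` stands in for what ToFloat does internally:

```python
import pandas as pd

s = pd.Series(["1,1", "2,2"])
# Normalize the decimal separator to a dot, then parse as usual
cleaned = s.str.replace(",", ".", regex=False)
result = pd.to_numeric(cleaned)
```

This only works when "," is never used as a thousands separator in the same column, which is exactly the ambiguity discussed below.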
Another example: France's locale would write 4 567,8 rather than 4,567.8, and usually the "," that separates thousands is omitted.
> It may be a problem depending on the locale that is being used by the user,

The locale will definitely be a factor
> Another idea could be asking the user to provide the locale they're using.

Might be an option, but I figure that many of our users don't know what a locale is (Gabriela, do you know what a locale is?)
> Or, we just let the user deal with the problem themselves by replacing ",".

How many variants are there of writing numbers? We can probably design a user-understandable API that lets the user specify them
Some additional considerations (courtesy of LeChat):
1. Decimal and Thousand Separators
- Dot as decimal, comma as thousand separator: 1,234.56 (US, UK, and many others)
- Comma as decimal, dot or space as thousand separator: 1.234,56 (much of Europe, Latin America), 1 234,56 (some European countries, e.g., France, Russia)
- No thousand separator: 1234.56 or 1234,56
2. Grouping of Digits
- Thousands grouped by comma: 1,234,567.89 (US, UK)
- Thousands grouped by dot: 1.234.567,89 (Germany, Spain, Italy)
- Thousands grouped by space: 1 234 567,89 (France, Russia, some Scandinavian countries)
- Thousands grouped by apostrophe: 1'234'567.89 (Switzerland)
3. Scientific Notation
- Dot as decimal: 1.23e+4 or 1.23E+4
- Comma as decimal: 1,23e+4 or 1,23E+4
4. Currency Formats
- Prefix/suffix symbols: $1,234.56, 1.234,56 €, CHF 1'234.56
- No space between symbol and number: €1.234,56
5. Negative Numbers
- Minus sign before the number: -1,234.56 or -1.234,56
- Parentheses for negative numbers: (1,234.56) or (1.234,56)
6. Percentages
- Dot as decimal: 50.5%
- Comma as decimal: 50,5%
7. Special Cases
- Indian numbering system (lakhs and crores): 1,23,456.78 (1 lakh = 100,000)
- Chinese/Japanese numbering: 12,3456.78 (grouped by 4 digits in some contexts)
8. Leading/Trailing Zeros
- Dot as decimal: 0.56 or .56
- Comma as decimal: 0,56 or ,56
Recommendations for the Interface
- Accept both "." and "," as decimal separators.
- Accept ",", ".", and "'" as thousand separators.
- Strip all non-numeric characters except "." and "," before parsing.
- Use a library (e.g., locale, globalize, or intl) to parse numbers according to the user's locale.
- Provide clear error messages if parsing fails.
My take:
- Let's ignore any unit (no %, currency etc.) (see #1726)
- We could remove all non-numeric characters except for ",", ".", "+", "-", "e"
- We might need to design some kind of heuristic after that
Both pandas and polars, in their respective read_csv functions, require the user to specify the decimal separator, and polars does not accept a thousands separator as far as I can tell, so it does not look like we can rely on those libraries for the parsing.
I think that if the user can provide the decimal separator we already get somewhere. Trying to guess the format would require heuristics and is going to be far more complicated.
Maybe allowing the user to specify the thousands separator and the decimal separator would be a good starting point indeed. It can be handled efficiently because the thousands separator can just be removed, and the dataframe library can deal with the decimal separator itself (otherwise we could replace it with a dot).
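A minimal sketch of that starting point (the helper name and signature here are hypothetical, not skrub's actual API): drop the thousands separator, normalize the decimal separator to a dot, then let pandas parse:

```python
import pandas as pd

def to_float_with_separators(s, decimal=",", thousands=None):
    # Hypothetical helper, not skrub's actual API
    if thousands is not None:
        s = s.str.replace(thousands, "", regex=False)
    if decimal != ".":
        s = s.str.replace(decimal, ".", regex=False)
    return pd.to_numeric(s)

s = pd.Series(["1 234,56", "4 567,8"])
result = to_float_with_separators(s, decimal=",", thousands=" ")
```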
> We could remove all non-numeric characters except for ",", ".", "+", "-", "e"

I would not remove any characters (besides the thousands separator) because this would result in misinterpreting some text as numbers (but as you say there is a separate issue for reading units)
Later we may try to come up with some heuristic (e.g., if we see several ',' and only one '.', we are probably facing the English/American notation), but some fairly realistic cases will remain ambiguous (for example, ints between 0 and 1M written with ',' in English notation: it is hard to tell whether ',' is the thousands or the decimal separator), in which case we can fall back to assuming the US locale.
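One possible shape of such a heuristic, for illustration only (the function name and the exact rules are made up here; the ambiguous case returns None so the caller can fall back to assuming the US locale):

```python
def guess_decimal_separator(values):
    """Guess whether ',' or '.' is the decimal separator; None if ambiguous."""
    n_comma = sum(v.count(",") for v in values)
    n_dot = sum(v.count(".") for v in values)
    if n_comma and n_dot:
        # When both appear, the rightmost separator is the decimal one
        v = values[0]
        return "," if v.rfind(",") > v.rfind(".") else "."
    if n_comma:
        # A single ',' followed by exactly 3 digits is ambiguous: it could be
        # a thousands separator (US) or a decimal separator (EU)
        if all(v.count(",") == 1 and len(v.split(",")[1]) != 3 for v in values):
            return ","
        return None  # ambiguous: caller falls back to the US convention
    return "."
```

As the comment above predicts, `["1,234", "5,678"]` comes out as ambiguous, while `["1,1", "2,2"]` (the series from this bug report) is confidently classified as comma-decimal.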
Regarding LeChat's output: relying on the user's locale will not work, because what matters is the locale that was used to generate the file, not the one used to read it. Often the locale used to write the file will be different and hard to guess, even for humans who do know about locales.
> regarding lechat's output: relying on the user's locale will not work, because what matters is what locale was used to generate the file, not to read it
Yes, I agree that using the locale is not going to work in most cases.