dataframe icon indicating copy to clipboard operation
dataframe copied to clipboard

Incorrect parsing of CSV file. Locale is ignored for some columns

Open antonarhipov opened this issue 1 year ago • 3 comments

I have encountered a funny bug when parsing a CSV file. The floating point numbers in that file are specified with a comma , instead of dot ., and this should be a subject of specifying the correct locale. However, it seems that Locale is respected for one column and ignored for the other. For instance, the comma is ignored for the number 68,83, and the result is 6883.

image

CSV file: https://github.com/antonarhipov/kotlin-sandbox/blob/master/energy-consumption.csv Notebook: https://github.com/antonarhipov/kotlin-sandbox/blob/master/demo.ipynb

antonarhipov avatar Feb 28 '24 15:02 antonarhipov

Ok, this is interesting. Here's the culprit:

val nf = NumberFormat.getInstance(Locale("et", "EE"))
nf.parse("-0,08")

This code throws an exception. Because in Estonian locale a different minus sign is used. So DataFrame tries to parse this column using locale specific rules and fails. Next step of parsing is locale agnostic and it creates wrong values. We need to improve UX in this part. The fact that locale parameter defines not only , and . is surprising. Maybe it shouldn't.. Suggested workaround: use ParserOptions(Locale.FRANCE)

koperagen avatar Feb 28 '24 18:02 koperagen

Note for future: let's check how it's handled in R dplyr and maybe use it as a reference. I think they might be good at interpretation of all kinds of data

koperagen avatar Feb 28 '24 19:02 koperagen

Ok, this is interesting. Here's the culprit:

val nf = NumberFormat.getInstance(Locale("et", "EE"))
nf.parse("-0,08")

This code throws an exception. Because in Estonian locale a different minus sign is used. So DataFrame tries to parse this column using locale specific rules and fails. Next step of parsing is locale agnostic and it creates wrong values. We need to improve UX in this part. The fact that locale parameter defines not only , and . is surprising. Maybe it shouldn't.. Suggested workaround: use ParserOptions(Locale.FRANCE)

Actually it makes sense that the locale parameter also looks at minus signs etc. Just have a look at Java's DecimalFormatSymbols, it includes the zero digit, grouping separator, decimal separator, per mille sign, percent char, NaN, minus sign, monetary decimal separator, exponent separator, etc... The decimal character is just the tip of the iceberg really. So, if you read a csv with an estonian locale it will do so :). If you have mixed locales within one file, I'd suggest reading the erroneous columns as String and parsing them manually with the correct locale in the step afterwards.

Jolanrensen avatar Oct 09 '24 09:10 Jolanrensen

fixed with ParserOptions(useFastDoubleParser = true). Eventually this new parser will become the default :)

Jolanrensen avatar Oct 31 '24 16:10 Jolanrensen