Incorrect parsing of CSV file. Locale is ignored for some columns
I have encountered a funny bug when parsing a CSV file. The floating point numbers in that file are specified with a comma , instead of dot ., and this should be a subject of specifying the correct locale. However, it seems that Locale is respected for one column and ignored for the other. For instance, the comma is ignored for the number 68,83, and the result is 6883.
CSV file: https://github.com/antonarhipov/kotlin-sandbox/blob/master/energy-consumption.csv Notebook: https://github.com/antonarhipov/kotlin-sandbox/blob/master/demo.ipynb
Ok, this is interesting. Here's the culprit:
val nf = NumberFormat.getInstance(Locale("et", "EE"))
nf.parse("-0,08")
This code throws an exception. Because in Estonian locale a different minus sign is used.
So DataFrame tries to parse this column using locale specific rules and fails. Next step of parsing is locale agnostic and it creates wrong values.
We need to improve UX in this part. The fact that locale parameter defines not only , and . is surprising. Maybe it shouldn't.. Suggested workaround: use ParserOptions(Locale.FRANCE)
Note for future: let's check how it's handled in R dplyr and maybe use it as a reference. I think they might be good at interpretation of all kinds of data
Ok, this is interesting. Here's the culprit:
val nf = NumberFormat.getInstance(Locale("et", "EE")) nf.parse("-0,08")This code throws an exception. Because in Estonian locale a different minus sign is used. So DataFrame tries to parse this column using locale specific rules and fails. Next step of parsing is locale agnostic and it creates wrong values. We need to improve UX in this part. The fact that locale parameter defines not only
,and.is surprising. Maybe it shouldn't.. Suggested workaround: useParserOptions(Locale.FRANCE)
Actually it makes sense that the locale parameter also looks at minus signs etc. Just have a look at Java's DecimalFormatSymbols, it includes the zero digit, grouping separator, decimal separator, per mille sign, percent char, NaN, minus sign, monetary decimal separator, exponent separator, etc... The decimal character is just the tip of the iceberg really. So, if you read a csv with an estonian locale it will do so :). If you have mixed locales within one file, I'd suggest reading the erroneous columns as String and parsing them manually with the correct locale in the step afterwards.
fixed with ParserOptions(useFastDoubleParser = true). Eventually this new parser will become the default :)