
Add a decimal parameter to read_csv / scan_csv

Open · Bebio95 opened this issue 2 years ago · 16 comments

Problem description

As a French user of polars, it would be very convenient to have a decimal parameter (as in pandas) to specify the decimal separator (',' for France, and I think also for Germany) and obtain the desired DataFrame directly, without being forced to use a str.replace on every import.
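
For context, a minimal sketch of the workaround referred to above (the file name, column name, and cast are illustrative only, and older polars versions call the separator argument sep rather than separator):

import polars as pl

# The file uses ';' as column separator; columns containing ',' decimals
# are inferred as strings instead of floats.
df = pl.read_csv("data.csv", separator=";")

# Current workaround: fix the decimal comma and cast each affected column by hand.
df = df.with_columns(
    pl.col("price").str.replace(",", ".").cast(pl.Float64)
)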

Bebio95 avatar Feb 06 '23 10:02 Bebio95

As a LatAm user I have come across the same need. I recently made a PR to allow passing this type of parsing instruction to pyarrow through the Polars API; the only limitation is that pyarrow is not available for use with the lazy API via scan_csv.

igmriegel avatar Feb 06 '23 11:02 igmriegel

I'm curious what the CSV files look like - if they are using a comma inside the number, presumably they must use a different (non-comma) separator? (TAB, perhaps?) Or are all the numeric values typically double-quoted instead?

For example...

colx\tcoly
1,234\t4,567

...or:

colx,coly
"1,234","4,567"

alexander-beedie avatar Feb 06 '23 19:02 alexander-beedie


I can answer for France, where the semicolon is used as separator.

Bebio95 avatar Feb 06 '23 20:02 Bebio95

The standard file separator for us is the semicolon, so the text would look like:

colx;coly
10,20;4,5

Some systems sometimes even use the pipe as separator, but that is not a problem.

colx|coly
10,20|4,5

The important detail is that the numbers are not wrapped in quotes and the comma is the decimal separator. Therefore: LatAm -> 500.432,98 (probably French too); US -> 500,432.98.

We just swap the dot for the comma.

Edit 1: https://en.wikipedia.org/wiki/Decimal_separator#Examples_of_use

A Wikipedia table showing the common radix point patterns across the globe.

igmriegel avatar Feb 06 '23 21:02 igmriegel

@Bebio95 For now you could use pyarrow to import your data and then use polars.from_arrow to convert to Polars: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.from_arrow.html#

The pyarrow parsing options: https://arrow.apache.org/docs/python/csv.html#customized-conversion
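
A minimal sketch of that workaround, assuming a semicolon-delimited file named "data.csv" and a pyarrow version that supports ConvertOptions(decimal_point=...):

import pyarrow.csv as pa_csv
import polars as pl

table = pa_csv.read_csv(
    "data.csv",
    parse_options=pa_csv.ParseOptions(delimiter=";"),          # ';' column separator
    convert_options=pa_csv.ConvertOptions(decimal_point=","),  # ',' decimal separator
)
df = pl.from_arrow(table)  # convert the Arrow table into a polars DataFrame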

igmriegel avatar Feb 06 '23 21:02 igmriegel

Ok, looks much as I expected, thanks; I think I can add this facility into the polars-native (Rust) parser at very little cost, but probably not until the weekend 👍

alexander-beedie avatar Feb 07 '23 07:02 alexander-beedie

Hello, for the same reasons it would be nice to have a thousands argument (optional; it would sometimes be set to '.' by European and LatAm users, and to ',' by English-speaking users). This value is often present due to formatting, as in the example above (https://github.com/pola-rs/polars/issues/6698#issuecomment-1419753224), and the character could just be dropped before the actual parsing. Thanks!
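
To illustrate dropping the thousands character before parsing, a minimal polars sketch of what currently has to be done by hand (the column name and sample values are made up):

import polars as pl

df = pl.DataFrame({"amount": ["1.234.567,89", "500.432,98"]})

df = df.with_columns(
    pl.col("amount")
    .str.replace_all(".", "", literal=True)  # drop the '.' thousands separator
    .str.replace(",", ".")                   # turn the decimal comma into a dot
    .cast(pl.Float64)
)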

danilogalisteu avatar Feb 11 '23 22:02 danilogalisteu

Is there any progress on this? It would be great to be able to specify something like polars.read_csv('foo.csv', decimal_char=",") for us Europeans. Every column with decimal values currently defaults to string instead of float.

bjornasm avatar Apr 12 '23 11:04 bjornasm

That PR only aids the write function; we need these parameters on the read and scan functions too... https://github.com/pola-rs/polars/issues/7806

igmriegel avatar Apr 12 '23 18:04 igmriegel

I would appreciate this feature being added too!

JavierRojas14 avatar Apr 13 '23 14:04 JavierRojas14

I would love to see this too! In my experience, CSVs with a comma as decimal separator (and usually a semicolon as column separator) are unfortunately very common :D

Julian-J-S avatar Apr 18 '23 09:04 Julian-J-S

It seems that the fast_float crate is used to parse to floating point types, but it does not support specifying a different decimal separator. The lexical crate, which is already used to parse to integer types, does support it through its lexical::parse_with_options method. Though I guess using lexical instead of fast_float could incur a performance hit. I would be willing to work on implementing this feature.

LucasBou avatar Oct 25 '23 12:10 LucasBou

@alexander-beedie just going thru my list of CSV issues - did you end up getting this to work? fast_float::parse doesn't seem to support custom decimal separators unfortunately. As a related issue, it would be nice to support thousands separators for both int (e.g. "10,000") and float (e.g. "10,000.5"), with the ability to customize both the thousands separator and the decimal separator.

Wainberg avatar Jan 15 '24 00:01 Wainberg

@alexander-beedie just going thru my list of CSV issues - did you end up getting this to work? fast_float::parse doesn't seem to support custom decimal separators unfortunately.

I was planning to as our previous float parser did support this; the newer SIMD parser unfortunately does not, so this is currently stuck until such support can be added 😓

alexander-beedie avatar Jan 15 '24 08:01 alexander-beedie

Hi @alexander-beedie , just wondering what the "newer" SIMD parser is. Is it another crate or does Pola.rs have its own CSV parser implementation?

jqnatividad avatar Mar 25 '24 11:03 jqnatividad

Hi @alexander-beedie , just wondering what the "newer" SIMD parser is. Is it another crate or does Pola.rs have its own CSV parser implementation?

We have our own CSV parser, but this is referring to the SIMD string→float parsing library which that CSV parser calls when handling float-like strings.

alexander-beedie avatar Mar 25 '24 12:03 alexander-beedie

I think it would also be really nice to see a thousands parameter, just like pd.read_csv(file, thousands=','). Lots of SAS-exported CSVs still include thousands separators.

kevinw26 avatar May 10 '24 05:05 kevinw26