q icon indicating copy to clipboard operation
q copied to clipboard

Decimal separator for floats

Open LeDom22 opened this issue 10 years ago • 9 comments

Hello, I love your q package, great idea! Yet, I am trying to process a CSV file where the decimal separator is ',' instead of '.' (my locale is fr_FR). It seems the corresponding columns are parsed as integers and all the numerical computations give wrong results. Could you please add a locale parameter to the options list and use it to autodetect the column types? Thanks,

LeDom22 avatar Dec 14 '14 22:12 LeDom22

Ah... That's an interesting one :)

Can you provide an example file - 10-20 lines will be enough, so I can analyze what is needed?

harelba avatar Dec 14 '14 22:12 harelba

Yes, I will email it to you

LeDom22 avatar Dec 14 '14 22:12 LeDom22

Hello Harel, Here is a sample file. It is semicolon-separated and encoded in cp1252 I have not tried yet but I guess that locales also impact date fields. If they are parsed as strings, comparisons will fail. Keep up with this great work!

  • Dom

On Sun, Dec 14, 2014 at 11:08 PM, Harel Ben-Attia [email protected] wrote:

Ah... That's an interesting one :)

Can you provide an example file - 10-20 lines will be enough, so I can analyze what is needed?

— Reply to this email directly or view it on GitHub https://github.com/harelba/q/issues/80#issuecomment-66932268.

LeDom22 avatar Dec 14 '14 22:12 LeDom22

I think the "UNIX" way would be to use the LC_NUMERIC environment variable? Python also has an API to access the corressponding settings: https://docs.python.org/2/library/locale.html#locale.localeconv. But maybe q should provide an option to enable using that, because it would be a big change in behaviour.

bitti avatar Dec 15 '14 04:12 bitti

@LeDom22 Didn't seem to get your sample file by email. Can you please send it?

harelba avatar Dec 15 '14 12:12 harelba

Definitely an interesting one :)

Two bugs here actually:

  • First,the column type is not detected properly (the column is analyzed as text and not as a float/decimal. Run with -A and you'll see what I mean).
  • Second, semi-related of course, is that performing numeric operations on these columns does not work properly.

As for the second part, there is a simple (however crude workaround), which will work for you in the mean time. You can use the function replace to replace the commas with a dot before making calculations.

$ q -d ';' 'select sum(replace(`Euro line amount`,",",".")) from Cs.csv' -e cp1252 -H
124
11900
12
23
33
55
66
77
17983.75
...

Please note that in the example file you sent there are cases where the amount is some kind of code and not a real amount so sum(X) without filtering for the proper lines will not provide a correct money amount.

About the first bug, I will have to take a deeper look at it and see how q can solve this (sqlite does not support decimals out of the box, but q can - and needs - to abstract that from the user).

Harel

harelba avatar Dec 15 '14 19:12 harelba

hi I meet a similar problem. My column contains string like "033586". q -A shows it is "int", so I get result "33586", "0" is missing. Is there a workaround? If q support data-type row in csv/txt file or runtime parameter to specify data type, that will be a great help ! Thanks

haozh avatar Jan 30 '15 07:01 haozh

Hi @haozh ,

Your case even makes this issue much more critical, since it becomes impossible to get the correct data.

Currently, there's no way to specify data types manually.

I've created a temporary modified version which can force all fields to be text fields. If you'd like it, please send me an email and I'll send it to you so you can continue your work, although it will obviously impair your work with the other fields as non-text data.

If you don't want the modified version, there's one ugly workaround I can suggest - If you have a header row in your file, you could stop using the -H parameter - This will lead to a warning, but will include the header data as part of the actual data, leading to inferring the columns as text columns. Using this trick will require two things: You'd need to use the c1,c2,cN field names instead of the real ones, and you'd also have to filter-out the header line using a WHERE clause. If you don't have a header row in the file, you can just add a dummy one, and use the same trick as I described above.

If that doesn't solve your immediate problem, tell me and I'll work on a modified version which will allow you to force all the column types.

Harel

harelba avatar Jan 30 '15 10:01 harelba

Hi Harel

Thank you very much ! Please send windows version to [email protected]

Hao

haozh avatar Feb 02 '15 02:02 haozh