improve numeric type inference in xsv stats
Hello,
Sample data here: http://www.sharecsv.com/s/9096d32f98aa0ac671a1cca16fa43be8/SalesJan2009.csv
Running xsv stats SalesJan2009.csv --everything | xsv table
Why does it say the price max is 1200 and the min is 800? The max is actually 7500 and I see a record for 250.
Thanks!
Here is the output I see:
[andrew@Serval tmp] xsv stats -s Price SalesJan2009.csv | xsv table
field type min max min_length max_length mean stddev
Price Unicode 1200 800 3 6
Notice that the type inferred for the Price field is Unicode. Therefore, xsv thinks it is a string instead of a number. I see one record that is "13,000" which naively does not parse as an integer. If I change that to 13000, then xsv stats works as expected:
[andrew@Serval tmp] xsv stats -s Price SalesJan2009.csv | xsv table
field type min max min_length max_length mean stddev
Price Integer 250 13000 3 5 1633.767535 1155.455403
I expect there is room for xsv to improve on its type inference.
Hi Burnt,
Yes, I had a feeling that how it interpreted the data may affect the outcome of it's results.
How did you change it from unicode to integer?
Thanks!
I fixed your data:
I see one record that is
"13,000"which naively does not parse as an integer. If I change that to13000, then xsv stats works as expected:
13,000 doesn't parse as an integer because of the comma.
Hi,
So correcting that one field made it read all records as integer.
Got it, thanks!
Yes. xsv looks at the type of each value. If any of them can't be inferred as numeric, then it falls back to interpreting them as plain text.
There's some middle ground here, but we have to be careful with the heuristics we choose.
There's no need to close this issue. Improving xsv's type inference is a worthwhile goal.
Hi Brunt,
There's no need to close this issue. Improving xsv's type inference is a worthwhile goal.
I apologize about that and I look forward to any improvements you may bring forth.
Thanks!
What about being able to force the type -- no need to implement dangerous heuristics, leave the responsibility to the user. The only thing to implement would be to be able to interpret the plain text as number, however it is formatted.
A few years later -- I often get 90% of the way with xsv slice / search and then want to end with a stats but get tripped up by a column with currency ($1,300) in which the comma and $ are problematic, and then I'm back to sed / awk. I'd be happy to work on a PR to address this, but I think it would need some architectural direction. Should this be a xsv transform subcommand? A strip or remove option to an existing subcommand?
Thanks for a cool tool :)