xsv icon indicating copy to clipboard operation
xsv copied to clipboard

improve numeric type inference in xsv stats

Open jungle-boogie opened this issue 10 years ago • 8 comments

Hello,

Sample data here: http://www.sharecsv.com/s/9096d32f98aa0ac671a1cca16fa43be8/SalesJan2009.csv

Running xsv stats SalesJan2009.csv --everything | xsv table

Why does it say the price max is 1200 and the min is 800? The max is actually 7500 and I see a record for 250.

Thanks!

jungle-boogie avatar Aug 18 '15 05:08 jungle-boogie

Here is the output I see:

[andrew@Serval tmp] xsv stats -s Price SalesJan2009.csv | xsv table
field  type     min   max  min_length  max_length  mean  stddev
Price  Unicode  1200  800  3           6

Notice that the type inferred for the Price field is Unicode. Therefore, xsv thinks it is a string instead of a number. I see one record that is "13,000" which naively does not parse as an integer. If I change that to 13000, then xsv stats works as expected:

[andrew@Serval tmp] xsv stats -s Price SalesJan2009.csv | xsv table
field  type     min  max    min_length  max_length  mean         stddev
Price  Integer  250  13000  3           5           1633.767535  1155.455403

I expect there is room for xsv to improve on its type inference.

BurntSushi avatar Aug 18 '15 11:08 BurntSushi

Hi Burnt,

Yes, I had a feeling that how it interpreted the data may affect the outcome of it's results.

How did you change it from unicode to integer?

Thanks!

jungle-boogie avatar Aug 18 '15 13:08 jungle-boogie

I fixed your data:

I see one record that is "13,000" which naively does not parse as an integer. If I change that to 13000, then xsv stats works as expected:

13,000 doesn't parse as an integer because of the comma.

BurntSushi avatar Aug 18 '15 13:08 BurntSushi

Hi,

So correcting that one field made it read all records as integer.

Got it, thanks!

jungle-boogie avatar Aug 18 '15 13:08 jungle-boogie

Yes. xsv looks at the type of each value. If any of them can't be inferred as numeric, then it falls back to interpreting them as plain text.

There's some middle ground here, but we have to be careful with the heuristics we choose.

There's no need to close this issue. Improving xsv's type inference is a worthwhile goal.

BurntSushi avatar Aug 18 '15 13:08 BurntSushi

Hi Brunt,

There's no need to close this issue. Improving xsv's type inference is a worthwhile goal.

I apologize about that and I look forward to any improvements you may bring forth.

Thanks!

jungle-boogie avatar Aug 18 '15 13:08 jungle-boogie

What about being able to force the type -- no need to implement dangerous heuristics, leave the responsibility to the user. The only thing to implement would be to be able to interpret the plain text as number, however it is formatted.

br3no avatar Nov 02 '17 12:11 br3no

A few years later -- I often get 90% of the way with xsv slice / search and then want to end with a stats but get tripped up by a column with currency ($1,300) in which the comma and $ are problematic, and then I'm back to sed / awk. I'd be happy to work on a PR to address this, but I think it would need some architectural direction. Should this be a xsv transform subcommand? A strip or remove option to an existing subcommand?

Thanks for a cool tool :)

n8henrie avatar Mar 14 '23 13:03 n8henrie