xsv
xsv copied to clipboard
Feature: reshape data
I think it'd great to be able to reshape data in a similar fashion to:
- the
reshape
verb of Miller - the
dcast.data.table
andmelt.data.table
functions of R'sdata.table
(and similar ones inreshape2
,tidyr
, etc) - the
melt
andpivot
functions of Pandas (minus the MultiIndex functionality initially, to reduce complexity)
It would be amazing to be able to do this with files which are too big for memory, too. That's fairly easy for 'melting' (i.e. going from 'wide' to 'long'), but less so for 'pivoting' (going from 'long' to 'wide' format), unless the data is guaranteed to be sorted by the correct columns.
Miller appears to be able to do this but leaves a little to be desired; melting seems to be very slow for extremely wide (100000+ column files), and pivoting requires all the data to be read into memory.
I can attempt a PR if you think this is a good idea!
Could you describe this feature without pointing to other tools? I just read the links you gave, but I've found the docs to be inscrutable. If you could describe the relationship between input and output more directly, that would help. It would also help to state the relationship between this desired feature and xsv flatten
.
Of course, sorry!
The idea is to provide functionality to convert a CSV file between two formats. The formats are:
wide
This is the 'default' format you'd expect to see a CSV file in: one row per sample, one column per variable. Generally there will be one or more columns used to identify a row. Here's an example using the 'mtcars' dataset included in R. In this case, the 'car' column is the identifier.
➜ ~ xsv table mtcars_wide.csv
car mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
Duster 360 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
Cadillac Fleetwood 10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
Lincoln Continental 10.4 8 460 215 3 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
AMC Javelin 15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
Camaro Z28 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
Porsche 914-2 26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
Ford Pantera L 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
Maserati Bora 15 8 301 335 3.54 3.57 14.6 0 1 5 8
Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
long (aka Entity-Attribute-Value/EAV)
This alternative format stores the data as a set of key-value pairs attached to each ID:
➜ ~ xsv table mtcars_long.csv | head
car variable value
Mazda RX4 mpg 21
Mazda RX4 Wag mpg 21
Datsun 710 mpg 22.8
Hornet 4 Drive mpg 21.4
Hornet Sportabout mpg 18.7
Valiant mpg 18.1
Duster 360 mpg 14.3
Merc 240D mpg 24.4
Merc 230 mpg 22.8
➜ ~ xsv table mtcars_long.csv | tail
AMC Javelin carb 2
Camaro Z28 carb 4
Pontiac Firebird carb 2
Fiat X1-9 carb 1
Porsche 914-2 carb 2
Lotus Europa carb 2
Ford Pantera L carb 4
Ferrari Dino carb 6
Maserati Bora carb 8
Volvo 142E carb 2
The proposed reshape
command would allow users to convert between these formats using two modes:
- melt takes a 'wide' input and a set of one or more column names to be treated as the identity columns, and outputs a 'long' file.
- pivot takes a 'long' input, the name of a column containing the headers to use in the output, and the name of a column to use as the values in each output cell, and outputs a 'wide' file.
I hadn't noticed xsv flatten
before, but it does seem to achieve something similar to 'melt'. The difference is that it doesn't have a concept of the 'identity' column.
xsv flatten
'd:
➜ ~ xsv flatten mtcars_wide.csv -s '' | head -n 20
car Mazda RX4
mpg 21
cyl 6
disp 160
hp 110
drat 3.9
wt 2.62
qsec 16.46
vs 0
am 1
gear 4
carb 4
car Mazda RX4 Wag
mpg 21
cyl 6
disp 160
hp 110
drat 3.9
wt 2.875
qsec 17.02
Melted:
➜ ~ head -n 20 mtcars_long_2.csv
car,variable,value
AMC Javelin,mpg,15.2
AMC Javelin,cyl,8
AMC Javelin,disp,304
AMC Javelin,hp,150
AMC Javelin,drat,3.15
AMC Javelin,wt,3.435
AMC Javelin,qsec,17.3
AMC Javelin,vs,0
AMC Javelin,am,0
AMC Javelin,gear,3
AMC Javelin,carb,2
Cadillac Fleetwood,mpg,10.4
Cadillac Fleetwood,cyl,8
Cadillac Fleetwood,disp,472
Cadillac Fleetwood,hp,205
Cadillac Fleetwood,drat,2.93
Cadillac Fleetwood,wt,5.25
Cadillac Fleetwood,qsec,17.98
Cadillac Fleetwood,vs,0
Ah okay, I see now. That makes much more sense. I'm familiar with EAV.
I think I would be in favor of adding these two sub-commands. It would probably be wise to specify the behavior when the provided identifier is not actually unique. This could have an impact on memory usage requirements.
Moreover, it sounds like converting from an EAV format back into a wide format requires storing the entire data set in memory, no?
I think I would be in favor of adding these two sub-commands. It would probably be wise to specify the behavior when the provided identifier is not actually unique. This could have an impact on memory usage requirements.
Agreed. I'll do some investigations into the behaviour of the other implementations in that situation, both when melting and pivoting.
Moreover, it sounds like converting from an EAV format back into a wide format requires storing the entire data set in memory, no?
It does unless the entire data set is already sorted along the correct columns (i.e. by the identifier(s), then the 'variable' column) and you know all the column names in advance. This is quite a strong requirement, though; I didn't realise xsv sort
didn't work out-of-memory!
On that note, are you aware of any other CLI tools which can sort without reading the whole file into memory, and/or could it be implemented in xsv sort
somehow? I guess some sort of merge sort with temporary files would work, but I'm not an expert here.
On that note, are you aware of any other CLI tools which can sort without reading the whole file into memory
I think GNU sort can do it. Whether it works on your CSV file depends, of course.
and/or could it be implemented in xsv sort somehow?
AIUI, it is a significant implementation challenge to do it both correctly and fast.