xsv icon indicating copy to clipboard operation
xsv copied to clipboard

Feature: reshape data

Open sd2k opened this issue 5 years ago • 5 comments

I think it'd great to be able to reshape data in a similar fashion to:

It would be amazing to be able to do this with files which are too big for memory, too. That's fairly easy for 'melting' (i.e. going from 'wide' to 'long'), but less so for 'pivoting' (going from 'long' to 'wide' format), unless the data is guaranteed to be sorted by the correct columns.

Miller appears to be able to do this but leaves a little to be desired; melting seems to be very slow for extremely wide (100000+ column files), and pivoting requires all the data to be read into memory.

I can attempt a PR if you think this is a good idea!

sd2k avatar Jun 27 '19 14:06 sd2k

Could you describe this feature without pointing to other tools? I just read the links you gave, but I've found the docs to be inscrutable. If you could describe the relationship between input and output more directly, that would help. It would also help to state the relationship between this desired feature and xsv flatten.

BurntSushi avatar Jun 27 '19 14:06 BurntSushi

Of course, sorry!

The idea is to provide functionality to convert a CSV file between two formats. The formats are:

wide

This is the 'default' format you'd expect to see a CSV file in: one row per sample, one column per variable. Generally there will be one or more columns used to identify a row. Here's an example using the 'mtcars' dataset included in R. In this case, the 'car' column is the identifier.

➜  ~ xsv table mtcars_wide.csv
car                  mpg   cyl  disp   hp   drat  wt     qsec   vs  am  gear  carb
Mazda RX4            21    6    160    110  3.9   2.62   16.46  0   1   4     4
Mazda RX4 Wag        21    6    160    110  3.9   2.875  17.02  0   1   4     4
Datsun 710           22.8  4    108    93   3.85  2.32   18.61  1   1   4     1
Hornet 4 Drive       21.4  6    258    110  3.08  3.215  19.44  1   0   3     1
Hornet Sportabout    18.7  8    360    175  3.15  3.44   17.02  0   0   3     2
Valiant              18.1  6    225    105  2.76  3.46   20.22  1   0   3     1
Duster 360           14.3  8    360    245  3.21  3.57   15.84  0   0   3     4
Merc 240D            24.4  4    146.7  62   3.69  3.19   20     1   0   4     2
Merc 230             22.8  4    140.8  95   3.92  3.15   22.9   1   0   4     2
Merc 280             19.2  6    167.6  123  3.92  3.44   18.3   1   0   4     4
Merc 280C            17.8  6    167.6  123  3.92  3.44   18.9   1   0   4     4
Merc 450SE           16.4  8    275.8  180  3.07  4.07   17.4   0   0   3     3
Merc 450SL           17.3  8    275.8  180  3.07  3.73   17.6   0   0   3     3
Merc 450SLC          15.2  8    275.8  180  3.07  3.78   18     0   0   3     3
Cadillac Fleetwood   10.4  8    472    205  2.93  5.25   17.98  0   0   3     4
Lincoln Continental  10.4  8    460    215  3     5.424  17.82  0   0   3     4
Chrysler Imperial    14.7  8    440    230  3.23  5.345  17.42  0   0   3     4
Fiat 128             32.4  4    78.7   66   4.08  2.2    19.47  1   1   4     1
Honda Civic          30.4  4    75.7   52   4.93  1.615  18.52  1   1   4     2
Toyota Corolla       33.9  4    71.1   65   4.22  1.835  19.9   1   1   4     1
Toyota Corona        21.5  4    120.1  97   3.7   2.465  20.01  1   0   3     1
Dodge Challenger     15.5  8    318    150  2.76  3.52   16.87  0   0   3     2
AMC Javelin          15.2  8    304    150  3.15  3.435  17.3   0   0   3     2
Camaro Z28           13.3  8    350    245  3.73  3.84   15.41  0   0   3     4
Pontiac Firebird     19.2  8    400    175  3.08  3.845  17.05  0   0   3     2
Fiat X1-9            27.3  4    79     66   4.08  1.935  18.9   1   1   4     1
Porsche 914-2        26    4    120.3  91   4.43  2.14   16.7   0   1   5     2
Lotus Europa         30.4  4    95.1   113  3.77  1.513  16.9   1   1   5     2
Ford Pantera L       15.8  8    351    264  4.22  3.17   14.5   0   1   5     4
Ferrari Dino         19.7  6    145    175  3.62  2.77   15.5   0   1   5     6
Maserati Bora        15    8    301    335  3.54  3.57   14.6   0   1   5     8
Volvo 142E           21.4  4    121    109  4.11  2.78   18.6   1   1   4     2

long (aka Entity-Attribute-Value/EAV)

This alternative format stores the data as a set of key-value pairs attached to each ID:

➜  ~ xsv table mtcars_long.csv | head
car                  variable  value
Mazda RX4            mpg       21
Mazda RX4 Wag        mpg       21
Datsun 710           mpg       22.8
Hornet 4 Drive       mpg       21.4
Hornet Sportabout    mpg       18.7
Valiant              mpg       18.1
Duster 360           mpg       14.3
Merc 240D            mpg       24.4
Merc 230             mpg       22.8
➜  ~ xsv table mtcars_long.csv | tail
AMC Javelin          carb      2
Camaro Z28           carb      4
Pontiac Firebird     carb      2
Fiat X1-9            carb      1
Porsche 914-2        carb      2
Lotus Europa         carb      2
Ford Pantera L       carb      4
Ferrari Dino         carb      6
Maserati Bora        carb      8
Volvo 142E           carb      2

The proposed reshape command would allow users to convert between these formats using two modes:

  • melt takes a 'wide' input and a set of one or more column names to be treated as the identity columns, and outputs a 'long' file.
  • pivot takes a 'long' input, the name of a column containing the headers to use in the output, and the name of a column to use as the values in each output cell, and outputs a 'wide' file.

I hadn't noticed xsv flatten before, but it does seem to achieve something similar to 'melt'. The difference is that it doesn't have a concept of the 'identity' column.

xsv flatten'd:

➜  ~ xsv flatten mtcars_wide.csv -s '' | head -n 20
car   Mazda RX4
mpg   21
cyl   6
disp  160
hp    110
drat  3.9
wt    2.62
qsec  16.46
vs    0
am    1
gear  4
carb  4
car   Mazda RX4 Wag
mpg   21
cyl   6
disp  160
hp    110
drat  3.9
wt    2.875
qsec  17.02

Melted:

➜  ~ head -n 20 mtcars_long_2.csv
car,variable,value
AMC Javelin,mpg,15.2
AMC Javelin,cyl,8
AMC Javelin,disp,304
AMC Javelin,hp,150
AMC Javelin,drat,3.15
AMC Javelin,wt,3.435
AMC Javelin,qsec,17.3
AMC Javelin,vs,0
AMC Javelin,am,0
AMC Javelin,gear,3
AMC Javelin,carb,2
Cadillac Fleetwood,mpg,10.4
Cadillac Fleetwood,cyl,8
Cadillac Fleetwood,disp,472
Cadillac Fleetwood,hp,205
Cadillac Fleetwood,drat,2.93
Cadillac Fleetwood,wt,5.25
Cadillac Fleetwood,qsec,17.98
Cadillac Fleetwood,vs,0

sd2k avatar Jun 27 '19 15:06 sd2k

Ah okay, I see now. That makes much more sense. I'm familiar with EAV.

I think I would be in favor of adding these two sub-commands. It would probably be wise to specify the behavior when the provided identifier is not actually unique. This could have an impact on memory usage requirements.

Moreover, it sounds like converting from an EAV format back into a wide format requires storing the entire data set in memory, no?

BurntSushi avatar Jun 27 '19 16:06 BurntSushi

I think I would be in favor of adding these two sub-commands. It would probably be wise to specify the behavior when the provided identifier is not actually unique. This could have an impact on memory usage requirements.

Agreed. I'll do some investigations into the behaviour of the other implementations in that situation, both when melting and pivoting.

Moreover, it sounds like converting from an EAV format back into a wide format requires storing the entire data set in memory, no?

It does unless the entire data set is already sorted along the correct columns (i.e. by the identifier(s), then the 'variable' column) and you know all the column names in advance. This is quite a strong requirement, though; I didn't realise xsv sort didn't work out-of-memory!

On that note, are you aware of any other CLI tools which can sort without reading the whole file into memory, and/or could it be implemented in xsv sort somehow? I guess some sort of merge sort with temporary files would work, but I'm not an expert here.

sd2k avatar Jun 27 '19 17:06 sd2k

On that note, are you aware of any other CLI tools which can sort without reading the whole file into memory

I think GNU sort can do it. Whether it works on your CSV file depends, of course.

and/or could it be implemented in xsv sort somehow?

AIUI, it is a significant implementation challenge to do it both correctly and fast.

BurntSushi avatar Jun 28 '19 12:06 BurntSushi