
Feature request: "aggregate" command

Open d33tah opened this issue 6 years ago • 22 comments

Let's say I have a sorted CSV file with one column and I'd like to get an aggregate view like uniq -c, but with proper CSV as output. Do you consider this worth implementing?

d33tah avatar Jul 09 '17 11:07 d33tah

Could you please provide more detail on what exactly you want? "Output like uniq -c" isn't really descriptive enough, since it doesn't tell me what you're looking for. For example, what do you want that isn't already provided by xsv stats or xsv frequency?

BurntSushi avatar Jul 09 '17 12:07 BurntSushi

@BurntSushi thanks for drawing my attention to xsv frequency! It's pretty close to what I'm looking for, but having a hash table is not really an option - in my case I can't really keep a whole column in memory. If there was a way to add an optional assumption that the file is already sorted, that would do though...

d33tah avatar Jul 09 '17 13:07 d33tah

@d33tah Did you actually try xsv frequency? It does not store the entire column in memory.

If you only care about one column and the data is already sorted, then just do xsv select the-columns data.csv | uniq -c.

BurntSushi avatar Jul 09 '17 13:07 BurntSushi

uniq -c doesn't generate CSV and there are edge cases where it breaks output. I did try xsv frequency and with -l 0 I ran out of memory while trying to process a 6GiB file with multiple columns.
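For example (GNU uniq shown; exact padding varies), uniq -c left-pads the count and separates it from the value with whitespace, which is not valid CSV:

$ printf 'a\na\nb\n' | uniq -c
      2 a
      1 b

And values that themselves contain commas, quotes, or leading whitespace make any naive reformatting of that output unreliable.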

I'm new to Rust - I based the observation that it stores everything in a hashmap on this file:

https://github.com/BurntSushi/rust-stats/blob/master/src/frequency.rs

(and the fact that I ran out of RAM)

d33tah avatar Jul 09 '17 13:07 d33tah

The documentation for xsv frequency states:

Since this computes an exact frequency table, memory proportional to the cardinality of each column is required.

So it's not the entire column, but rather, all unique values in the column.

By default, xsv frequency computes a frequency table for every single column. You might have better luck computing the table for a single column. You can do that with the -s/--select flag.


More generally, adding another command that assumes the input is sorted does seem useful. But there needs to be a lot more work done on specifying what the exact UX of that is. "Something like uniq -c" is unfortunately not good enough since CSV data has multiple columns. That is, since frequency tables are computed for each column, the sort must apply to a single particular column, which is a bit at odds with how the current implementation favors looking at multiple columns. This in turn probably implies that a flag stating that the data is sorted must also require specifying a single column from which to compute the frequency table.

I'm not sure when I would work on this, but the quickest way to get someone else to do it is to put in the work to provide a proper specification. Roughly, this means specifying exactly which flags get added to xsv frequency and what their semantics are.
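To sketch what such a command might look like internally (illustrative only, not xsv code; the selected column index is hard-coded and the csv crate is assumed as a dependency): counting runs in a pre-sorted column needs only the current value and a counter in memory.

// Minimal sketch of uniq -c-style counting over a single,
// pre-sorted column with constant memory.
use std::io;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let col = 0; // hypothetical: the one selected column
    let mut rdr = csv::Reader::from_reader(io::stdin());
    let mut wtr = csv::Writer::from_writer(io::stdout());
    wtr.write_record(&["value", "count"])?;

    let mut current: Option<String> = None;
    let mut count: u64 = 0;
    for result in rdr.records() {
        let record = result?;
        let value = record.get(col).unwrap_or("").to_string();
        if current.as_deref() == Some(value.as_str()) {
            count += 1; // still inside the same run
        } else {
            // run ended: flush it and start a new one
            if let Some(prev) = current.take() {
                wtr.write_record(&[prev, count.to_string()])?;
            }
            current = Some(value);
            count = 1;
        }
    }
    if let Some(prev) = current {
        wtr.write_record(&[prev, count.to_string()])?;
    }
    wtr.flush()?;
    Ok(())
}

Piped as, say, xsv select country data.csv | sorted-freq (a hypothetical binary), this never holds more than one value and one counter, no matter how large the file is.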

BurntSushi avatar Jul 09 '17 13:07 BurntSushi

To be honest, I have no idea how the UI should work here. :/

d33tah avatar Jul 09 '17 19:07 d33tah

Hello,

I think the aggregate function should work something like this. Let's suppose we have this table:

Country  AccentCity       Population
es       Barañáin         22264
es       Puerto Real      36946
at       Moosburg         4602

xsv aggregate Country --sum Population --min Population --max Population --count

would produce something like:

Country   Population(sum) Population(min) Population(max) count
es              59210           22264           36946         2
at               4602            4602            4602         1

I don't know if it is worth implementing. The main issue here, I think, is memory usage.

Interesting options would be: --sum --min --max --avg --stdev --stderr --count

micrenda avatar Jul 06 '18 10:07 micrenda

@micrenda I like the proposal, let's see what @BurntSushi thinks about it. BTW:

The main issue here, I think, is memory usage.

Why do you think so? Most (if not all) of the items you listed have constant memory complexity.
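For example, sum, min, max, count, mean, and even stddev (via Welford's online algorithm) all fit in O(1) running state per group. An illustrative sketch, not xsv's code:

// Constant-space running aggregates; mean/variance via Welford.
#[derive(Default)]
struct Running {
    count: u64,
    sum: f64,
    min: f64,
    max: f64,
    mean: f64,
    m2: f64, // sum of squared deviations from the running mean
}

impl Running {
    fn update(&mut self, x: f64) {
        if self.count == 0 {
            self.min = x;
            self.max = x;
        } else {
            self.min = self.min.min(x);
            self.max = self.max.max(x);
        }
        self.count += 1;
        self.sum += x;
        let delta = x - self.mean;
        self.mean += delta / self.count as f64;
        self.m2 += delta * (x - self.mean);
    }

    fn stddev(&self) -> f64 {
        // population stddev; --stderr would be stddev / sqrt(count)
        if self.count > 0 { (self.m2 / self.count as f64).sqrt() } else { 0.0 }
    }
}

fn main() {
    let mut r = Running::default();
    for x in [22264.0, 36946.0] { // the two "es" rows from the example above
        r.update(x);
    }
    println!("count={} sum={} min={} max={} mean={} stddev={}",
             r.count, r.sum, r.min, r.max, r.mean, r.stddev());
}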

d33tah avatar Jul 06 '18 10:07 d33tah

Because you have to keep the group-by keys in memory (in my example: Country, but it could be many more keys).

Personally, it would be useful to me. But I know it is very important to follow the KISS philosophy (keep it simple, stupid).

There is another way to get the same result, using a different program (I am reporting it here in case someone has the same problem): q - Text as Data - https://github.com/harelba/q

In this case the solution is something like this: q "select c2, sum(c3) from ./data.csv group by c2"

micrenda avatar Jul 06 '18 10:07 micrenda

@micrenda Could you please explain why xsv stats is insufficient for your use case? It literally provides exactly the same information as shown in your example for xsv aggregate.

BurntSushi avatar Jul 06 '18 13:07 BurntSushi

First of all, I want to make clear that I am not promoting this feature; I just tried to explain the intention of the first author of this thread. Personally, I think any tool must be kept as simple as possible: do just one job and do it efficiently. Implementing every possible use case is simply impossible and could destroy the design of an elegant application.

Returning to our example, the "aggregate" idea comes from the SQL world. Given a CSV file as input, one often needs another CSV with grouped values as output.

Here is a more detailed example. Let's suppose we have this input CSV:

country,    region,     city,   people,     income
IT,         MI,         Milan,   10000,  50000000.
IT,         MI,         Pavia,    2000,   1000000.
IT,         RM,         Rome,     5000,   8000000.
IT,         PA,         Palermo,  5000,    200000.
FR,         PA,         Paris,   10000,  50000000.
FR,         LY,         Lyon,     6000,   2000000.
UK,         LN,         London,  34000,  60000000.

Let's suppose I want to know how many inhabitants there are per country: xsv aggregate country --sum people

country,   sum(people)
IT,        22000
FR,        16000
UK,        34000

Let's suppose we want to know the max income per country: xsv aggregate country --max income

country,   max(income)
IT,        50000000
FR,        50000000
UK,        60000000

Now let's count how many cities there are per country and region: xsv aggregate country region --count

country,    region,  count
IT,         MI,         2
IT,         RM,         1
IT,         PA,         1
FR,         PA,         1
FR,         LY,         1
UK,         LN,         1

And now something which is nonsense, but still valid:

xsv aggregate region --count

region,  count
MI,         2
RM,         1
PA,         2
LY,         1
LN,         1

As I said before, it is not a simple task, and I think it requires memory proportional to the number of distinct grouping keys. So I am not very sure it is a good idea to implement it.
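To make the memory point concrete, here is an illustrative sketch (not xsv code; the group and value columns are hard-coded and the csv crate is assumed) of the hash-map approach. State grows with the number of distinct (country, region) pairs, not with the number of rows:

// Group-by aggregation: one map entry per distinct key.
use std::collections::HashMap;
use std::io;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    // Hypothetical selection: group by columns 0 and 1, sum column 3.
    let mut sums: HashMap<(String, String), f64> = HashMap::new();
    for result in rdr.records() {
        let r = result?;
        let key = (
            r.get(0).unwrap_or("").to_string(),
            r.get(1).unwrap_or("").to_string(),
        );
        let val: f64 = r.get(3).unwrap_or("0").trim().parse().unwrap_or(0.0);
        *sums.entry(key).or_insert(0.0) += val;
    }
    let mut wtr = csv::Writer::from_writer(io::stdout());
    wtr.write_record(&["country", "region", "sum(people)"])?;
    for ((country, region), sum) in &sums {
        wtr.write_record(&[country.clone(), region.clone(), sum.to_string()])?;
    }
    wtr.flush()?;
    Ok(())
}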

micrenda avatar Jul 06 '18 13:07 micrenda

@micrenda Sure... But all of those things can be done today with xsv stats and xsv frequency...

BurntSushi avatar Jul 06 '18 14:07 BurntSushi

Basically, it's not at all clear to me what @d33tah is asking for here, and how exactly it differs from what's in xsv already.

BurntSushi avatar Jul 06 '18 14:07 BurntSushi

I took a look at xsv stats and xsv frequency and I am not sure they can do this (or, at least, not in a simple way). I solved my problem using another program.

Anyway, if you don't need something like this, I don't think you should implement it. It is possible that not many users need it.

micrenda avatar Jul 06 '18 14:07 micrenda

OK, there has to be a communication problem here. This is why I've been trying to ask y'all to tell me what the difference is between what you want and what xsv stats and xsv frequency already provide. But you just keep describing what you want instead of the diff. From what I can tell, all you want is the ability to compute various statistics over the data. xsv stats does that:

[andrew@Cheetah xsv-86] xsv table data.csv
country  region  city     people  income
IT       MI      Milan    10000   50000000
IT       MI      Pavia    2000    1000000
IT       RM      Rome     5000    8000000
IT       PA      Palermo  5000    200000
FR       PA      Paris    10000   50000000
FR       LY      Lyon     6000    2000000
UK       LN      London   34000   60000000
[andrew@Cheetah xsv-86] xsv stats data.csv | xsv table
field    type     sum        min     max       min_length  max_length  mean                stddev
country  Unicode             FR      UK        2           2
region   Unicode             LN      RM        2           2
city     Unicode             London  Rome      4           7
people   Integer  72000      2000    34000     4  

And if you also want frequency analysis, that works too:

[andrew@Cheetah xsv-86] xsv frequency data.csv | xsv table
field    value     count
country  IT        4
country  FR        2
country  UK        1
region   PA        2
region   MI        2
region   LY        1
region   LN        1
region   RM        1
city     Paris     1
city     Pavia     1
city     Rome      1
city     Lyon      1
city     Palermo   1
city     Milan     1
city     London    1
people   10000     2
people   5000      2
people   2000      1
people   34000     1
people   6000      1
income   50000000  2
income   60000000  1
income   2000000   1
income   8000000   1
income   1000000   1
income   200000    1

The only thing that I can see that is different is the output format. Is that what y'all are asking for? The same computation but in a different output format?

BurntSushi avatar Jul 06 '18 14:07 BurntSushi

Hello,

I think I am biased by many years of SQL :) What I was thinking of was something like GROUP BY, with aggregate functions like SUM, COUNT, MIN, MAX, etc. It is not just the format, but really a way to consolidate a table (similar to Excel pivot tables).

Indeed, the frequency and stats commands can perform a subset of these functions (count, min, max), but they are not as versatile as GROUP BY.

I suggest leaving this thread open: maybe other users will arrive here and explain their use case. Unfortunately, I am not a Rust programmer (I am a C++ and Java programmer), so I cannot contribute a patch.

micrenda avatar Jul 06 '18 15:07 micrenda

@d33tah Could you please elaborate on your initial request? If this is really a request for something as elaborate as a general GROUP BY, then this will be easy for me to close, because I don't think xsv will have that. It is not a goal for xsv to become as versatile as SQL.

BurntSushi avatar Jul 06 '18 15:07 BurntSushi

OK, let's do so. I cannot work on it right now, but next week I will have some spare time to try to implement it. I wanted to learn Rust, so it is a good opportunity for me.

I will implement a patch and then send it to you. If you like it, you can integrate it; otherwise, no problem.

micrenda avatar Jul 06 '18 15:07 micrenda

There is another tool I just ran into that seems to do this. https://www.gnu.org/software/datamash/examples/#example_grouping

jonathanvx avatar Nov 29 '18 01:11 jonathanvx

Definitely need something like this. My particular use case is a long list of transactions (charges and payments) with an invoice number, a person's name and the transaction amount in each row (the amount can be positive or negative). I'd like to compute the total of each invoice, so something like @micrenda's suggestion above would work (slightly modified syntax to use the standard --select option):

xsv aggregate --select Name,InvoiceNo --sum Amount

elifiner avatar Mar 12 '19 02:03 elifiner

FWIW I think that sqlite3 can suffice here with .import <file.csv> table and aggregating based off that. The output of xsv can be fed into sqlite3. As well, the output of sqlite3 can be fed into xsv.
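For instance, something along these lines works (file and table names illustrative; recent sqlite3 versions take the column names from the CSV header row when importing into a new table):

$ sqlite3
sqlite> .mode csv
sqlite> .import data.csv cities
sqlite> SELECT country, SUM(people) FROM cities GROUP BY country;

Because .mode csv is set, the SELECT output is itself CSV, which xsv can consume.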

twmb avatar Sep 20 '19 06:09 twmb

> Definitely need something like this. My particular use case is a long list of transactions (charges and payments) with an invoice number, a person's name and the transaction amount in each row (the amount can be positive or negative). I'd like to compute the total of each invoice, so something like @micrenda's suggestion above would work (slightly modified syntax to use the standard --select option):
>
> xsv aggregate --select Name,InvoiceNo --sum Amount

I am using Miller for this.

mlr --csv stats1 -a sum -f Amount -g Name,InvoiceNo invoices.csv > invoices-grouped.csv

jackdbd avatar Aug 30 '21 21:08 jackdbd