xsv
Feature request: "aggregate" command
Let's say I have a sorted CSV file with one column and I'd like to get an aggregate view like `uniq -c`, but with proper CSV as the output. Do you consider this worth implementing?
Could you please provide more detail on what exactly you want? "Output like `uniq -c`" isn't really descriptive enough, since it doesn't tell me what you're looking for. For example, what do you want that isn't already provided by `xsv stats` or `xsv frequency`?
@BurntSushi thanks for drawing my attention to `xsv frequency`! It's pretty close to what I'm looking for, but a hash table is not really an option: in my case I can't keep a whole column in memory. If there were a way to add an optional assumption that the file is already sorted, that would do, though.
@d33tah Did you actually try `xsv frequency`? It does not store the entire column in memory.
If you only care about one column and the data is already sorted, then just do `xsv select the-columns data.csv | uniq -c`.
`uniq -c` doesn't generate CSV, and there are edge cases where it breaks the output. I did try `xsv frequency`, and with `-l 0` I ran out of memory while trying to process a 6 GiB file with multiple columns.
I'm new to Rust; I based the observation that it stores everything in a hash map on this file:
https://github.com/BurntSushi/rust-stats/blob/master/src/frequency.rs
(and on the fact that I ran out of RAM).
The documentation for `xsv frequency` states:
> Since this computes an exact frequency table, memory proportional to the cardinality of each column is required.
So it's not the entire column, but rather, all unique values in the column.
By default, `xsv frequency` computes a frequency table for every single column. You might have better luck computing the table for a single column. You can do that with the `-s/--select` flag.
More generally, adding another command that assumes the input is sorted does seem useful. But there needs to be a lot more work done on specifying the exact UX. "Something like `uniq -c`" is unfortunately not good enough, since CSV data has multiple columns. That is, since frequency tables are computed for each column, the sort must apply to one particular column, which is somewhat at odds with how the current implementation favors looking at multiple columns. This in turn probably implies that a flag stating that the data is sorted must also require specifying a single column from which to compute the frequency table.
I'm not sure when I would work on this, but the quickest way to get someone else to do it is to put in the work to provide a proper specification. Roughly, this means specifying exactly which flags get added to `xsv frequency` and what their semantics are.
To be honest, I have no idea how the UI should work here. :/
Hello,
I think the aggregate function should work something like this. Let's suppose we have this table:
Country AccentCity Population
es Barañáin 22264
es Puerto Real 36946
at Moosburg 4602
xsv aggregate Country --sum Population --min Population --max Population --count
would produce something like:
Country Population(sum) Population(min) Population(max) count
es 59210 22264 36946 2
at 4602 4602 4602 1
I don't know if it is worth implementing. The main issue here, I think, is the memory usage.
Interesting options would be: --sum --min --max --avg --stdev --stderr --count
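As a rough illustration of the proposed semantics (a Python sketch with hypothetical function names, not an actual xsv implementation), a hash-based group-by keeps one accumulator per distinct key tuple, which is exactly where the memory cost comes from:

```python
import csv, io
from collections import OrderedDict

def aggregate(reader, keys, value):
    """Group rows by `keys` and compute sum/min/max/count of `value`.

    One accumulator per distinct key tuple is kept, so memory grows
    with the number of groups (not with the number of rows).
    """
    groups = OrderedDict()
    for row in reader:
        k = tuple(row[c] for c in keys)
        v = float(row[value])
        acc = groups.get(k)
        if acc is None:
            groups[k] = [v, v, v, 1]  # sum, min, max, count
        else:
            acc[0] += v
            acc[1] = min(acc[1], v)
            acc[2] = max(acc[2], v)
            acc[3] += 1
    return groups

data = ("Country,AccentCity,Population\n"
        "es,Barañáin,22264\n"
        "es,Puerto Real,36946\n"
        "at,Moosburg,4602\n")
for key, (s, lo, hi, n) in aggregate(csv.DictReader(io.StringIO(data)),
                                     ["Country"], "Population").items():
    print(key[0], s, lo, hi, n)
# es 59210.0 22264.0 36946.0 2
# at 4602.0 4602.0 4602.0 1
```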
@micrenda I like the proposal, let's see what @BurntSushi thinks about it. BTW:
The main issue here I think it is the memory usage.
Why do you think so? Most (if not all) of the items you listed have constant memory complexity.
Because you have to keep the group-by keys in memory (in my example just Country, but there could be many more keys).
Personally, it would be useful to me. But I know it is very important to follow the "KISS" philosophy (keep it simple, stupid).
There is another way to get the same result using a different program (I am reporting it here in case someone has the same problem): q - Text as Data - https://github.com/harelba/q
In this case the solution looks something like this:
q "select c2, sum(c3) from ./data.csv group by c2"
@micrenda Could you please explain why `xsv stats` is insufficient for your use case? It provides literally the same information as shown in your example for `xsv aggregate`.
First of all, I want to make clear that I am not promoting this feature; I am just trying to explain the intention of the original author of this thread. Personally, I think any tool must be kept as simple as possible: do just one job and do it efficiently. Implementing all possible use cases is impossible and could destroy the design of an elegant application.
Returning to our example, "aggregate" comes from the SQL world. If we have a CSV file as input, we often need another CSV with grouped values as output.
I'll give a more detailed example. Let's suppose we have this input CSV:
country, region, city, people, income
IT, MI, Milan, 10000, 50000000.
IT, MI, Pavia, 2000, 1000000.
IT, RM, Rome, 5000, 8000000.
IT, PA, Palermo, 5000, 200000.
FR, PA, Paris, 10000, 50000000.
FR, LY, Lyon, 6000, 2000000.
UK, LN, London, 34000, 60000000.
Let's suppose I want to know how many inhabitants there are by country:
xsv aggregate country --sum people
country, sum(people)
IT, 22000
FR, 16000
UK, 34000
Let's suppose we want to know the max income by country:
xsv aggregate country --max income
country, max(income)
IT, 50000000
FR, 50000000
UK, 60000000
Now let's count how many cities per region:
xsv aggregate country region --count
country, region, count
IT, MI, 2
IT, RM, 1
IT, PA, 1
FR, PA, 1
FR, LY, 1
UK, LN, 1
And now something which is nonsense, but still valid:
xsv aggregate region --count
region, count
MI, 2
RM, 1
PA, 2
LY, 1
LN, 1
As I said before, it is not a simple task, and I think it requires memory proportional to the number of distinct grouping keys. So I am not very sure it is worth implementing.
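If the input were guaranteed to be sorted by the grouping keys, that memory cost would disappear, since only the current group's accumulator has to be held at any time. A hypothetical Python sketch (illustrative only, not xsv code), shown here for `--sum`:

```python
import csv, io

def sorted_group_sum(reader, keys, value):
    """Stream (key, sum) pairs for input pre-sorted by `keys`.

    Only one accumulator lives in memory at a time, so the number
    of distinct grouping keys no longer matters.
    """
    current, total = None, 0.0
    for row in reader:
        k = tuple(row[c] for c in keys)
        if k != current:
            if current is not None:
                yield current, total
            current, total = k, 0.0
        total += float(row[value])
    if current is not None:
        yield current, total

data = ("country,people\n"
        "FR,10000\nFR,6000\nIT,10000\nIT,2000\nIT,5000\nUK,34000\n")
for key, total in sorted_group_sum(csv.DictReader(io.StringIO(data)),
                                   ["country"], "people"):
    print(key[0], int(total))
# FR 16000
# IT 17000
# UK 34000
```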
@micrenda Sure... But all of those things can be done today with `xsv stats` and `xsv frequency`...
Basically, it's not at all clear to me what @d33tah is asking for here, and how exactly it differs from what's in `xsv` already.
I had a look at `xsv stats` and `xsv frequency` and I am not sure they are able to do it (or at least not in a simple way). For my problem I ended up using another program.
Anyway, if you don't need something like this, I don't think you should implement it. It is possible not many users need it.
OK, there has to be a communication problem here. This is why I've been trying to ask y'all to tell me what the difference is between what you want and what `xsv stats` and `xsv frequency` already provide. But you just keep describing what you want instead of the diff. From what I can tell, all you want is the ability to compute various statistics over the data. `xsv stats` does that:
[andrew@Cheetah xsv-86] xsv table data.csv
country region city people income
IT MI Milan 10000 50000000
IT MI Pavia 2000 1000000
IT RM Rome 5000 8000000
IT PA Palermo 5000 200000
FR PA Paris 10000 50000000
FR LY Lyon 6000 2000000
UK LN London 34000 60000000
[andrew@Cheetah xsv-86] xsv stats data.csv | xsv table
field type sum min max min_length max_length mean stddev
country Unicode FR UK 2 2
region Unicode LN RM 2 2
city Unicode London Rome 4 7
people Integer 72000 2000 34000 4
and if you also want frequency analysis, that works too:
[andrew@Cheetah xsv-86] xsv frequency data.csv | xsv table
field value count
country IT 4
country FR 2
country UK 1
region PA 2
region MI 2
region LY 1
region LN 1
region RM 1
city Paris 1
city Pavia 1
city Rome 1
city Lyon 1
city Palermo 1
city Milan 1
city London 1
people 10000 2
people 5000 2
people 2000 1
people 34000 1
people 6000 1
income 50000000 2
income 60000000 1
income 2000000 1
income 8000000 1
income 1000000 1
income 200000 1
The only thing that I can see that is different is the output format. Is that what y'all are asking for? The same computation but in a different output format?
Hello,
I think I am biased by many years of SQL :) What I was thinking about was something like "GROUP BY", with aggregate functions like SUM, COUNT, MIN, MAX, etc. It is not just the format, but really a way to consolidate a table (similar to Excel pivot tables).
Indeed, the frequency and stats commands can perform a subset of these functions (count, min, max), but they are not as versatile as GROUP BY.
I suggest leaving this thread open: maybe other users will arrive here and explain their use case. Unfortunately, I am not a Rust programmer (I am a C++ and Java programmer), so I cannot contribute a patch.
@d33tah Could you please elaborate on your initial request? If this is really a request for something as elaborate as a general GROUP BY, then this will be easy for me to close, because I don't think `xsv` will have that. It is not a goal for `xsv` to become as versatile as SQL.
Ok, let's do that. I cannot work on it right now, but next week I will have some spare time to try to implement it. I wanted to learn Rust anyway, so it is a good opportunity for me.
I will implement a patch and send it to you. If you like it, you can integrate it; if not, no problem.
There is another tool I just ran into that seems to do this. https://www.gnu.org/software/datamash/examples/#example_grouping
Definitely need something like this. My particular use case is a long list of transactions (charges and payments) with an invoice number, a person's name, and the transaction amount in each row (the amount can be positive or negative). I'd like to compute the total of each invoice, so something like @micrenda's suggestion above would work (slightly modified syntax to use the standard `--select` option):
xsv aggregate --select Name,InvoiceNo --sum Amount
FWIW, I think that sqlite3 can suffice here with `.import <file.csv> table` and aggregating based off that. The output of xsv can be fed into sqlite3, and the output of sqlite3 can be fed into xsv.
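For completeness, the same GROUP BY can also be scripted without the sqlite3 shell, for example via Python's built-in `sqlite3` module (the table and column names below are made up for illustration):

```python
import csv, io, sqlite3

# Tiny stand-in for a transactions CSV (hypothetical data).
data = ("Name,InvoiceNo,Amount\n"
        "Alice,1001,250.0\n"
        "Alice,1001,-50.0\n"
        "Bob,1002,100.0\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (Name TEXT, InvoiceNo TEXT, Amount REAL)")
reader = csv.DictReader(io.StringIO(data))
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 ((r["Name"], r["InvoiceNo"], r["Amount"]) for r in reader))

# The equivalent of the proposed `xsv aggregate --select Name,InvoiceNo --sum Amount`.
for row in conn.execute(
        "SELECT Name, InvoiceNo, SUM(Amount) FROM t "
        "GROUP BY Name, InvoiceNo ORDER BY InvoiceNo"):
    print(row)
# ('Alice', '1001', 200.0)
# ('Bob', '1002', 100.0)
```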
I am using Miller for this.
mlr --csv stats1 -a sum -f 'Name,InvoiceNo' -g 'VAT_number' invoices.csv > invoices-grouped-by-vat-number.csv