DataFrames.jl
DataFrames.jl copied to clipboard
Transposing DataFrame
I developed a package to explore the idea of implementing a function to transpose DataFrame
s. The README.md file of the developed package contains some examples for reference, and can be access from
https://github.com/sl-solution/DFTranspose.jl
The transposing of data has been discuss previously, e.g. #2732,#1181, #2698, #2422, #2215, #2205, #2148, #1839, and the implemented function is trying to tackle some of them.
I think it should be a good idea to include this functionality in DataFrames.jl
.
I'll try this out in my next project and let you know what I think!
@sl-solution - just to give you a perspective. We are aware that our current design (and also in other ecosystems) is not ideal. However, it will take some time to test/give recommendations for your proposal, as this requires field testing.
@bkamins that seems ok, I am currently working on its performance and keep updating it.
Adding one more thing, I know in the current design, tabulating data (pivot and aggregation) is not part of DataFrames
, however, if it is going to change, I like to point that it is not very difficult to implement them with transposing data (using combine to calculate the summary, using transposing to display them, and modifying pretty table to show multi column heads).
This is a long standing issue https://github.com/JuliaData/DataFrames.jl/pull/1181 which I hope to resolve in the future. Simply there was no agreement about the best design for it yet.
However, I expect that the requirement:
modifying pretty table to show multi column heads
will be the hardest to handle in DataFrames.jl (I have not thought about it much yet though).
Modifying PrettyTables to have multi-column headers is very high on my wishlist of data utilities, particularly for latex, which may be the easiest case.
I was thinking about this issue and come to the conclusion that the good API for all we need here is as follows (the arg/kwarg order/naming can be changed, but I want to pass a general idea). It generally goes along what @sl-solution proposed but extends it and builds on a basic intuition that people have and our API in data frames:
pivot(data_frame, columns_to_put_in_rows, columns_to_put_incolumns, transformations..., kwargs)
I assume that columns_to_put_in_rows
is easy to decide what it should do. We would only need to decided on the row order.
With columns_to_put_incolumns
we would need to decide on the column naming scheme in case more than one column is passed.
Now the transformations...
part is crucial. Here one could put just a column, several columns or even a transformation of column. Examples (as you can see we can then use the syntax we have for select
and friends to handle things):
-
:col
(one column) -
r"x"
(several columns) -
:col => sum
(aggregate) -
AsTable(r"x") => fun
(aggregate over several columns)
The only challenge is to decide on two things: how do we want to handle when multiple rows are returned (probably as currently - just populate several rows with the same columns_to_put_in_rows
) and how to handle multiple return values (should we squeeze them into e.g. a tuple stored in a cell or expand - and if yes - horizontally or vertically; maybe this could be controlled by kwargs
)
kwargs
would handle extra options like what should be the sentinel for a missing intersection, sorting, expansion of multiple transformations.
This approach - if I understand things correctly - extends the proposal of @sl-solution and essentially makes everything that pivot table e.g. in Excel provides (and even more as we would allow not to do aggregation - pivot tables normally enforce aggregation, but in stack
/unstack
we exactly do not aggregate and people like it)
What do you think about it (and if you like it what do you think about the key design decisions I have outlined)?
Maybe it needs a little more thinking (BTW I think something like table
, tabulate
,... is a better name because it is really more than pivoting), few things
- the new functionality is mostly(only?) needed for displaying data (I guess
combine
can be used o.w.), thus, the printing is important part of the job - since it will be used for displaying, we need some functionality to let the user control each cell behaviour in some sense (this is challenging)
- we need some kind of sub-aggregation, e.g. subtotal (this needs a good design)
- I guess multiple columns in aggregation should be handled by user (o.w. it causes several problems)
Just a comment on the API here, I would like as may keyword arguments as possible instead of positional arguments. Given all arguments will be the same type (column selectors) and the order of inputs is not obvious, keyword arguments are best.
- since it will be used for displaying, we need some functionality to let the user control each cell behaviour in some sense (this is challenging)
I think people use pivots for more than just displaying. I think cell-based transformations can be handled separately than reshaping.
the new functionality is mostly(only?) needed for displaying data
For me this functionality is mostly for transforming data. If we want a flexible data structure that supports tabulating with sub-aggregations, mulit-level row and column names, possibly other number of dimensions than just two etc. this should be a separate package I think. FreqTables.jl is an example of such a package, where freqtable
that takes only length
as an aggregation function.
Given all arguments will be the same type
This is what I assumed you would say 😄. Do you have good names in mind? rows
and cols
?
I think cell-based transformations can be handled separately than reshaping.
Yes - they could. The point is that reshaping is just identity
transformation so if we support reshaping we can also support transformations. Of course transformations are not strictly required as you can always do groupby
and combine
before running pivot
(or whatever we call it) and then assume that pivot
only reshapes.
Ah, sorry. I tihnk I am confused. Is this a replacement for stack
and unstack
or is pivoting / transposing conceptually different? I'm not super familiar with Excel so I don't know all the details here.
Is this a replacement for
stack
andunstack
or is pivoting / transposing conceptually different?
The point is that, as noted by @sl-solution, we could have one function that would cover what stack
, unstack
and pivot
(proposed in #1181) would do.
As for Excel. Such table:
A | B | C | D | X |
---|---|---|---|---|
p | x | a | k | 1 |
p | x | a | l | 2 |
p | x | b | k | 3 |
p | x | b | l | 4 |
p | y | a | k | 5 |
p | y | a | l | 6 |
p | y | b | k | 7 |
p | y | b | l | 8 |
q | x | a | k | 9 |
q | x | a | l | 10 |
q | x | b | k | 11 |
q | x | b | l | 12 |
q | y | a | k | 13 |
q | y | a | l | 14 |
q | y | b | k | 15 |
q | y | b | l | 16 |
if X
is aggregated with sum
and mean
with A
and B
as rows and C
and D
as columns is shown as:
and as you can see you have sub aggregations and multi level rows and columns which I think we will not support in DataFrames.jl.
Ah okay. So the key benefit here is the nesting based on the D
column.
This is interesting, but I'm not sure its conceptually similar enough to be the same function as unstack
. But I will be interested to see an API emerge.
So the key benefit here is the nesting based on the D column.
yes - and the ability to:
- have more than one values column
- potentially transform this column
And all these three features are something that users ask about.
But what I want to avoid is:
- doing sub-aggregations (like the gray columns in the table from excel)
- the nesting of D column should be flattened and expressed as a column name
Finally if more than one values column is passed we need to decide how they are shown (Excel shows them as additional columns, but I am not sure it is the best approach if we do not support multil-level column index)
For me this functionality is mostly for transforming data....
It is indeed just aggregation and transposing in some sense. And aggregation and the proposed df_transpose() can handle it already, but it would not be displayed properly and it would lack some subtle features.
I guess this is very interesting feature (and already the main computation parts are available inside DataFrames.jl
) to include in DataFrames.jl
rather than ask people to install another package.
I think people use pivots for more than just displaying...
I can't think of any other scenario? may you give some examples.
- doing sub-aggregations (like the gray columns in the table from excel)
Those bold numbers are also part of this.
My draft idea is:
The tabulation of data is nothing more than using combine
and transpose
(perhaps multiple calls to these functions). However, displaying these computations in a table (it is why I think pivot
is not the right word - maybe it is good for EXCEL!) is the key idea. Conceptually user selects which categories go to row and which go to columns, e.g.
A | B | ||||
---|---|---|---|---|---|
I | K | I | K | ||
P | X | ||||
Y | |||||
Q | X | ||||
Y |
I was thinking about this issue and come to the conclusion that the good API for all we need here is as follows (the arg/kwarg order/naming can be changed, but I want to pass a general idea).....
probably this proposal can handle the above scenario, however, the following scenarios don't fit in it
A | B | |||||
---|---|---|---|---|---|---|
I | K | I | K | |||
P | X | mean | ||||
X | var | |||||
Y | mean | |||||
Y | var | |||||
Q | X | mean | ||||
X | var | |||||
Y | mean | |||||
Y | var |
A | B | ||||||||
---|---|---|---|---|---|---|---|---|---|
I | K | I | K | K | |||||
sum | percent | sum | percent | sum | percent | sum | percent | ||
P | X1 | ||||||||
X2 | |||||||||
X3 | |||||||||
Total | |||||||||
Q | X1 | ||||||||
X2 | |||||||||
X3 | |||||||||
Total |
Maybe we need to gather information about the variables first, roughly like, user selects which variables are defining categories, and which variables are going to be used for aggregation and then arranges them in what ever order s/he likes, for example, lets call the names of the variables in our example as V1
, V2
, V3
, and V4
, and we want mean and sum for variable V5
. Then something like
[V1, V2,[sum mean]] ,[V3, V4] creates something like the first output above, and [V1, V2] ,[V3, V4, [sum mean]] creates something like the second one.
[V1, V2] means they nested with the order given, [V1 V2] means they are at the same level. Thus, [V1, V2, [sum mean]] means, sum and mean are at the same level however they are nested in V2 which is nested in V1. Sub-aggregation may have a special names but their place can follow the same rule. Also something like [V1, V2, [V3 V4], [mean sum], V6] is valid.
Closing it as I feel this will not get ported to DataFrames.jl. Feel free to re-open if you think otherwise.