DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

Transposing DataFrame

Open sl-solution opened this issue 3 years ago • 15 comments

I developed a package to explore the idea of implementing a function to transpose DataFrames. The README.md file of the developed package contains some examples for reference, and can be access from https://github.com/sl-solution/DFTranspose.jl

The transposing of data has been discuss previously, e.g. #2732,#1181, #2698, #2422, #2215, #2205, #2148, #1839, and the implemented function is trying to tackle some of them.

I think it should be a good idea to include this functionality in DataFrames.jl.

sl-solution avatar May 01 '21 13:05 sl-solution

I'll try this out in my next project and let you know what I think!

pdeffebach avatar May 03 '21 13:05 pdeffebach

@sl-solution - just to give you a perspective. We are aware that our current design (and also in other ecosystems) is not ideal. However, it will take some time to test/give recommendations for your proposal, as this requires field testing.

bkamins avatar May 03 '21 18:05 bkamins

@bkamins that seems ok, I am currently working on its performance and keep updating it. Adding one more thing, I know in the current design, tabulating data (pivot and aggregation) is not part of DataFrames, however, if it is going to change, I like to point that it is not very difficult to implement them with transposing data (using combine to calculate the summary, using transposing to display them, and modifying pretty table to show multi column heads).

sl-solution avatar May 03 '21 23:05 sl-solution

This is a long standing issue https://github.com/JuliaData/DataFrames.jl/pull/1181 which I hope to resolve in the future. Simply there was no agreement about the best design for it yet.

However, I expect that the requirement:

modifying pretty table to show multi column heads

will be the hardest to handle in DataFrames.jl (I have not thought about it much yet though).

bkamins avatar May 03 '21 23:05 bkamins

Modifying PrettyTables to have multi-column headers is very high on my wishlist of data utilities, particularly for latex, which may be the easiest case.

pdeffebach avatar May 04 '21 00:05 pdeffebach

I was thinking about this issue and come to the conclusion that the good API for all we need here is as follows (the arg/kwarg order/naming can be changed, but I want to pass a general idea). It generally goes along what @sl-solution proposed but extends it and builds on a basic intuition that people have and our API in data frames:

pivot(data_frame, columns_to_put_in_rows, columns_to_put_incolumns, transformations..., kwargs)

I assume that columns_to_put_in_rows is easy to decide what it should do. We would only need to decided on the row order.

With columns_to_put_incolumns we would need to decide on the column naming scheme in case more than one column is passed.

Now the transformations... part is crucial. Here one could put just a column, several columns or even a transformation of column. Examples (as you can see we can then use the syntax we have for select and friends to handle things):

  • :col (one column)
  • r"x" (several columns)
  • :col => sum (aggregate)
  • AsTable(r"x") => fun (aggregate over several columns)

The only challenge is to decide on two things: how do we want to handle when multiple rows are returned (probably as currently - just populate several rows with the same columns_to_put_in_rows) and how to handle multiple return values (should we squeeze them into e.g. a tuple stored in a cell or expand - and if yes - horizontally or vertically; maybe this could be controlled by kwargs)

kwargs would handle extra options like what should be the sentinel for a missing intersection, sorting, expansion of multiple transformations.


This approach - if I understand things correctly - extends the proposal of @sl-solution and essentially makes everything that pivot table e.g. in Excel provides (and even more as we would allow not to do aggregation - pivot tables normally enforce aggregation, but in stack/unstack we exactly do not aggregate and people like it)

What do you think about it (and if you like it what do you think about the key design decisions I have outlined)?

bkamins avatar May 23 '21 20:05 bkamins

Maybe it needs a little more thinking (BTW I think something like table, tabulate,... is a better name because it is really more than pivoting), few things

  • the new functionality is mostly(only?) needed for displaying data (I guess combine can be used o.w.), thus, the printing is important part of the job
  • since it will be used for displaying, we need some functionality to let the user control each cell behaviour in some sense (this is challenging)
  • we need some kind of sub-aggregation, e.g. subtotal (this needs a good design)
  • I guess multiple columns in aggregation should be handled by user (o.w. it causes several problems)

sl-solution avatar May 23 '21 21:05 sl-solution

Just a comment on the API here, I would like as may keyword arguments as possible instead of positional arguments. Given all arguments will be the same type (column selectors) and the order of inputs is not obvious, keyword arguments are best.

  • since it will be used for displaying, we need some functionality to let the user control each cell behaviour in some sense (this is challenging)

I think people use pivots for more than just displaying. I think cell-based transformations can be handled separately than reshaping.

pdeffebach avatar May 24 '21 13:05 pdeffebach

the new functionality is mostly(only?) needed for displaying data

For me this functionality is mostly for transforming data. If we want a flexible data structure that supports tabulating with sub-aggregations, mulit-level row and column names, possibly other number of dimensions than just two etc. this should be a separate package I think. FreqTables.jl is an example of such a package, where freqtable that takes only length as an aggregation function.

Given all arguments will be the same type

This is what I assumed you would say 😄. Do you have good names in mind? rows and cols?

I think cell-based transformations can be handled separately than reshaping.

Yes - they could. The point is that reshaping is just identity transformation so if we support reshaping we can also support transformations. Of course transformations are not strictly required as you can always do groupby and combine before running pivot (or whatever we call it) and then assume that pivot only reshapes.

bkamins avatar May 24 '21 13:05 bkamins

Ah, sorry. I tihnk I am confused. Is this a replacement for stack and unstack or is pivoting / transposing conceptually different? I'm not super familiar with Excel so I don't know all the details here.

pdeffebach avatar May 24 '21 14:05 pdeffebach

Is this a replacement for stack and unstack or is pivoting / transposing conceptually different?

The point is that, as noted by @sl-solution, we could have one function that would cover what stack, unstack and pivot (proposed in #1181) would do.

bkamins avatar May 24 '21 14:05 bkamins

As for Excel. Such table:

A B C D X
p x a k 1
p x a l 2
p x b k 3
p x b l 4
p y a k 5
p y a l 6
p y b k 7
p y b l 8
q x a k 9
q x a l 10
q x b k 11
q x b l 12
q y a k 13
q y a l 14
q y b k 15
q y b l 16

if X is aggregated with sum and mean with A and B as rows and C and D as columns is shown as: image and as you can see you have sub aggregations and multi level rows and columns which I think we will not support in DataFrames.jl.

bkamins avatar May 24 '21 14:05 bkamins

Ah okay. So the key benefit here is the nesting based on the D column.

This is interesting, but I'm not sure its conceptually similar enough to be the same function as unstack. But I will be interested to see an API emerge.

pdeffebach avatar May 24 '21 15:05 pdeffebach

So the key benefit here is the nesting based on the D column.

yes - and the ability to:

  1. have more than one values column
  2. potentially transform this column

And all these three features are something that users ask about.

But what I want to avoid is:

  1. doing sub-aggregations (like the gray columns in the table from excel)
  2. the nesting of D column should be flattened and expressed as a column name

Finally if more than one values column is passed we need to decide how they are shown (Excel shows them as additional columns, but I am not sure it is the best approach if we do not support multil-level column index)

bkamins avatar May 24 '21 15:05 bkamins

For me this functionality is mostly for transforming data....

It is indeed just aggregation and transposing in some sense. And aggregation and the proposed df_transpose() can handle it already, but it would not be displayed properly and it would lack some subtle features.

I guess this is very interesting feature (and already the main computation parts are available inside DataFrames.jl) to include in DataFrames.jl rather than ask people to install another package.

I think people use pivots for more than just displaying...

I can't think of any other scenario? may you give some examples.

  1. doing sub-aggregations (like the gray columns in the table from excel)

Those bold numbers are also part of this.

My draft idea is:

The tabulation of data is nothing more than using combine and transpose (perhaps multiple calls to these functions). However, displaying these computations in a table (it is why I think pivot is not the right word - maybe it is good for EXCEL!) is the key idea. Conceptually user selects which categories go to row and which go to columns, e.g.

A B
I K I K
P X
Y
Q X
Y

I was thinking about this issue and come to the conclusion that the good API for all we need here is as follows (the arg/kwarg order/naming can be changed, but I want to pass a general idea).....

probably this proposal can handle the above scenario, however, the following scenarios don't fit in it

A B
I K I K
P X mean
X var
Y mean
Y var
Q X mean
X var
Y mean
Y var
A B
I K I K K
sum percent sum percent sum percent sum percent
P X1
X2
X3
Total
Q X1
X2
X3
Total

Maybe we need to gather information about the variables first, roughly like, user selects which variables are defining categories, and which variables are going to be used for aggregation and then arranges them in what ever order s/he likes, for example, lets call the names of the variables in our example as V1, V2, V3, and V4, and we want mean and sum for variable V5. Then something like

[V1, V2,[sum mean]] ,[V3, V4] creates something like the first output above, and [V1, V2] ,[V3, V4, [sum mean]] creates something like the second one.

[V1, V2] means they nested with the order given, [V1 V2] means they are at the same level. Thus, [V1, V2, [sum mean]] means, sum and mean are at the same level however they are nested in V2 which is nested in V1. Sub-aggregation may have a special names but their place can follow the same rule. Also something like [V1, V2, [V3 V4], [mean sum], V6] is valid.

sl-solution avatar May 24 '21 21:05 sl-solution

Closing it as I feel this will not get ported to DataFrames.jl. Feel free to re-open if you think otherwise.

bkamins avatar Dec 05 '22 10:12 bkamins