feature: `cols=:union` argument (or something like it) for `combine` with `AsTable`

Open • kleinschmidt opened this issue 2 years ago • 7 comments

I have a function myfun that operates on one or more rows of my data frame and returns Tables.jl-compliant output. The columns will not always be the same for every input. I'd like to be able to do something like

combine(groupby(df, [:group1, :group2]), [:input1, :input2] => myfun => AsTable)

But I get an error that the keys must all be the same, and there's no cols argument to control how that is handled.

If I want to do this manually, I can do something like (thanks to @bkamins for suggesting):

reduce(vcat, [insertcols!(DataFrame(myfun(v.input1, v.input2)), pairs(k)...) for (k, v) in pairs(groupby(df, [:group1, :group2]))]; cols=:union)

but that's pretty clunky and you miss out on the nice transform syntax (have to manually do v.input1 etc.).
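For reference, here is a minimal, self-contained sketch of that manual workaround. The toy data and the body of myfun are made up for illustration; only the column names and the overall pattern come from the snippet above.

```julia
using DataFrames

# Toy data: column names follow the issue, the values are invented.
df = DataFrame(group1 = [1, 1, 2, 2],
               group2 = ["a", "a", "b", "b"],
               input1 = 1:4,
               input2 = 5:8)

# Hypothetical myfun: returns a column table (NamedTuple of vectors)
# whose set of columns depends on the input.
myfun(x, y) = sum(x) > 5 ? (total = [sum(x)], extra = [sum(y)]) :
                           (total = [sum(x)],)

# One small DataFrame per group, with the group key values appended as columns.
parts = [insertcols!(DataFrame(myfun(v.input1, v.input2)), pairs(k)...)
         for (k, v) in pairs(groupby(df, [:group1, :group2]))]

# cols = :union keeps every column seen in any part and fills gaps with missing.
result = reduce(vcat, parts; cols = :union)
```

The group whose output lacks the extra column gets missing in that column of result.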

I propose adding a cols kwarg to combine to control how keys from AsTable are handled, although that may be a bit punny.

kleinschmidt • Feb 16 '22 18:02

I will have to think about the best design for this feature. The challenge is that the assumption that all elements have the same set of columns is important for performance.

bkamins • Feb 16 '22 20:02

It's also possible that another API function might better express this; unnest(df, :tblcol) is one possibility.

kleinschmidt • Feb 16 '22 22:02

This is something that I was thinking about. In unnest(data_frame, :col), if we wanted to be consistent with AsTable we would rely on what keys returns to identify column names. But I am not sure if this is the best approach.

Let us discuss what logic for identification of fields would be best. The issue is that keys works nicely with Dict, but not with struct. But if we go for propertynames the problem arises in the opposite direction.
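To make the mismatch concrete, here is a small sketch in plain Julia. The Point struct is hypothetical and exists only for this illustration.

```julia
# Hypothetical struct used only for illustration.
struct Point
    x::Int
    y::Int
end

d  = Dict(:x => 1, :y => 2)
nt = (x = 1, y = 2)
p  = Point(1, 2)

# keys returns the stored keys of a Dict and the field names of a NamedTuple ...
dict_keys = sort!(collect(keys(d)))   # Dict iteration order is unspecified, so sort
nt_keys   = keys(nt)

# ... but it has no method for an arbitrary struct.
struct_has_keys = applicable(keys, p)

# propertynames returns the field names of a struct and of a NamedTuple ...
struct_props = propertynames(p)
nt_props     = propertynames(nt)

# ... but for a Dict it reports the internal fields of the Dict type,
# not the stored keys.
dict_props = propertynames(d)
```

A NamedTuple is the one case where both functions agree, which is why it works smoothly with AsTable today.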

bkamins • Feb 17 '22 08:02

x-ref https://github.com/JuliaData/DataFrames.jl/issues/2890

bkamins • Feb 17 '22 20:02

Is this what you want?

julia> df = DataFrame(nested=[(a=1, b=2), (b=3, c=4), (a=5, c=6)])
3×1 DataFrame
 Row │ nested
     │ NamedTup…      
─────┼────────────────
   1 │ (a = 1, b = 2)
   2 │ (b = 3, c = 4)
   3 │ (a = 5, c = 6)

julia> transform(df, :nested => Tables.dictrowtable => AsTable)
3×4 DataFrame
 Row │ nested          a        b        c       
     │ NamedTup…       Int64?   Int64?   Int64?  
─────┼───────────────────────────────────────────
   1 │ (a = 1, b = 2)        1        2  missing 
   2 │ (b = 3, c = 4)  missing        3        4
   3 │ (a = 5, c = 6)        5  missing        6

If yes, then we already have it. I have, however, opened https://github.com/JuliaData/Tables.jl/issues/274 to allow for better control of the resulting column order.

bkamins • Feb 20 '22 15:02

@kleinschmidt When https://github.com/JuliaData/Tables.jl/issues/274 is resolved, can you please confirm that it gives you the functionality you need? Also, do you think this pattern is enough, or would you want to see an unnest function that would do something roughly like the following (of course the details will be more complex, which is why adding unnest might be useful):

unnest(df, col::SingleColumnIndex) = select(df, Not(col), col => Tables.dictrowtable => AsTable)
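As a rough, runnable sketch of that one-liner: unnest_sketch is a hypothetical name, not an existing DataFrames.jl function, and the column argument is typed with a plain Union as an assumption. It also assumes Tables.jl is available in the environment.

```julia
using DataFrames
import Tables   # assumes Tables.jl is installed alongside DataFrames.jl

# Hypothetical helper, not part of DataFrames.jl: drop the nested column and
# expand it into one column per key, filling absent keys with missing.
unnest_sketch(df::AbstractDataFrame, col::Union{Symbol, AbstractString}) =
    select(df, Not(col), col => Tables.dictrowtable => AsTable)

# Same data as in the example earlier in the thread.
df = DataFrame(nested = [(a = 1, b = 2), (b = 3, c = 4), (a = 5, c = 6)])

out = unnest_sketch(df, :nested)
```

Unlike the transform call shown earlier, this drops the original nested column and keeps only the expanded ones.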

bkamins • Mar 03 '22 21:03

x-ref https://github.com/JuliaData/DataFrames.jl/issues/3116 (we will need to make a joint decision on how to handle this)

bkamins • Feb 05 '23 07:02