DataFrames.jl
feature: `cols=:union` argument (or something like it) for `combine` with `AsTable`
I have a function `myfun` that operates on one or more rows from my dataframe and returns Tables-compliant output. The columns will not always be the same for every input. I'd like to be able to do something like
combine(groupby(df, [:group1, :group2]), [:input1, :input2] => myfun => AsTable)
But I get an error that the keys must all be the same, and there's no `cols` argument to control how that is handled.
If I want to do this manually, I can do something like (thanks to @bkamins for suggesting):
reduce(vcat, [insertcols!(DataFrame(myfun(v.input1, v.input2)), pairs(NamedTuple(k))...) for (k, v) in pairs(groupby(df, [:group1, :group2]))]; cols=:union)
but that's pretty clunky and you miss out on the nice transform syntax (you have to manually write `v.input1` etc.).
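For anyone who wants to try the manual workaround, here is a minimal runnable sketch. The body of `myfun` and the toy data are made up purely for illustration (the real `myfun` is not shown in this issue); the only point is that different groups return different column sets and `cols=:union` fills the gaps with `missing`.

```julia
using DataFrames

# Hypothetical stand-in for `myfun`: returns a one-row column table whose
# column names depend on the input, so groups yield different column sets.
function myfun(x, y)
    s = sum(x) + sum(y)
    return iseven(s) ? (even_total = [s],) : (odd_total = [s], n = [length(x)])
end

# Toy data, invented for this sketch.
df = DataFrame(group1 = [1, 1, 2, 2],
               group2 = ["a", "a", "b", "b"],
               input1 = 1:4,
               input2 = [10, 20, 30, 41])

gdf = groupby(df, [:group1, :group2])

# One small DataFrame per group, with the group key values appended as columns.
parts = [insertcols!(DataFrame(myfun(v.input1, v.input2)), pairs(NamedTuple(k))...)
         for (k, v) in pairs(gdf)]

# Stack the pieces, taking the union of all column names.
result = reduce(vcat, parts; cols = :union)
```

Columns that a given group did not produce come back as `missing` in `result`.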
I propose adding a `cols` kwarg to `combine` to control how keys are handled from `AsTable`, although that may be a bit punny.
I will have to think about the best design for this feature. The challenge is that the assumption that all elements have the same set of columns is important for performance.
It's also possible that another API function might better express this; `unnest(df, :tblcol)` is one possibility.
This is something that I was thinking about. In `unnest(data_frame, :col)`, if we wanted to be consistent with `AsTable`, we would rely on what `keys` returns to identify column names. But I am not sure if this is the best approach.
Let us discuss what logic for identifying fields would be best. The issue is that `keys` works nicely with `Dict`, but not with `struct`. But if we go for `propertynames`, the problem arises in the opposite direction.
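To make the trade-off concrete, here is a small sketch using only Base Julia. The `Point` struct is a hypothetical example type, not something from DataFrames.jl.

```julia
# Hypothetical struct used only for this comparison.
struct Point
    x::Int
    y::Int
end

d  = Dict(:a => 1, :b => 2)
nt = (a = 1, b = 2)
p  = Point(1, 2)

# NamedTuple: both functions agree on the names.
nt_keys  = keys(nt)
nt_props = propertynames(nt)

# Dict: `keys` returns the entries we care about, while `propertynames`
# returns the internal fields of the Dict implementation.
d_keys  = sort!(collect(keys(d)))
d_props = propertynames(d)

# struct: `propertynames` returns the fields, but `keys` has no method.
p_props      = propertynames(p)
p_has_keys   = hasmethod(keys, Tuple{Point})
```

So neither function alone covers `Dict`, `NamedTuple`, and plain structs.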
x-ref https://github.com/JuliaData/DataFrames.jl/issues/2890
Is this what you want?
julia> df = DataFrame(nested=[(a=1, b=2), (b=3, c=4), (a=5, c=6)])
3×1 DataFrame
 Row │ nested
     │ NamedTup…
─────┼────────────────
   1 │ (a = 1, b = 2)
   2 │ (b = 3, c = 4)
   3 │ (a = 5, c = 6)
julia> transform(df, :nested => Tables.dictrowtable => AsTable)
3×4 DataFrame
 Row │ nested          a        b        c
     │ NamedTup…       Int64?   Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │ (a = 1, b = 2)        1        2  missing
   2 │ (b = 3, c = 4)  missing        3        4
   3 │ (a = 5, c = 6)        5  missing        6
If so, then we already have it. I have, however, opened https://github.com/JuliaData/Tables.jl/issues/274 to allow better control of the resulting column order.
@kleinschmidt Once https://github.com/JuliaData/Tables.jl/issues/274 is resolved, can you please confirm that it gives you the functionality you need?
Also, do you think this pattern is enough, or would you want to see an `unnest` function that would do something roughly like the following (of course the details will be more complex, which is why adding `unnest` might be useful):
unnest(df, col::SingleColumnIndex) = select(df, Not(col), col => Tables.dictrowtable => AsTable)
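A runnable version of that rough idea, applied to the example data from earlier in the thread, might look like the sketch below. The name `unnest_sketch` is hypothetical (it is not part of DataFrames.jl), the `id` column is added only to show that other columns are kept, and the column argument is narrowed to `Symbol` or string as a simplifying assumption.

```julia
using DataFrames, Tables

# Hypothetical helper, not a DataFrames.jl function: drop `col` and replace it
# with one column per key found in its elements, filling gaps with `missing`.
unnest_sketch(df::AbstractDataFrame, col::Union{Symbol, AbstractString}) =
    select(df, Not(col), col => Tables.dictrowtable => AsTable)

# Same nested data as in the example above, plus an `id` column for illustration.
df = DataFrame(id = 1:3,
               nested = [(a = 1, b = 2), (b = 3, c = 4), (a = 5, c = 6)])

out = unnest_sketch(df, :nested)
```

Here `out` keeps `id` and gains the columns `a`, `b`, and `c`, with `missing` wherever a row's NamedTuple lacked that key.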
x-ref https://github.com/JuliaData/DataFrames.jl/issues/3116 (we will need to jointly make a decision how to handle this)