DataFrames.jl
`combine!` which modifies DataFrame
This came up yesterday on Slack.
We have `@chain` and `@pipe`, etc., which make it easier to modify data without creating tons of intermediate names. We also have `transform!`, `select!`, etc. to modify data frames in place. Maybe we should also add `combine!`. It's not hard to do something like
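    function combine!(gd, args...)
        res = combine(gd, args...)   # splat so every transformation is forwarded
        df = parent(gd)
        # Drop all existing columns of the parent. This assumes a DataFrames.jl
        # version where `empty!(df)` only removes rows, so columns go explicitly.
        select!(df, Int[])
        for n in names(res)
            df[!, n] = res[!, n]
        end
        return df
    end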
Just putting this out there as a possibility. It might make a Stata-esque "declarative" workflow easier. But it might also introduce hard-to-diagnose bugs. Figured I would post as an issue so there could be some discussion.
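For context, here is a minimal sketch of the asymmetry being discussed, using only functions that already exist in DataFrames.jl. The data and the column names `b2` and `b_sum` are made up for illustration.

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 2], b = [1, 2, 3, 4])

# In place: `df` itself gains a new column, no rebinding needed.
transform!(df, :b => ByRow(x -> 2x) => :b2)

# `combine` returns a new data frame, so the name has to be rebound
# (or a new intermediate name introduced).
df = combine(groupby(df, :a), :b => sum => :b_sum)
```

A `combine!` would remove the need for the `df = ...` rebinding in the last step.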
I have not added it for two reasons:
1. There is no performance benefit of `combine!` over `combine`.
2. `combine!` is problematic for `GroupedDataFrame` as it breaks the grouping index (it is not a problem for `select!` and `transform!`, which guarantee to keep rows).

So in short - we could add it, but as you indicate, when used on `GroupedDataFrame` it could lead to hard-to-catch bugs, so I decided not to add it. What do you think given these considerations?
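The second point can be sketched with existing DataFrames.jl functions. The data and the column names `b_sum` and `total` are illustrative assumptions, not part of the proposal.

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 2], b = [1, 2, 3, 4])
gd = groupby(df, :a)

# `transform!` on a GroupedDataFrame adds a column to the parent but keeps
# all four rows, so the grouping stored in `gd` still matches `df`.
transform!(gd, :b => sum => :b_sum)
nrow(df), length(gd)

# `combine` collapses each group to a single row. Writing this result back
# into `df` would leave `gd` indexing rows that no longer exist.
res = combine(gd, :b => sum => :total)
nrow(res)
```

Because the row count changes from four to two, a `combine!` could not leave the grouping index of `gd` valid without rebuilding it.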
These are good considerations.
For 1., I do appreciate how `!` means "performance benefit" and not exclusively mutation.
I guess for 2., it's always possible to "break" a `GroupedDataFrame`.
    julia> using DataFrames

    julia> df = DataFrame(a = [1, 1, 2, 2], b = [1, 2, 3, 4]);

    julia> gd = groupby(df, :a);

    julia> select!(df, :b);

    julia> gd
    GroupedDataFrame with 2 groups based on key: Error showing value of type GroupedDataFrame{DataFrame}:
    ERROR: grouping column names not found in data frame column names
So if users can see the above, maybe they could also be exposed to some error with `combine!`. Though it is scary to do something that is guaranteed to destroy its input, leading to an error.
Maybe DataFramesMeta can get around this with a `@by!` macro when the grouped data frame is temporary anyway. But I would be hesitant to add something not in DataFrames.
We could actually change `gd` as well, given that it's also mutable.
Perhaps this change could also be considered in tandem with `join!` and `hcat!`. I would have to do more digging to figure out what the full suite of "candidates for in-place modification" would be.
Yes - that is why I do not reject it. Also joins with `!` and `hcat!` are on my list.
The `select!(df, :b)` case is not super problematic I think (as you are explicitly modifying the parent `df`). The key thing is that `select!(gd, transforms...)` is safe, as this is the case that would be more likely to lead to unexpected bugs. Still - as you say, we could update `gd` in such a case.
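Until something like that exists, a grouping can simply be rebuilt after the parent is mutated. A minimal sketch with made-up data, using only existing DataFrames.jl functions:

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 2], b = [1, 2, 3, 4])
gd = groupby(df, :a)

# Removing rows from the parent changes the row indices the grouping was
# built on, so the old `gd` should be treated as stale from here on.
filter!(:b => >(1), df)

# Rebuild the grouping against the mutated parent.
gd = groupby(df, :a)
```

An in-place update of `gd` would amount to doing this regrouping automatically.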
Okay. I'm glad you are interested! Let's leave this open for now.
There are other avenues to think of when it comes to emulating a "declarative" workflow. This is a promising one. It's always nice to emulate the simplicity of Stata.
I wonder if we can also add more interactivity to `@chain` at the REPL. I will keep thinking on this.