DataFrames.jl

`combine!` which modifies DataFrame

Open pdeffebach opened this issue 3 years ago • 4 comments

This came up yesterday on Slack.

We have @chain and @pipe etc. which make it easier to modify data without creating tons of intermediate names.

We also have transform!, select!, etc. to modify data frames in-place.

Maybe we should also add combine!. It's not hard to do something like

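function combine!(gd, args...)
    res = combine(gd, args...)   # splat the transformations through to combine
    df = parent(gd)
    select!(df, [])              # drop every column (empty!(df) would only drop the rows)
    for n in names(res)
        df[!, n] = res[!, n]     # move the combined columns into the parent
    end
    return df                    # note: gd no longer matches df after this
end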

Just putting this out there as a possibility. It might make a Stata-esque "declarative" workflow easier. But it might also introduce hard-to-diagnose bugs. Figured I would post it as an issue so there could be some discussion.

pdeffebach, Apr 09 '21 15:04

I have not added it for two reasons:

  1. There is no performance benefit of combine! over combine.
  2. combine! is problematic for GroupedDataFrame as it breaks the grouping index (this is not a problem for select! and transform!, which are guaranteed to keep rows).

So in short: we could add it, but as you indicate, when used on a GroupedDataFrame it could lead to hard-to-catch bugs, so I decided not to add it. What do you think given these considerations?
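To make the row-preservation point concrete, here is a minimal sketch. The data and the column names b_sum and b_total are made up for illustration.

```julia
using DataFrames

df = DataFrame(a = [1, 1, 2, 2], b = [1, 2, 3, 4])
gd = groupby(df, :a)

# transform! mutates the parent in place but keeps every row,
# so the grouping index stored in gd still matches df afterwards.
transform!(gd, :b => sum => :b_sum)
@assert nrow(df) == 4
@assert df.b_sum == [3, 3, 7, 7]
@assert length(gd) == 2

# combine collapses each group to one row and returns a new data frame.
# Writing this result back into df would change its row count,
# which is what would invalidate gd.
res = combine(gd, :b => sum => :b_total)
@assert nrow(res) == 2
@assert res.b_total == [3, 7]
@assert nrow(df) == 4   # the parent is untouched by combine
```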

bkamins, Apr 09 '21 17:04

These are good considerations.

For 1. I do appreciate how ! means "performance benefit" and not exclusively mutation.

I guess for 2., it's always possible to "break" a GroupedDataFrame.

julia> using DataFrames

julia> df = DataFrame(a = [1, 1, 2, 2], b = [1, 2, 3, 4]);

julia> gd = groupby(df, :a);

julia> select!(df, :b);

julia> gd
GroupedDataFrame with 2 groups based on key: Error showing value of type GroupedDataFrame{DataFrame}:
ERROR: grouping column names not found in data frame column names

So if users can see the above, maybe they could also be exposed to some error with combine!. Though it is scary to do something that is guaranteed to destroy its input, leading to an error.

Maybe DataFramesMeta can get around this with a @by! macro when the grouped data frame is temporary anyway. But I would be hesitant to add something not in DataFrames.
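As a rough sketch of that idea in plain function form: the helper below, by!, is hypothetical and is not part of DataFrames.jl or DataFramesMeta.jl. It assumes the grouping is done inside the function, so the GroupedDataFrame is temporary and nobody can hold a stale reference to it.

```julia
using DataFrames

# Hypothetical helper for illustration only; not an existing API.
function by!(df::DataFrame, cols, args...)
    # The GroupedDataFrame only lives inside this call.
    res = combine(groupby(df, cols), args...)
    # Drop all columns of df, then move the combined columns in.
    select!(df, [])
    for n in names(res)
        df[!, n] = res[!, n]
    end
    return df
end

df = DataFrame(a = [1, 1, 2, 2], b = [1, 2, 3, 4])
by!(df, :a, :b => sum => :b_sum)
@assert df == DataFrame(a = [1, 2], b_sum = [3, 7])
```

Any GroupedDataFrame created from df before the call would still be invalidated, so this only avoids the problem for the grouping made inside the helper.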

We could actually change gd as well, given that it's also mutable.

Perhaps this change could also be considered in tandem with join! and hcat!. I would have to do more digging to figure out what the full suite of "candidates for in-place modification" would be.

pdeffebach, Apr 09 '21 17:04

Yes - that is why I do not reject it. Also joins with ! and hcat! are on my list.

The select!(df, :b) case is not super problematic I think (as you are modifying the parent df). The key thing is that select!(gd, transforms...) is safe (as this is the case that would be more likely to lead to unexpected bugs). Still, as you say, we could update gd in such a case.

bkamins, Apr 09 '21 17:04

Okay. I'm glad you are interested!

Let's leave this open for now.

There are other avenues to think of when it comes to emulating a "declarative" workflow. This is a promising one. It's always nice to emulate the simplicity of Stata.

I wonder if we can also add more interactivity to @chain at the REPL. I will keep thinking on this.

pdeffebach, Apr 09 '21 17:04