Revisit spreading for `AsTable` output

Open pdeffebach opened this issue 2 years ago • 7 comments

Here is a use-case where DataFrames.jl compares unfavorably to dplyr.

Basically, the best way to do interdependent column transformations is to use AsTable and return a NamedTuple. However, in the OP's example, they want to return a scalar and a vector simultaneously, so AsTable isn't an option without some awkward spreading of the scalar.

Maybe we can spread scalar outputs when AsTable is the destination?
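To make the "awkward spreading" concrete, here is a minimal sketch of the manual workaround, assuming a small example table with illustrative column names `id` and `val`. The scalar is repeated with `fill` so that every field of the returned NamedTuple is a vector of the same length. The commented-out call is the mixed scalar/vector form that the thread reports as throwing an error.

```julia
using DataFrames

# Illustrative data: two groups of five rows each
df = DataFrame(id=repeat([1, 2], 5), val=1:10)
gdf = groupby(df, :id)

# Reported in this thread to throw: a NamedTuple mixing a scalar and a vector
# combine(gdf, :val => (x -> (total=sum(x), frac=x ./ sum(x))) => AsTable)

# Workaround: spread the scalar by hand so both fields are vectors
res = combine(gdf,
    :val => (x -> (s = sum(x); (total=fill(s, length(x)), frac=x ./ s))) => AsTable)
```

The `fill(s, length(x))` call is the boilerplate the issue proposes to avoid.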

pdeffebach, Dec 13 '23 17:12

> and return a NamedTuple.

The design is to:

  • return a NamedTuple if you do not want pseudo broadcasting;
  • return a DataFrame if you want it.

Example:

julia> df = DataFrame(id=repeat([1, 2], 5), val=1:10)
10×2 DataFrame
 Row │ id     val
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     1      3
   4 │     2      4
   5 │     1      5
   6 │     2      6
   7 │     1      7
   8 │     2      8
   9 │     1      9
  10 │     2     10

julia> combine(groupby(df, :id), :val => (x -> (s=sum(x); DataFrame(total=s, frac=x ./ s))) => AsTable)
10×3 DataFrame
 Row │ id     total  frac
     │ Int64  Int64  Float64
─────┼─────────────────────────
   1 │     1     25  0.04
   2 │     1     25  0.12
   3 │     1     25  0.2
   4 │     1     25  0.28
   5 │     1     25  0.36
   6 │     2     30  0.0666667
   7 │     2     30  0.133333
   8 │     2     30  0.2
   9 │     2     30  0.266667
  10 │     2     30  0.333333

The question is how to reflect this in DataFramesMeta.jl.

bkamins, Dec 14 '23 19:12

Hmmm... I don't love the performance hit that would come with constructing a DataFrame. With the current implementation of the @astable macro-flag, I would have to decide whether to construct a DataFrame or a NamedTuple.

I wonder if it's best for DataFramesMeta.jl to do the broadcasting on its own.

However, your response is a bit confusing:

> return a NamedTuple if you do not want pseudo broadcasting;

since currently returning (a = 1, b = [4, 5, 6]) throws an error. So it's not a broadcasting behavior you can opt in to or out of.

pdeffebach, Dec 14 '23 20:12

> throws an error.

Yes, because NamedTuple does not do broadcasting (as opposed to DataFrame).
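The difference can be seen outside of `combine` as well. As a small sketch using the values from the comment above: the `DataFrame` constructor recycles a scalar to the length of the vector columns, while a NamedTuple simply stores its fields unchanged.

```julia
using DataFrames

# The DataFrame constructor pseudo-broadcasts the scalar `1` to three rows
d = DataFrame(a=1, b=[4, 5, 6])

# A NamedTuple keeps the scalar and the vector exactly as given
nt = (a=1, b=[4, 5, 6])
```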

bkamins, Dec 14 '23 21:12

Also, I think that the performance hit, although noticeable, would be a minor issue for most users. The major issue is, as you write, that @astable has to have only one meaning. Maybe @asdf or @asdataframe would be an alternative name?

bkamins, Dec 14 '23 22:12

Are there other advantages of making a DataFrame inside fun?

I don't really want to have to introduce both @astable and @asdataframe into the docs and tutorials if the only difference is the spreading behavior. I would just as soon do some spreading inside the anonymous function instead.
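One way such spreading could look is sketched below. The helper name `spread_scalars` is hypothetical and is not part of DataFrames.jl or DataFramesMeta.jl. It assumes that any field that is not an `AbstractVector` should be treated as a scalar and repeated to the common length of the vector fields.

```julia
# Hypothetical helper: repeat scalar fields of a NamedTuple so that every
# field becomes a vector of the same length.
function spread_scalars(nt::NamedTuple)
    # Lengths of all vector-valued fields
    lens = [length(v) for v in values(nt) if v isa AbstractVector]
    # All fields are scalars: nothing to spread, keep the single row
    isempty(lens) && return nt
    n = first(lens)
    all(==(n), lens) ||
        throw(DimensionMismatch("vector fields must have equal lengths"))
    # `map` over a NamedTuple preserves the field names
    return map(v -> v isa AbstractVector ? v : fill(v, n), nt)
end

spread = spread_scalars((total=25, frac=[0.04, 0.12]))
```

Here `spread` holds `total` as a two-element vector alongside the unchanged `frac`, which is the all-vectors shape of NamedTuple that `AsTable` expands into rows.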

pdeffebach, Dec 14 '23 22:12

> Are there other advantages of making a DataFrame inside fun?

I do not think so. The only other one would be making column names unique, but I guess that is not an issue in your case.

bkamins, Dec 15 '23 10:12