DataFrames.jl: Revisit spreading for `AsTable` output

Here is a use-case where DataFrames.jl compares unfavorably to dplyr.

Basically, the best way to do inter-dependent column transformations is to use AsTable and return a NamedTuple. However, in OP's example, they want to return a scalar and a vector simultaneously, so AsTable isn't an option without some awkward spreading of their scalar.
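For reference, a minimal sketch of what that manual spreading looks like today. The data frame mirrors the one used later in this thread, and `total_and_frac` is a hypothetical name chosen only for illustration:

```julia
using DataFrames

df = DataFrame(id=repeat([1, 2], 5), val=1:10)

# Hypothetical helper: returns a NamedTuple in which the scalar `s`
# is manually repeated to the group length with `fill`.
function total_and_frac(x)
    s = sum(x)
    return (total=fill(s, length(x)), frac=x ./ s)
end

res = combine(groupby(df, :id), :val => total_and_frac => AsTable)
```

The `fill(s, length(x))` call is the part that a built-in spreading rule would make unnecessary.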

Maybe we can spread scalar outputs when an AsTable is the dest?

Dec 13 '23 17:12 pdeffebach

> and return a NamedTuple.

The design is to:

- return a NamedTuple if you do not want pseudo broadcasting;
- return a DataFrame if you want it.

Example:

julia> df = DataFrame(id=repeat([1, 2], 5), val=1:10)
10×2 DataFrame
 Row │ id     val
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     1      3
   4 │     2      4
   5 │     1      5
   6 │     2      6
   7 │     1      7
   8 │     2      8
   9 │     1      9
  10 │     2     10

julia> combine(groupby(df, :id), :val => (x -> (s=sum(x); DataFrame(total=s, frac=x ./ s))) => AsTable)
10×3 DataFrame
 Row │ id     total  frac
     │ Int64  Int64  Float64
─────┼─────────────────────────
   1 │     1     25  0.04
   2 │     1     25  0.12
   3 │     1     25  0.2
   4 │     1     25  0.28
   5 │     1     25  0.36
   6 │     2     30  0.0666667
   7 │     2     30  0.133333
   8 │     2     30  0.2
   9 │     2     30  0.266667
  10 │     2     30  0.333333

The question is how to reflect this in DataFramesMeta.jl.

Dec 14 '23 19:12 bkamins

Hmmm... I don't love the performance hit that would come with constructing a DataFrame. With the current implementation of the @astable macro-flag, I would have to decide whether to return a DataFrame or a NamedTuple.

I wonder if it's best for DataFramesMeta.jl to do the broadcasting on its own.

However, your response is a bit confusing:

> return a NamedTuple if you do not want pseudo broadcasting;

since currently returning (a = 1, b = [4, 5, 6]) throws an error. So it's not a broadcasting behavior you can opt into or out of.
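A small sketch of the two behaviors side by side, assuming the same grouped data frame as in the example above. The exact exception type and message are not shown here, since they may differ between DataFrames.jl versions:

```julia
using DataFrames

df = DataFrame(id=repeat([1, 2], 5), val=1:10)
gdf = groupby(df, :id)

# A NamedTuple mixing a scalar and a vector is rejected.
nt_threw = try
    combine(gdf, :val => (x -> (a=1, b=[4, 5, 6])) => AsTable)
    false
catch
    true
end

# The DataFrame constructor recycles the scalar, so this works.
df_res = combine(gdf, :val => (x -> DataFrame(a=1, b=[4, 5, 6])) => AsTable)
```

Here `nt_threw` ends up `true`, while `df_res` has three rows per group with `a` repeated.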

Dec 14 '23 20:12 pdeffebach

> throws an error.

Yes, because NamedTuple does not do broadcasting (as opposed to DataFrame).

Dec 14 '23 21:12 bkamins

Also, I think that the performance hit, although noticeable, would be a minor issue for most users. The major issue is, as you write, that @astable has to have only one meaning. Maybe @asdf or @asdataframe would be an alternative name?

Dec 14 '23 22:12 bkamins

Are there other advantages of making a DataFrame inside fun?

I don't really want to have to introduce both @astable and @asdataframe into the docs and tutorials if the only difference is the spreading behavior. I would just as soon do some spreading inside the anonymous function instead.
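As a rough sketch of what spreading inside the anonymous function could look like: `spread_scalars` below is a hypothetical helper, not part of DataFrames.jl or DataFramesMeta.jl, and it assumes that anything that is not an `AbstractVector` should be treated as a scalar.

```julia
using DataFrames

# Hypothetical helper: repeat every non-vector field of a NamedTuple
# to the common length of its vector fields.
function spread_scalars(nt::NamedTuple)
    lens = [length(v) for v in values(nt) if v isa AbstractVector]
    isempty(lens) && return nt  # only scalars: nothing to spread
    n = first(lens)
    all(==(n), lens) || throw(DimensionMismatch("vector fields differ in length"))
    return map(v -> v isa AbstractVector ? v : fill(v, n), nt)
end

df = DataFrame(id=repeat([1, 2], 5), val=1:10)

# The user-facing function returns a scalar and a vector; the helper
# turns that into a NamedTuple of equal-length vectors for AsTable.
spread_res = combine(groupby(df, :id),
    :val => (x -> spread_scalars((total=sum(x), frac=x ./ sum(x)))) => AsTable)
```

This avoids constructing an intermediate DataFrame per group, at the cost of reimplementing a small part of the recycling logic.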

Dec 14 '23 22:12 pdeffebach

> Are there other advantages of making a DataFrame inside fun?

I do not think so. The other would be making column names unique, but I guess it is not an issue in your case.

Dec 15 '23 10:12 bkamins