Revisit spreading for `AsTable` output
Here is a use-case where DataFrames.jl compares unfavorably to dplyr.
Basically, the best way to do inter-dependent column transformations is to use AsTable and return a NamedTuple. However, in OP's example they want to return a scalar and a vector simultaneously, so AsTable isn't an option without some awkward manual spreading of the scalar.
Maybe we can spread scalar outputs when an AsTable is the dest?
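For concreteness, here is roughly what the manual spreading looks like today. This is only a sketch, assuming DataFrames.jl 1.x and the same `id`/`val` data as the example further down the thread; `total_and_frac` is a name I made up for illustration.

```julia
using DataFrames

df = DataFrame(id = repeat([1, 2], 5), val = 1:10)

# Hypothetical helper name. Every field of the returned NamedTuple has to be
# a vector of the same length, so the scalar total is repeated by hand.
function total_and_frac(x)
    s = sum(x)
    return (total = fill(s, length(x)), frac = x ./ s)
end

res = combine(groupby(df, :id), :val => total_and_frac => AsTable)
```

The `fill(s, length(x))` call is the part that spreading scalar outputs would make unnecessary.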
and return a NamedTuple.
The design is to:
- return a NamedTuple if you do not want pseudo broadcasting;
- return a DataFrame if you want it.
Example:
julia> df = DataFrame(id=repeat([1, 2], 5), val=1:10)
10×2 DataFrame
 Row │ id     val
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     1      3
   4 │     2      4
   5 │     1      5
   6 │     2      6
   7 │     1      7
   8 │     2      8
   9 │     1      9
  10 │     2     10
julia> combine(groupby(df, :id), :val => (x -> (s=sum(x); DataFrame(total=s, frac=x ./ s))) => AsTable)
10×3 DataFrame
 Row │ id     total  frac
     │ Int64  Int64  Float64
─────┼─────────────────────────
   1 │     1     25  0.04
   2 │     1     25  0.12
   3 │     1     25  0.2
   4 │     1     25  0.28
   5 │     1     25  0.36
   6 │     2     30  0.0666667
   7 │     2     30  0.133333
   8 │     2     30  0.2
   9 │     2     30  0.266667
  10 │     2     30  0.333333
The question is how to reflect this in DataFramesMeta.jl.
Hmmm... I don't love the performance hit that would come with constructing a DataFrame. With the current implementation of the @astable macro-flag I would have to decide whether it constructs a DataFrame or a NamedTuple.
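To make the current behavior concrete, this is my understanding of how the same computation looks with @astable today, where the block is turned into a NamedTuple and so the scalar still has to be spread by hand. Treat it as a sketch: it assumes a DataFramesMeta.jl version that provides @astable, and the same `id`/`val` data as above.

```julia
using DataFrames, DataFramesMeta

df = DataFrame(id = repeat([1, 2], 5), val = 1:10)

res = @by df :id @astable begin
    s = sum(:val)                     # plain local variable, not an output column
    :total = fill(s, length(:val))    # scalar spread by hand
    :frac = :val ./ s
end
```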
I wonder if it's best for DataFramesMeta.jl to do the broadcasting on its own.
However, your response is a bit confusing:
return a NamedTuple if you do not want pseudo broadcasting;
since currently returning (a = 1, b = [4, 5, 6]) throws an error. So it's not a broadcasting behavior you can opt in or out of.
throws an error.
Yes, because NamedTuple does not do broadcasting (as opposed to DataFrame).
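A small sketch of the difference, assuming DataFrames.jl 1.x. The toy `id`/`val` frame and the variable names are made up for illustration, and the catch block deliberately does not assume a particular exception type.

```julia
using DataFrames

# The DataFrame constructor recycles the scalar to match the vector length.
wide = DataFrame(a = 1, b = [4, 5, 6])

# A NamedTuple does no recycling; it just stores both values as given.
nt = (a = 1, b = [4, 5, 6])

# Returning such a mixed NamedTuple from a transformation is rejected.
toy = DataFrame(id = [1, 1, 1], val = [4, 5, 6])
threw = try
    combine(groupby(toy, :id), :val => (x -> (a = 1, b = x)) => AsTable)
    false
catch
    true
end
```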
Also, I think that the performance hit, although noticeable, would be a minor issue for most users. The major issue is, as you write, that @astable has to have only one meaning. Maybe @asdf or @asdataframe would be an alternative name?
Are there other advantages of making a DataFrame inside fun?
I don't really want to have to introduce both @astable and @asdataframe into the docs and tutorials if the only difference is the spreading behavior. I would just as soon do some spreading inside the anonymous function instead.
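One way the spreading inside the function could be factored out is a small helper like the one below. This is only a sketch: `spread_scalars` is a hypothetical name, not something DataFrames.jl or DataFramesMeta.jl provides, and it treats anything that is not an AbstractVector as a scalar.

```julia
# Hypothetical helper for illustration; not part of DataFrames.jl or DataFramesMeta.jl.
function spread_scalars(nt::NamedTuple)
    # Lengths of all vector-valued fields.
    lens = unique([length(v) for v in values(nt) if v isa AbstractVector])
    isempty(lens) && return nt   # only scalars: nothing to spread
    length(lens) == 1 ||
        throw(DimensionMismatch("vector fields have different lengths: $lens"))
    n = first(lens)
    # Repeat each scalar n times; leave vectors untouched.
    return map(v -> v isa AbstractVector ? v : fill(v, n), nt)
end

x = [1, 3, 5, 7, 9]
s = sum(x)
out = spread_scalars((total = s, frac = x ./ s))
```

Inside a transformation this would be used as `:val => (x -> spread_scalars((total = sum(x), frac = x ./ sum(x)))) => AsTable`, which avoids constructing a DataFrame per group.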
Are there other advantages of making a DataFrame inside fun?
I do not think so. The only other one would be making column names unique, but I guess that is not an issue in your case.