Survey.jl
DataFrames-like API
It's really great that you aim to replicate features provided by the R survey package, as it's really the reference in that domain. However, its API is quite ad hoc and forces users to learn a completely new syntax when moving from unweighted to weighted/survey analyses. Have you considered adopting a syntax based on the DataFrames functions groupby and combine/select/transform? In R, the recent srvyr package does that by wrapping survey functions with a dplyr-like syntax.
In particular, svyby(:api00, [:cname, :meals], dclus1, svymean) could be written as combine(groupby(dclus1, [:cname, :meals]), :api00 => svymean). With DataFramesMeta, this could become @combine(groupby(dclus1, [:cname, :meals]), svymean(:api00)). One would also be able to compute the mean of all columns using combine(dclus1, All() .=> svymean).
In terms of implementation, I saw that you already use combine under the hood, so it shouldn't be problematic.
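To make the proposed shape concrete, here is a minimal sketch of a grouped weighted mean written with the ordinary DataFrames verbs. It does not use Survey.jl at all: the toy data, the pw weight column, and the wmean helper are assumptions made up for illustration, and a real svymean would also have to account for the design (strata, clusters, standard errors).

```julia
using DataFrames

# Hypothetical toy data; `pw` stands in for a sampling-weight column.
df = DataFrame(
    cname = ["A", "A", "B", "B"],
    api00 = [600.0, 700.0, 500.0, 800.0],
    pw    = [1.0, 3.0, 2.0, 2.0],
)

# Hypothetical helper, not part of Survey.jl: a plain weighted mean.
wmean(x, w) = sum(x .* w) / sum(w)

# One weighted mean per group, using the usual split-apply-combine verbs.
by_county = combine(groupby(df, :cname), [:api00, :pw] => wmean => :api00_mean)
```

The point of the sketch is only that the weighted case can keep the same groupby/combine shape as the unweighted one.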
Cc: @bkamins
I think this would be better, and yes, it's precisely what is done under the hood. It will offer greater flexibility to the user. I've studied the srvyr package, and I've also identified many other places where the survey package in R is weak.
However, until 1.0, I want to hold back and just implement the survey package in Julia. The gains in speed are enough to bring users from R to Julia.
I wish to develop the package in this manner because all researchers at xKDR who work on survey data use the survey package in R, and it'll be effortless for them to switch to the Julia version. Many other researchers outside xKDR whom I know also use the survey package in R. I think the package is at least 20 years old. It has entered the knowledge base of many organizations.
I think a feature like:
combine(dclus1, All() .=> svymean)
will be very useful.
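For reference, a minimal sketch of the same broadcasting pattern on a plain DataFrame, with the unweighted mean standing in for svymean. The data are made up, and names(df, Real) is used here to restrict the broadcast to numeric columns, which is an assumption about what "all columns" should mean for a design.

```julia
using DataFrames, Statistics

# Hypothetical toy data with two numeric columns and one string column.
df = DataFrame(api99 = [500.0, 700.0], api00 = [600.0, 800.0], cname = ["A", "B"])

# Broadcast one transformation over every numeric column.
means = combine(df, names(df, Real) .=> mean)
```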
Also, what if there is a function that has multiple return values, for example, the fivenum function in R? I couldn't find the relevant syntax in the DataFrames documentation.
Do you know how I can correct the following?
combine(groupby(dclus1, [:cname, :meals]), :api00 => fivenum)
Here, fivenum returns multiple values.
Is this what you want?
julia> fivenum(x) = [x, 2x, 3x]
fivenum (generic function with 1 method)

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> combine(df, :a => fivenum => AsTable)
3×3 DataFrame
 Row │ x1     x2     x3
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3
   2 │     2      4      6
   3 │     3      6      9

julia> combine(df, :a => fivenum => [:col1, :col2, :col3])
3×3 DataFrame
 Row │ col1   col2   col3
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3
   2 │     2      4      6
   3 │     3      6      9
Many thanks, professor. I wanted it for a GroupedDataFrame, and it's working with some modification.
df = DataFrame(a = 4:7, b = ["Apple", "Orange", "Orange", "Apple"])
gdf = groupby(df, :b)
fivenum(x) = DataFrame(s = sum(x), s2 = 2*sum(x), s3 = 3*sum(x))
combine(gdf, :a => fivenum => AsTable)
I will implement this. Closing the issue.
"I wanted it for a GroupedDataFrame"
The entire DataFrames.jl API is the same for a data frame and a GroupedDataFrame.
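As a further sketch of the multiple-return-value case, the function can return a NamedTuple instead of a DataFrame, and AsTable then spreads its fields into columns, one row per group. The fivenum below is a hypothetical stand-in built on Statistics.quantile; R's fivenum uses Tukey hinges, so the quartile values can differ.

```julia
using DataFrames, Statistics

df = DataFrame(a = 4:7, b = ["Apple", "Orange", "Orange", "Apple"])

# Hypothetical stand-in for R's fivenum, built from Statistics.quantile
# (R's fivenum uses Tukey hinges, which can differ from these quantiles).
function fivenum(x)
    q = quantile(x, [0.0, 0.25, 0.5, 0.75, 1.0])
    return (min = q[1], q1 = q[2], median = q[3], q3 = q[4], max = q[5])
end

# Each NamedTuple field becomes its own column; one row per group.
res = combine(groupby(df, :b), :a => fivenum => AsTable)
```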
I suggest a piping syntax like this, instead of or in addition to svyby, for getting mean heights by country for a design:
macro svypipe(design::AbstractSurveyDesign, args...)
# Some definitions
end
@svypipe design |> groupby(:country) |> mean(:height)
There's no need for a special macro AFAICT. Chain.jl's @chain and other packages already work.
The rest of the package doesn't do piping, so it may look odd if it's just in one place. I am open to having pipes everywhere.
@chain looks good, but the mandatory begin ... end doesn't look nice. Maybe there is no way around it. Perhaps Lazy is more suitable.
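using Lazy
using Statistics: mean
# Import explicitly so that the DataFrames verbs are the ones in scope
import DataFrames: groupby, combine
@> design.data groupby(:country) combine(:height => mean)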
This is similar to Stata's API, and as someone coming from C, I really detest this.
It would indeed be better to use existing functionality. So far, the solution I like most for this is Pipe.jl (I understand that pipes are hard to type on a German keyboard, but I think the block syntax is not really appropriate for this type of operation, and piping looks a lot neater IMO). But for now I would say we should wait until we implement this feature (if we implement it) because, who knows, maybe soon we'll be able to use underscores as r-values. If this gets implemented in Julia, we might be able to do something like
(design, :country) |> groupby(_, _) |> mean(_, :height)
if we add support for AbstractSurveyDesign in groupby (and if we change svymean to mean, but that's a minor aspect in the context of this discussion). If this were possible, it would be great! It looks nice, it is clear and concise, and it uses only Base. We could also do the same thing for functions other than groupby.
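For comparison, assuming the underscore syntax is not available, a Base-only pipeline needs explicit anonymous functions. This is a minimal sketch on a bare DataFrame with made-up data; it uses the existing DataFrames groupby and combine rather than any AbstractSurveyDesign method, which would still have to be added.

```julia
using DataFrames, Statistics

# Hypothetical toy data standing in for design.data.
df = DataFrame(country = ["DE", "DE", "IN"], height = [170.0, 180.0, 165.0])

# Plain Base piping: each stage is wrapped in an anonymous function.
res = df |>
    (d -> groupby(d, :country)) |>
    (g -> combine(g, :height => mean => :mean_height))
```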
For a new user,
(design, :country) |> groupby(_, _) |> mean(_, :height)
will be difficult to understand. We might as well do what Milan is suggesting, i.e.
@combine(groupby(design, :country), mean(:height))