Survey.jl
Variance Estimation for Multistage and other Complex designs
Calculating survey means and totals is relatively straightforward compared to calculating their variances in certain designs. For example, the variance of the Horvitz-Thompson estimator contains a joint (second-order) inclusion probability term (\pi_{ij}), which is not readily generalised to arbitrary sampling designs. For now I have implemented the Hartley-Rao variance approximation, which bypasses calculation of the joint inclusion probabilities.
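To make the role of \pi_{ij} concrete, here is a minimal sketch in plain Python (not Survey.jl code) of the Horvitz-Thompson variance estimator for a total. The function name `ht_variance_total` and the toy data are made up for illustration. It is checked against simple random sampling without replacement, one of the few designs where \pi_{ij} has a simple closed form; this is not the Hartley-Rao approximation mentioned above.

```python
from itertools import product
from statistics import variance


def ht_variance_total(y, pi, pi_joint):
    """Horvitz-Thompson variance estimator for the estimated total.

    y        : sampled values
    pi       : first-order inclusion probabilities pi_i
    pi_joint : n x n matrix of joint probabilities pi_ij (diagonal = pi_i)
    """
    n = len(y)
    v = 0.0
    for i, j in product(range(n), repeat=2):
        delta = pi_joint[i][j] - pi[i] * pi[j]
        v += delta / pi_joint[i][j] * (y[i] / pi[i]) * (y[j] / pi[j])
    return v


# Toy check: SRS without replacement of n = 5 units from N = 20,
# where pi_i = n/N and pi_ij = n(n-1) / (N(N-1)) are known exactly.
N = 20
y = [3.0, 7.0, 4.0, 10.0, 6.0]
n = len(y)
pi = [n / N] * n
pi_joint = [
    [n / N if i == j else n * (n - 1) / (N * (N - 1)) for j in range(n)]
    for i in range(n)
]

v_ht = ht_variance_total(y, pi, pi_joint)
v_srs = N**2 * (1 - n / N) * variance(y) / n  # textbook SRS closed form
```

Both numbers agree for this design. The difficulty is that the full n x n matrix `pi_joint` has to come from somewhere, and for most unequal-probability designs it has no simple expression.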
It is interesting to note how R, SAS, Stata, SUDAAN etc. calculate their variances. For some 'elementary' sampling designs they use closed-form solutions, but for more complex multistage designs they use replication or resampling style methods like Balanced Repeated Replication (BRR), the jackknife, or bootstrap-type techniques. I found it difficult to work out exactly what R does in which cases by debugging through the R survey package code, which is illegible in most places.
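As a small illustration of the replication idea, this is a sketch of the delete-one jackknife (often called JK1) for a weighted total, in plain Python with made-up data. The function name `jk1_variance_total` is hypothetical and is not taken from any of the packages above.

```python
from statistics import variance


def jk1_variance_total(y, w):
    """Delete-one jackknife (JK1) variance of the weighted total sum(w * y)."""
    n = len(y)
    full = sum(wi * yi for wi, yi in zip(w, y))
    # Replicate i drops unit i and inflates the remaining weights by n/(n-1).
    reps = []
    for i in range(n):
        reps.append(n / (n - 1) * sum(w[j] * y[j] for j in range(n) if j != i))
    return (n - 1) / n * sum((r - full) ** 2 for r in reps)


# Toy data: n = 5 units from N = 20 with equal weights N/n.
N = 20
y = [3.0, 7.0, 4.0, 10.0, 6.0]
n = len(y)
w = [N / n] * n

v_jk = jk1_variance_total(y, w)
v_no_fpc = N**2 * variance(y) / n  # with-replacement formula, no fpc
```

For a linear statistic like the total, the jackknife reproduces the closed-form with-replacement variance. The same recipe applies unchanged to statistics that have no closed form, which is presumably why the packages fall back to it.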
Estimating variances of subpopulation mean/total estimators in Stata: here
SAS 9.2 documentation on Variance Estimation and here say that for multistage sampling designs they use a Taylor series variance method which only considers the first stage of the sample design, and that for BRR and jackknife based estimation they don't take into account the finite population correction.
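A first-stage-only estimator of this kind (sometimes called the "ultimate cluster" estimator) can be sketched as follows for a weighted total. This is plain Python under my own assumptions about the inputs: parallel per-observation lists and an optional `fpc` mapping from stratum to first-stage sampling fraction. The function name is hypothetical.

```python
from collections import defaultdict


def ultimate_cluster_variance(strata, psus, weights, y, fpc=None):
    """First-stage-only (ultimate cluster) variance of the weighted total.

    strata, psus, weights, y are parallel per-observation lists.
    fpc optionally maps stratum -> first-stage sampling fraction n_h / N_h.
    """
    # Weighted total of y within each PSU, grouped by stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for h, c, w, v in zip(strata, psus, weights, y):
        psu_totals[h][c] += w * v

    var = 0.0
    for h, totals in psu_totals.items():
        z = list(totals.values())
        n_h = len(z)
        if n_h < 2:
            raise ValueError(f"stratum {h!r} has a single PSU")
        z_bar = sum(z) / n_h
        # Between-PSU variability; later sampling stages are ignored.
        between = n_h / (n_h - 1) * sum((zi - z_bar) ** 2 for zi in z)
        f_h = fpc[h] if fpc else 0.0
        var += (1 - f_h) * between
    return var


# Toy data: two strata with two PSUs each.
strata = ["a", "a", "a", "a", "b", "b", "b"]
psus = [1, 1, 2, 2, 3, 4, 4]
weights = [2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

v_wr = ultimate_cluster_variance(strata, psus, weights, y)
v_fpc = ultimate_cluster_variance(strata, psus, weights, y,
                                  fpc={"a": 0.5, "b": 0.25})
```

Only the PSU-level totals enter the calculation, so nothing about how elements were sampled within a PSU is needed beyond the final weights.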
Are these (or some other) reasonable assumptions/relaxations of the variance estimation problem for the Survey.jl package? Perhaps, but it is worth exploring in more detail. I think it is worth first trying to implement the analytical solutions for the variance of mean/total estimators, especially given the power of the Julia language; if that doesn't work, we can fall back to replicate/resampling methods.
Can look into Rao-Wu bootstrap, this is the paper.
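For reference, a rough sketch of how one Rao-Wu replicate of the weights could be generated, in plain Python. It assumes the common variant that draws n_h - 1 PSUs with replacement per stratum and ignores finite population corrections; the function name and toy data are made up for illustration.

```python
import random
from collections import defaultdict


def rao_wu_replicate_weights(strata, psus, weights, rng):
    """One Rao-Wu rescaled bootstrap replicate of the sampling weights.

    In each stratum with n_h PSUs, draw n_h - 1 PSUs with replacement and
    multiply each weight by n_h / (n_h - 1) * (times its PSU was drawn).
    """
    psus_in = defaultdict(list)
    for h, c in zip(strata, psus):
        if c not in psus_in[h]:
            psus_in[h].append(c)

    factor = {}
    for h, ids in psus_in.items():
        n_h = len(ids)
        if n_h < 2:
            raise ValueError(f"stratum {h!r} has a single PSU")
        draws = rng.choices(ids, k=n_h - 1)
        for c in ids:
            factor[(h, c)] = n_h / (n_h - 1) * draws.count(c)
    return [w * factor[(h, c)] for h, c, w in zip(strata, psus, weights)]


# Toy data: one observation per PSU, equal weights within each stratum.
rng = random.Random(1)
strata = ["a"] * 3 + ["b"] * 4
psus = [1, 2, 3, 4, 5, 6, 7]
weights = [10.0] * 3 + [5.0] * 4
rep = rao_wu_replicate_weights(strata, psus, weights, rng)
```

Repeating this B times gives B sets of replicate weights; the variance estimate is then the spread of the statistic recomputed under each set.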
Ben Schneider's blog post explains how R survey recursive variance estimation works! Other posts on the blog are also useful for supporting functions.
SAS surveymeans procedure documentation has the formulae and explanations on the mathematical details.
@sayantikaSSG please summarise your thoughts, research and analysis that you do on this topic here.
Comparison of variance estimation methods in various Survey software. This lists what the different packages support by default @ayushpatnaikgit @sayantikaSSG
SAS Replicate methods for variance explains the formulae involved very cleanly
Bumping this issue. We should prioritise getting Taylor series working soon.
I would be happy to get involved in this process. I'm not that familiar with Julia, but I am quite familiar with the 'survey' package in R and the underlying estimation methods, and I'm currently working on helping Thomas Lumley incorporate a C++ version of the variance estimation functions into the 'survey' package in R.
A couple blog posts I've written on this work are below:
- https://www.practicalsignificance.com/posts/understanding-the-survey-packages-recursive-algorithm/
- https://www.practicalsignificance.com/posts/adding-rcpp-to-the-survey-package/
For what it's worth, I'd strongly encourage reading the book Thomas Lumley wrote on the 'survey' package. There are several smart design decisions explained in the book, which I think would be very helpful to learn from when working on 'Survey.jl', especially when it comes to replicate designs and the use of the scales/rscales framework for describing how to work with various kinds of replicate weights.
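To sketch the idea as I understand it: every replicate method reduces to the same formula, an overall `scale` times a sum of per-replicate `rscales` times squared deviations, so one function can serve BRR, jackknife and bootstrap weights. The Python below is an illustration with made-up numbers, not the 'survey' package's actual code; the function name is hypothetical.

```python
def replicate_variance(theta_hat, replicates, scale, rscales=None):
    """Variance as scale * sum_r rscales[r] * (theta_r - theta_hat)^2."""
    if rscales is None:
        rscales = [1.0] * len(replicates)
    return scale * sum(rs * (t - theta_hat) ** 2
                       for rs, t in zip(rscales, replicates))


theta_hat = 120.0                             # full-sample estimate
replicates = [135.0, 115.0, 130.0, 100.0]     # one estimate per replicate
R = len(replicates)

# Same replicate estimates, different methods: only the scaling changes.
v_brr = replicate_variance(theta_hat, replicates, scale=1 / R)
v_jk1 = replicate_variance(theta_hat, replicates, scale=(R - 1) / R)
# Stratified jackknife (JKn): per-replicate factors (n_h - 1) / n_h,
# here assuming two strata with 2 PSUs each.
v_jkn = replicate_variance(theta_hat, replicates, scale=1.0,
                           rscales=[0.5, 0.5, 0.5, 0.5])
```

Storing the two scaling pieces alongside the replicate weights means the estimation code never needs to know which replication method produced them.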