Survey.jl

Variance Estimation for Multistage and other Complex designs

Open smishr opened this issue 2 years ago • 9 comments

Calculating survey means and totals is relatively straightforward compared to calculating their variances in certain designs. For example, the variance of the Horvitz-Thompson estimator contains a joint (second-order) inclusion probability term \(\pi_{ij}\), which is not readily generalised to arbitrary sampling designs. For now I have implemented the Hartley-Rao variance approximation, which bypasses calculation of the joint inclusion probabilities.
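For reference, a sketch of the textbook Horvitz-Thompson total and its variance estimator. The notation is an assumption for illustration and is not taken from the Survey.jl code: s is the sample, \pi_i is the first-order inclusion probability of unit i, and \pi_{ij} is the joint inclusion probability of units i and j, with \pi_{ii} = \pi_i.

```latex
\hat{Y}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i},
\qquad
\hat{V}\!\left(\hat{Y}_{HT}\right)
  = \sum_{i \in s} \sum_{j \in s}
    \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}}
    \, \frac{y_i}{\pi_i} \, \frac{y_j}{\pi_j}
```

The double sum needs \pi_{ij} for every pair of sampled units, which is what makes the estimator hard to apply to arbitrary designs; approximations of the Hartley-Rao type avoid this by working from the first-order probabilities only.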

It is interesting to note how R, SAS, Stata, SUDAAN, etc. calculate their variances. For some 'elementary' sampling designs they use closed-form solutions, but for more complex multistage designs they use replication or resampling methods such as Balanced Repeated Replication (BRR), the jackknife, or bootstrap-type techniques. By debugging through the R survey package code, which is hard to read in most places, I found it difficult to work out exactly what R does in which cases.

Estimating variances of subpopulation mean/total estimators in Stata: here

SAS 9.2 documentation on Variance Estimation and here says that for multistage sampling designs, they use a Taylor series variance method which only considers the first stage of the sample design, and that for BRR and jackknife-based estimation they don't take the finite population correction into account.
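A minimal Python sketch of the first-stage-only ("ultimate cluster") idea for a weighted total, to make the description above concrete. The function name `ultimate_cluster_variance` and its arguments are hypothetical and are not taken from SAS or Survey.jl; `fpc` is assumed to be an optional dict of first-stage sampling fractions per stratum. Only variation between weighted PSU totals within each stratum enters the estimate, and later sampling stages are ignored.

```python
from collections import defaultdict


def ultimate_cluster_variance(y, weights, psu, strata=None, fpc=None):
    """Variance of the weighted total sum(w*y) using first-stage PSUs only.

    Computes sum_h (1 - f_h) * n_h / (n_h - 1) * sum_i (z_hi - zbar_h)^2,
    where z_hi is the weighted total of PSU i in stratum h.
    """
    if strata is None:
        strata = [0] * len(y)  # treat the sample as a single stratum

    # Weighted total of each first-stage unit (PSU), keyed by (stratum, psu).
    psu_totals = defaultdict(float)
    for yk, wk, pk, hk in zip(y, weights, psu, strata):
        psu_totals[(hk, pk)] += wk * yk

    # Group the PSU totals by stratum.
    by_stratum = defaultdict(list)
    for (hk, _), z in psu_totals.items():
        by_stratum[hk].append(z)

    variance = 0.0
    for hk, z in by_stratum.items():
        n_h = len(z)
        if n_h < 2:
            raise ValueError(f"stratum {hk!r} has fewer than two PSUs")
        z_bar = sum(z) / n_h
        ss = sum((zi - z_bar) ** 2 for zi in z)
        f_h = fpc.get(hk, 0.0) if fpc else 0.0  # sampling fraction, 0 if none given
        variance += (1.0 - f_h) * n_h / (n_h - 1) * ss
    return variance


# Two PSUs with two units each, all weights equal to 2.
print(ultimate_cluster_variance([1, 2, 3, 4], [2, 2, 2, 2], [1, 1, 2, 2]))  # 64.0
```

For a mean or ratio, the same formula would be applied to a linearised variable in place of y; that step is left out of this sketch.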

Are these (or some others) reasonable assumptions/relaxations of the variance estimation problem for the Survey.jl package? Perhaps, but it is worth exploring in more detail. I think it is worth first trying to implement the analytical solutions for the variance of mean/total estimators, especially given the power of the Julia language; if that doesn't work, we can fall back to replicate/resampling methods.

smishr avatar Nov 03 '22 10:11 smishr

Can look into Rao-Wu bootstrap, this is the paper.

smishr avatar Nov 04 '22 06:11 smishr

Ben Schneider's blog post explains how the R survey package's recursive variance estimation works! Other posts on the blog are also useful for the supporting functions.

smishr avatar Nov 05 '22 17:11 smishr

SAS surveymeans procedure documentation has the formulae and explanations on the mathematical details.

smishr avatar Nov 07 '22 09:11 smishr

@sayantikaSSG please summarise your thoughts, research and analysis that you do on this topic here.

smishr avatar Nov 29 '22 05:11 smishr

Comparison of variance estimation methods in various Survey software. This lists what the different packages support by default @ayushpatnaikgit @sayantikaSSG

smishr avatar Dec 01 '22 06:12 smishr

SAS Replicate methods for variance explains the formulae involved very cleanly

smishr avatar Dec 12 '22 13:12 smishr

Bumping this issue. We should prioritise getting Taylor series working soon.

smishr avatar Feb 11 '23 10:02 smishr

I would be happy to get involved in this process. I'm not that familiar with Julia, but I am quite familiar with the 'survey' package in R and the underlying estimation methods, and I'm currently working on helping Thomas Lumley incorporate a C++ version of the variance estimation functions into the 'survey' package in R.

A couple blog posts I've written on this work are below:

  • https://www.practicalsignificance.com/posts/understanding-the-survey-packages-recursive-algorithm/
  • https://www.practicalsignificance.com/posts/adding-rcpp-to-the-survey-package/

For what it's worth, I'd strongly encourage reading the book Thomas Lumley wrote on the 'survey' package. It explains several smart design decisions that I think would be very helpful to learn from when working on 'Survey.jl', especially when it comes to replicate designs and the use of the scales/rscales framework for describing how to work with various kinds of replicate weights.
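To illustrate the scale/rscales idea, here is a rough Python sketch based on my understanding of how the R 'survey' package treats replicate designs: the variance is `scale * sum(rscales[r] * (theta_r - theta)^2)`, so different replication methods differ only in their weights and scaling constants. The function names below are hypothetical and do not belong to either package's API.

```python
def weighted_total(y, w):
    """Weighted total sum(w_k * y_k)."""
    return sum(wk * yk for yk, wk in zip(y, w))


def replicate_variance(y, full_weights, replicate_weights, scale,
                       rscales=None, estimator=weighted_total, mse=True):
    """Generic replicate-weight variance: scale * sum_r rscales_r * (theta_r - center)^2.

    With mse=True the deviations are taken around the full-sample estimate,
    otherwise around the mean of the replicate estimates.
    """
    theta = estimator(y, full_weights)
    thetas = [estimator(y, rw) for rw in replicate_weights]
    if rscales is None:
        rscales = [1.0] * len(thetas)
    center = theta if mse else sum(thetas) / len(thetas)
    return scale * sum(r * (t - center) ** 2 for r, t in zip(rscales, thetas))


def jk1_replicate_weights(weights, psu):
    """Delete-one-PSU (JK1) replicate weights for an unstratified design.

    Each replicate zeroes out one PSU and inflates the rest by n / (n - 1).
    """
    ids = sorted(set(psu))
    n = len(ids)
    return [
        [0.0 if p == dropped else w * n / (n - 1) for w, p in zip(weights, psu)]
        for dropped in ids
    ]


y = [1, 2, 3, 4]
w = [2, 2, 2, 2]
psu = [1, 1, 2, 2]
reps = jk1_replicate_weights(w, psu)
n = len(set(psu))
# JK1 uses scale = (n - 1) / n with all rscales equal to 1.
print(replicate_variance(y, w, reps, scale=(n - 1) / n))  # 64.0
```

Under this framing, BRR with R replicates would use `scale = 1 / R` with the same `replicate_variance` function, which is the appeal of the design: one variance routine serves every replication method.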

bschneidr avatar Apr 10 '23 21:04 bschneidr