Survey.jl

Testing using other survey datasets

Open smishr opened this issue 2 years ago • 5 comments

So far, most of the test suite is limited to the API dataset. I suggest improving testing by using other publicly available survey datasets. Examples from the Lumley R survey textbook could be used (pg 7, section 1.2.1), e.g. NHANES, SHS, SIPP.

http://asdfree.com - Analyse Surveys Free has many real-world datasets and examples with respective R survey code.

smishr, Nov 02 '22 16:11

@iuliadmtru We think (a smaller and older version of) the Scottish Household Survey is a great candidate for testing with the singledesign branch.

Detailed info and data scripts as well as downloads are available on this really old website.

In the Lumley Survey textbook, you will also find multiple examples of R design and code for the older version of SHS in Chapter 6, figure 6.2 onwards, pg 110-130.

smishr, Jan 04 '23 06:01

The old PEAS exemplars include 6 surveys complete with R code, and they are small enough to be analysed locally on a modern computer without much hassle. Tests and designs can be translated from the code and explanations given here.

smishr, Jan 04 '23 07:01

After a deeper look, I think we should export all of the survey RData files linked on those websites and add them to the Julia assets/ folder. They aren't very big, only a few KB at most, with about 5-10 thousand observations along with weights, cluster and strata.

smishr, Jan 04 '23 07:01

PR #166 adds more datasets to use for testing. We should remove all the datasets within assets/ that we are not using and will not use for testing.

I added the datasets you mentioned, apart from the last two. Those are neither clustered nor stratified. I think we have enough datasets now and we should focus on testing. @ayushpatnaikgit I will start testing right after you push the latest version of bootweights.

iuliadmtru, Jan 04 '23 14:01

Firstly, should we download these datasets on demand (e.g. with wget), or ship them as part of the package?

Can we check the licenses of those datasets, and whether they are GPLv3 or similar and hence can be distributed with Survey.jl?
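
To make the download option concrete, here is a minimal sketch of a fetch-on-demand helper that only hits the network when the file is not already cached. Everything named here is hypothetical: the function name `fetch_dataset`, the cache layout, and the URL in the usage note are placeholders for illustration, not anything that exists in Survey.jl.

```shell
#!/bin/sh
# Hypothetical helper (not part of Survey.jl): fetch a dataset only if it is
# not already present locally. Arguments: source URL, destination path.
fetch_dataset() {
    url="$1"
    dest="$2"

    # Reuse the local copy if it exists and is non-empty.
    if [ -s "$dest" ]; then
        echo "cached: $dest"
        return 0
    fi

    mkdir -p "$(dirname "$dest")"

    # Download to a temporary name first, so a failed transfer never
    # leaves a partial file that later looks like a valid cache entry.
    if wget -q -O "$dest.part" "$url"; then
        mv "$dest.part" "$dest"
        echo "downloaded: $dest"
    else
        rm -f "$dest.part"
        echo "failed: $url" >&2
        return 1
    fi
}
```

A call would look like `fetch_dataset "https://example.org/shs.csv" "$HOME/.cache/survey-datasets/shs.csv"`, where both the URL and the path are placeholders. Downloading on demand would keep the package small and might sidestep redistribution questions, but it would make tests depend on network access and on the upstream site staying online, so the license question above still matters for deciding between the two.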

smishr, Feb 11 '23 10:02