Allow `as_survey_design` objects to work inside `tidymodels`

Open themichjam opened this issue 2 years ago • 9 comments

I'm wondering if it would be possible to allow srvyr survey design objects to work with tidymodels. Example below (with error):

library(survey)
library(srvyr)
library(tidymodels)
data(api)

# stratified sample
dstrata <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

# initial_split
dstrat_split <- initial_split(dstrata)

# Error that occurs from above

Error in `stop_subscript()`:
! Can't subset columns that don't exist.
x Column `strata` doesn't exist.

themichjam avatar Feb 19 '22 21:02 themichjam

This is a cool idea! There's a part of me that wants to use this as an excuse to learn tidymodels, but I have no idea when I'll have time to do so.

gergness avatar Feb 19 '22 21:02 gergness

I would love to help work on it (I'll be doing it as part of my PhD work anyway), and I think it would be a great integration.

themichjam avatar Feb 19 '22 22:02 themichjam

I have a vague memory that splitting surveys into training and testing data sets is non-trivial because the data are not iid. For example, if by chance the training data set omitted a stratum, then the models fit on the training set would be biased.
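To make that concern concrete, here is a small base R simulation. It is only a sketch: the stratum labels and sizes (90, 8, 2) and the 75% training proportion are made up for illustration and do not come from any real survey. It counts how often a simple random split leaves a stratum out of the training set, compared with a split drawn separately within each stratum.

```r
set.seed(42)

# Hypothetical sample with one very small stratum (sizes are invented)
strata <- rep(c("A", "B", "C"), times = c(90, 8, 2))
n <- length(strata)
n_train <- round(0.75 * n)

# Simple random split: does the training set miss at least one stratum?
srs_misses <- replicate(2000, {
  train_idx <- sample.int(n, n_train)
  length(unique(strata[train_idx])) < 3
})

# Split within each stratum: take 75% (rounded up) of every stratum
stratified_misses <- replicate(2000, {
  train_idx <- unlist(lapply(split(seq_len(n), strata), function(idx) {
    idx[sample.int(length(idx), ceiling(0.75 * length(idx)))]
  }))
  length(unique(strata[train_idx])) < 3
})

# Share of replicates in which the training set omitted a stratum
mean(srs_misses)
mean(stratified_misses)
```

Splitting within strata guarantees every stratum shows up in the training set, but it does not by itself settle how the weights or variance estimates should be handled after the split.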

carlganz avatar Feb 19 '22 22:02 carlganz

If you end up remembering where you came across this, I would love the resource! Thanks for your advice.

themichjam avatar Feb 19 '22 22:02 themichjam

This is an intriguing idea, and the R community could really use some tools for incorporating complex designs into modeling/machine-learning. The big challenges here are much more statistical ("what's the right thing to do?") than about API design ("how do we write tidymodels S3 methods for survey design objects?"). To be clearer, I've tried to describe below the two big statistical challenges here.

For the model validation and machine learning algorithms users are turning to {tidymodels} for, the {survey} package doesn't implement methods, and so the {srvyr} package doesn't have something we can just provide a wrapper around in order to give correct results. For that reason, some statistical decisions have to be made, and so I think that {srvyr} is probably not the best fit for this.

Nonetheless, I think it's a great idea to provide a {tidymodels} compatible interface to survey design objects. The end of this ridiculously-long comment has some suggestions about ways to go forward.

Interfacing with modeling packages when only a handful do the right thing for complex surveys

The main challenge to my mind is that there are very few modeling packages that take into account complex survey design features correctly when producing standard errors, confidence intervals, and the like. {survey}, {svyVGAM}, and {rpms} are the only packages I'm aware of that take into account complex design features for regression or tree-based models.

Otherwise, modeling packages would interface with a survey design object simply by accessing the data frame of variables from the survey and maybe the weights. But the inferential statistics (p-values, AUROC, etc.) would not actually take into account things like stratification, clustering, raking, etc. Even for modeling functions which accept a weights argument, it's not clear that the weights are being used correctly for a survey context.

So if {srvyr} had an interface for {tidymodels} functionality for fitting models, it would in most cases likely just pass along the data and weights to a modeling function and provide loud warnings in most circumstances to alert the user that the complex design is being ignored when fitting models.
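As a rough illustration of the point about weights, the sketch below fits the same linear model two ways on the `apistrat` data from the original post: once with `lm()` and a `weights` argument, and once with `survey::svyglm()` on a stratified design object. The model formula (`api00 ~ ell + meals`) is just an arbitrary choice for the example. The coefficients should agree, while the standard errors generally will not, because `lm()` knows nothing about the design.

```r
library(survey)
data(api)

# Design-aware object: stratified sample with sampling weights
dstrat <- svydesign(ids = ~1, strata = ~stype, weights = ~pw, data = apistrat)

# Same model, fit with and without knowledge of the design
fit_lm  <- lm(api00 ~ ell + meals, data = apistrat, weights = pw)
fit_svy <- svyglm(api00 ~ ell + meals, design = dstrat)

# Point estimates side by side
cbind(lm = coef(fit_lm), svyglm = coef(fit_svy))

# Standard errors side by side: only svyglm() uses the strata
cbind(
  lm     = sqrt(diag(vcov(fit_lm))),
  svyglm = sqrt(diag(vcov(fit_svy)))
)
```

A wrapper that only forwards the data and weights to an arbitrary modeling function would behave like the `lm()` column here, which is why a loud warning seems appropriate.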

Figuring out what to do for splitting/cross-validation

The other challenge that @carlganz brings up is that it's not clear how to appropriately do training/testing/cross-validation for complex designs; this is an unsettled problem with some active research going on. Here's an interesting paper (not yet fully published) and nice accompanying accessible presentation on the topic. The authors just last month published on CRAN an R package for cross-validation with surveys, {surveyCV}.
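To give a feel for what design-aware fold assignment might look like, here is a minimal base R sketch that assigns cross-validation folds separately within each stratum. To be clear, this is not the {surveyCV} implementation: `assign_stratified_folds()` is a hypothetical helper written for this comment, and the stratum sizes simply mirror the 100/50/50 split of `stype` in `apistrat`.

```r
# Hypothetical helper (not part of any package): assign each unit to one of
# `nfolds` folds, shuffling and dealing out fold labels within each stratum
assign_stratified_folds <- function(strata, nfolds) {
  folds <- integer(length(strata))
  for (idx in split(seq_along(strata), strata)) {
    shuffled <- idx[sample.int(length(idx))]
    folds[shuffled] <- rep_len(seq_len(nfolds), length(idx))
  }
  folds
}

set.seed(1)
strata <- rep(c("E", "H", "M"), times = c(100, 50, 50))
folds <- assign_stratified_folds(strata, nfolds = 5)

# Every fold gets the same share of each stratum
table(strata, folds)
```

A clustered design would additionally need whole clusters kept together in a single fold, which this sketch does not attempt.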

Some ideas for taking this idea forward

I think it would be very helpful to have a {tidymodels}-compatible package that provides methods for {tidymodels} functions, draws on the {surveyCV} R package to figure out splitting/CV, and provides interfaces to the modeling functions of {survey}, {srvyr}, {svyVGAM}, and {rpms}. If you want to take the lead @themichjam on a package like that, I'd be happy to contribute and I'm sure there are things that the {srvyr} package can add to help.

bschneidr avatar Feb 19 '22 23:02 bschneidr

Thanks so much Ben! You make some really great points, and you're right about the lack of tools for integrating complex survey designs into machine learning and modelling (I've only started asking this question in the last month, with only 5 months left on my PhD work, in which I wanted to feature this exact thing). I'd be more than happy to take the lead on a package, and it would be great to work with you (and anyone who finds this and wants to get involved).

themichjam avatar Feb 20 '22 14:02 themichjam

Cool! And yeah, thanks so much Ben! I can't really help with the "should this be done" or any of the statistical knowledge, but would only be able to help with syntactic issues and R coding bug stuff.

FWIW, I did a quick scan through tidymodels, and there is unfortunately not much use of generics that would allow one to write a package that has commands literally with the same name as their tidymodels equivalents. Instead I think the value would be in making functions for someone who is used to the tidymodels workflow to easily update.

Also, the reason the command fails is that the survey package's objects (and therefore srvyr's) are not true data.frames; instead, the data.frame is stored as $variables in the object.

library(srvyr, warn.conflicts = FALSE)
data(api, package = "survey")

dstrata <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)


names(apistrat)
#>  [1] "cds"      "stype"    "name"     "sname"    "snum"     "dname"   
#>  [7] "dnum"     "cname"    "cnum"     "flag"     "pcttest"  "api00"   
#> [13] "api99"    "target"   "growth"   "sch.wide" "comp.imp" "both"    
#> [19] "awards"   "meals"    "ell"      "yr.rnd"   "mobility" "acs.k3"  
#> [25] "acs.46"   "acs.core" "pct.resp" "not.hsg"  "hsg"      "some.col"
#> [31] "col.grad" "grad.sch" "avg.ed"   "full"     "emer"     "enroll"  
#> [37] "api.stu"  "pw"       "fpc"

name(dstrata)
#> Error in name(dstrata): could not find function "name"

names(dstrata$variables)
#>  [1] "cds"      "stype"    "name"     "sname"    "snum"     "dname"   
#>  [7] "dnum"     "cname"    "cnum"     "flag"     "pcttest"  "api00"   
#> [13] "api99"    "target"   "growth"   "sch.wide" "comp.imp" "both"    
#> [19] "awards"   "meals"    "ell"      "yr.rnd"   "mobility" "acs.k3"  
#> [25] "acs.46"   "acs.core" "pct.resp" "not.hsg"  "hsg"      "some.col"
#> [31] "col.grad" "grad.sch" "avg.ed"   "full"     "emer"     "enroll"  
#> [37] "api.stu"  "pw"       "fpc"

gergness avatar Feb 20 '22 17:02 gergness

Out of curiosity @gergness, is there any way to circumvent this for now, before @bschneidr's package idea is developed (e.g. weight the data without it being a survey design object, so it could then be plugged into tidymodels)?

themichjam avatar Feb 20 '22 21:02 themichjam

Just playing around with the structure that's returned from initial_split reveals that it stores the data in $data, so you could run as_survey() on that.

EDIT: Actually no, this doesn't work - the training and testing aren't subset.

library(srvyr)
library(tidymodels)
library(dplyr)

data(api, package = "survey")
data_split <- initial_split(apistrat)

# The data is stored here:
all.equal(data_split$data, apistrat)
#> [1] TRUE


# So maybe you can do this?
data_split$data <- data_split$data %>%
    as_survey(strata = stype, weights = pw)


# Doesn't actually work.
nrow(training(data_split))
#> 200
nrow(testing(data_split))
#> 200

Created on 2022-02-22 by the reprex package (v2.0.1)
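One alternative that might be worth trying is to go the other way around: split the plain data frame first, then turn the training and testing pieces into survey designs separately. The sketch below assumes that is acceptable for your use case, which is exactly the open statistical question discussed earlier in this thread. The `strata = stype` argument to `initial_split()` asks {rsample} to sample within school type, and the object names `train_design` and `test_design` are just illustrative.

```r
library(srvyr)
library(rsample)

data(api, package = "survey")

set.seed(123)

# Split the ordinary data frame, sampling within each school type
data_split <- initial_split(apistrat, strata = stype)

# Convert each piece to its own survey design afterwards
train_design <- training(data_split) %>%
  as_survey_design(strata = stype, weights = pw)

test_design <- testing(data_split) %>%
  as_survey_design(strata = stype, weights = pw)

# The two pieces together cover the original sample
nrow(train_design$variables) + nrow(test_design$variables)
```

Note that the weights are carried over unchanged, so they no longer sum to the population size within either piece; whether and how to rescale them is a decision this sketch leaves open.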

gergness avatar Feb 22 '22 16:02 gergness