gtsummary Add a `add_ci.tbl_svysummary()` function

Use survey::svyciprop() to calculate CIs

Aug 22 '21 22:08 ddsjoberg

Hello, I look forward to the release of this new function. For whatever it's worth, it would be great to have this function include an option to generate confidence intervals for proportions or point estimates.

Here are example tables of I created to depict confidence intervals of the:

proportion estimate (i.e., using survey::svby with function svyciprop), and gtsummary table with CI proportions

point estimate (i.e., using survey::svby with function svytotal) gtsummary table with CI point estimates

Dec 06 '21 04:12 trevinoj

hey @ddsjoberg - hope all good; just wondering if you have an estimated timeline for when might be able to implement this? We are migrating the above project to use {gtsummary} - might have some funding available if it helps (as above we considered writing minor wrappers in the meantime, but probably more efficient if we could just help push the package forward directly) Might not be helpful but as aside there is some code here for a tidy style wrapper using {srvyr} (which itself is a wrapper of {survey}... Happy to jump on a call some time if easier!

Feb 27 '22 12:02 aspina7

hello @trevinoj and @aspina7 ,

I don't have at timeline for this function. Perhaps I can find someone at my institutions next hackathon in April who would be interested in developing.

If either of you would like to implement, please be sure to model the function after add_ci.tbl_summary(): perhaps, the two functions have be entirely integrated. Please know if advance, I may not have time in the next couple of months to review a pull request.

Feb 27 '22 12:02 ddsjoberg

thanks @ddsjoberg - no worries, in that case we might just work on a wrapper over on our repo for the meantime and then touch base with you in april to see about helping out with add_ci.tbl_summary() or starting a pr for review when you have more time. @amygimma fyi

Feb 27 '22 16:02 aspina7

@trevinoj @aspina7 did you ever write wrappers for this type of functionality? If so, can you link to it? If not, would you be able to provide the following minimal code examples? It will be helpful when implementing.

Code for calculating the CI for a continuous variable's mean
Code for calculating the CI for a continuous variable's mean when there is tbl_svysummary(by=) present

If you can use the example data set from the tbl_svysummary() help file, that would be easiest for me to review. Also, please keep the example as minimal as possible

Apr 04 '22 01:04 ddsjoberg

hi @ddsjoberg sorry for delay in replying - piled under books again. Haven't managed getting a functioning wrapper going, but see minimal examples below. This wrapper used in the {srvyr} functions does a lot of heavy lifting for calculating confidence intervals. Having all those params in {gtsummary} would be awesome.

You already know this but doing props with CI for categorical vars is a bit of a pain in {survey} - and not entirely sure that {srvyr} has a solution for it either... see notes here

See reprex below for CIs with means comparing {survey} and {srvyr}.

library(survey)
library(srvyr)

# A dataset with a complex design
data(api, package = "survey")

## create a {survey} design object
surveyobj <- survey::svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

## create a {srvyr} design object
srvyrobj  <- apiclus1 %>% 
  srvyr::as_survey_design(id = dnum, weights = pw, fpc = fpc) 

######## CI and design effect for a continuous vars mean 

## calculate the mean academic performance index (and design effect)

# using {survey}
stat <- svymean(~api00, surveyobj, na.rm = TRUE, deff = TRUE) 
stats::confint(stat, level = 0.95)
#>          2.5 %   97.5 %
#> api00 598.0275 690.3113
deff(stat)
#>    api00 
#> 9.345869

# using {srvyr}
srvyrobj %>% 
  summarise(
    srvyr::survey_mean(api00, vartype = "ci", deff = TRUE))
#> # A tibble: 1 x 4
#>    coef `_low` `_upp` `_deff`
#>   <dbl>  <dbl>  <dbl>   <dbl>
#> 1  644.   594.   695.    9.35

######## Stratified CI and design effect for a continuous vars mean 

stat <- svyby(~api00, ~stype, surveyobj, svymean, deff = TRUE)
stats::confint(stat, level = 0.95)
#>      2.5 %   97.5 %
#> E 605.0385 692.6976
#> H 544.0531 693.0897
#> M 569.4866 693.3934
deff(stat)
#> [1] 6.583674 2.228259 2.163900

srvyrobj %>% 
  group_by(stype) %>% 
  summarise(
    srvyr::survey_mean(api00, vartype = "ci", deff = TRUE)
  )
#> # A tibble: 3 x 5
#>   stype  coef `_low` `_upp` `_deff`
#>   <fct> <dbl>  <dbl>  <dbl>   <dbl>
#> 1 E      649.   601.   697.    6.58
#> 2 H      619.   537.   700.    2.23
#> 3 M      631.   564.   699.    2.16

^{Created on 2022-04-10 by the reprex package (v2.0.1)}

Apr 10 '22 11:04 aspina7

Thanks @aspina7 . I'm just looking at the code on my phone at the moment, but do you have code to calculate the CIs using survey pkg? We dont use svry internally and I don't want to add another dependency for the CIs. Thanks!

Apr 10 '22 12:04 ddsjoberg

yep - the above shows both {survey} and {srvyr}. The {survey} examples are the best I can muster without going in to some very dense iteration. If you take a look at the {srvyr} wrapper I linked to though it uses {survey} functions.

Apr 10 '22 13:04 aspina7

The challenge is the computation of CI for proportions: you need to transform the variable into a binary variable and to call svyciprop() => which means that computation has to be done cell per cell. You also need to know if we are displaying row proportion or col proportions to apply the relevant subset() before doing the computation.

library(survey)
#> Le chargement a nécessité le package : grid
#> Le chargement a nécessité le package : Matrix
#> Le chargement a nécessité le package : survival
#> 
#> Attachement du package : 'survey'
#> L'objet suivant est masqué depuis 'package:graphics':
#> 
#>     dotchart
data(api)
d <- svydesign(id=~dnum, fpc=~fpc, data=apiclus1)

library(gtsummary)

d %>%
  tbl_svysummary(
    include = stype,
    by = both
  ) %>%
  add_overall(last = TRUE) %>%
  as_kable()

Characteristic	No, N = 2,523	Yes, N = 6,712	Overall, N = 9,235
stype
E	1,615 (64%)	5,652 (84%)	7,267 (79%)
H	353 (14%)	353 (5.3%)	707 (7.7%)
M	555 (22%)	707 (11%)	1,262 (14%)

# CI for first cell
tmp <- subset(d, both == "No")
svyciprop(~I(stype == "E"), tmp)
#>                        2.5% 97.5%
#> I(stype == "E") 0.640 0.424  0.81


d %>%
  tbl_svysummary(
    include = stype,
    by = both,
    percent = "row"
  ) %>%
  add_overall(last = TRUE) %>%
  as_kable()

Characteristic	No, N = 2,523	Yes, N = 6,712	Overall, N = 9,235
stype
E	1,615 (22%)	5,652 (78%)	7,267 (100%)
H	353 (50%)	353 (50%)	707 (100%)
M	555 (44%)	707 (56%)	1,262 (100%)

# CI for first cell
tmp <- subset(d, stype == "E")
svyciprop(~I(both == "No"), tmp)
#>                        2.5% 97.5%
#> I(both == "No") 0.222 0.165  0.29


d %>%
  tbl_svysummary(
    include = stype,
    by = both,
    percent = "cell"
  ) %>%
  add_overall(last = TRUE) %>%
  as_kable()

Characteristic	No, N = 2,523	Yes, N = 6,712	Overall, N = 9,235
stype
E	1,615 (17%)	5,652 (61%)	7,267 (79%)
H	353 (3.8%)	353 (3.8%)	707 (7.7%)
M	555 (6.0%)	707 (7.7%)	1,262 (14%)

# CI for first cell
svyciprop(~I(stype == "E" & both == "No"), d)
#>                                       2.5% 97.5%
#> I(stype == "E" & both == "No") 0.175 0.126  0.24

^{Created on 2022-04-11 by the reprex package (v2.0.1)}

Apr 11 '22 14:04 larmarange

yeap exactly - the binary bit is the real pain.

below is an example of a very hideously outdated basic function used for a tutorial a while back

# Function used to calculate weighted proportions, CI and design effect
svy_prop <- function(x, design) {
p1 <- round(svyciprop(as.formula(paste0( "~" , x)), design, na.rm = T) * 100, digits = 2)
p2 <- round(confint(p1) * 100, digits = 2)
p3 <- deff(round(svymean(as.formula(paste0( "~" , x)), design, na.rm = T, deff = T) * 100, digits = 2))
p4 <- cbind("Proportion" = p1, p2, "Design effect" = p3)
}

# Function used to calculate weighted proportions, CI and design effect for categorical variables
svy_prop_cat <- function(x, dataset, designids, designweight, designstrata){
  
  #create a list to save outputs in 
  
  outer <- list() 
  
  #change each of the variables of interest to factor
  dataset[,x] <- as.factor(dataset[,x])
  
  #For each level of the categorical variable of interest
  for (i in levels(dataset[,x])){
    
    #create a new binary variable called var temp for each level
    dataset[,paste0(x, "temp")] <- ifelse(dataset[,x] == i, 1, 0)
    
    #define your design
    design <- svydesign(ids = as.formula(paste0("~",designids)), 
                        weights = as.formula(paste0("~",designweight)), 
                        strata = as.formula(paste0("~",designstrata)), 
                        data = dataset)
    
    #use the svy_prop function on binary variable
    overall <- svy_prop(paste0(var ,"temp"), design = design)
    
    #label the rows of your output with the variable and category level
    rownames(overall) <- paste0(var,": ", i)
    
    #save within the list 
    outer[[var]][[i]] <- overall
  }
  
  #bind the output from the list in to a dataframe
  do.call(rbind, outer[[1]][1:length(outer[[1]])])
}

Apr 11 '22 16:04 aspina7

Moving discussion from #1187 here. Yes, agreed this goes into add_ci - the difference in svyciprop is whether the interval is calculated on the log transform, the Wald approximation, and some other options but the SE is the same. I have some time to look at this and will look at add_ci as it currently stands as well as svyciprop and confint (for CI of means).

Another opinionated piece is the degrees of freedom - I don't agree with the defaults of the survey package (Infinity) but we need to stick with them and add option of the better option which uses the degf function.

Oct 11 '22 18:10 szimmer

I agree that it is better to follow the defaults of the survey package (otherwise users get lost) and to add an option to change the behaviour.

Oct 13 '22 11:10 larmarange

I am insterested to dig in this issue. {gtsummary} looks great for my use cases, but I need CIs to adopt it. @larmarange, here is how I do this with a custom function up to now (with svyby() and not svyciprop(). It works fine for percentages by column and row , but I have no clue about how it could work for percentages by cell.

tbl_svysummary_test <- function(design, include, by) {
  var_name <- deparse(substitute(include))
  var_form <- as.formula(paste("~", var_name))
  # var_form <- as.formula(paste("~", include)) # for debugging purposes
  by_name <- deparse(substitute(by))
  by_form <- as.formula(paste("~", by_name))
  # by_form <- as.formula(paste("~", by)) # for debugging purposes
  byout <- svyby(formula = var_form, 
                 by = by_form, 
                 design = design, 
                 FUN=svymean, na.rm=TRUE, vartype="ci")
  nchar_var <- as.numeric(nchar(x)[2]) # to remove name concatenation by {survey}
  byout <- byout[,-1]
  byout <- round(byout*100, 1)
  col_est <- ncol(byout)/3
  estimates <- byout[,1:col_est]
  low_ci <- byout[,(col_est+1):(col_est*2)]
  high_ci <- byout[,(col_est*2+1):(col_est*3)]
  out <- estimates
  for (i in 1:nrow(out)) {
    out[i,] <- paste(paste(paste(estimates[i,], low_ci[i,], sep = " ["), 
                           high_ci[i,], sep="-"),"]", sep = "")
  }
  out <- t(out)
  characteristics <- substring(rownames(out), nchar_var+1)
  out %>%
    as.tibble() %>%
    mutate(Characteristic =  characteristics, .before = everything())
}

Which gives for proportions by row:

d %>%
  tbl_svysummary_test(include = stype, by = both)

Characteristic	No	Yes
E	64 [45.9-82.1]	84.2 [76.3-92.2]
H	14 [4.8-23.2]	5.3 [0.9-9.7]
M	22 [5-39]	10.5 [4.7-16.3]

For porportion by row I use the opposite tabulation

d %>%
  tbl_svysummary_test(include = both, by = stype)

Characteristic	E	H	M
o	22.2 [16.4-28]	50 [27.3-72.7]	44 [18-70]
es	77.8 [72-83.6]	50 [27.3-72.7]	56 [30-82]

Note: row labels habe been cropped in this table, it doesn't seem important at this stage.

Is there something in this approach that might help developping that feature? I'd be happy to go for a try if you think it is compattible with the current tbl_svysummary() design.

Dec 14 '22 09:12 fBedecarrats

Sorry. This issue has been opened for more than a year now.

I just drafted a PR implemented support for proportions, means and medians CIs with tbl_svysummary()

You can have a look at it: #1402

Dec 14 '22 17:12 larmarange

Sorry. This issue has been opened for more than a year now.

I just drafted a PR implemented support for proportions, means and medians CIs with tbl_svysummary()

You can have a look at it: #1402

This is great! I'll make some tests by the beginning of next week. Many thanks. From where I sit, that was very fast.

Dec 14 '22 17:12 fBedecarrats

I am testing with the survey data I am analysing (which I am not allowed to share) and it works great! I'm still struggling with the add_ci() funciton to correctly format the output, asthe syntax for the number of decimal digits to display seems quite different from the tb_summary/tb_svysummary(), but it is a different topic. MANY thanks @larmarange for the update!

Dec 16 '22 14:12 fBedecarrats

gtsummary gtsummary copied to clipboard

Add a `add_ci.tbl_svysummary()` function

gtsummary
gtsummary copied to clipboard