gtsummary
gtsummary copied to clipboard
Add a `add_ci.tbl_svysummary()` function
Use survey::svyciprop()
to calculate CIs
Hello, I look forward to the release of this new function. For whatever it's worth, it would be great to have this function include an option to generate confidence intervals for proportions or point estimates.
Here are example tables of I created to depict confidence intervals of the:
proportion estimate (i.e., using survey::svby with function svyciprop), and
point estimate (i.e., using survey::svby with function svytotal)
hey @ddsjoberg - hope all good; just wondering if you have an estimated timeline for when might be able to implement this? We are migrating the above project to use {gtsummary} - might have some funding available if it helps (as above we considered writing minor wrappers in the meantime, but probably more efficient if we could just help push the package forward directly) Might not be helpful but as aside there is some code here for a tidy style wrapper using {srvyr} (which itself is a wrapper of {survey}... Happy to jump on a call some time if easier!
hello @trevinoj and @aspina7 ,
I don't have at timeline for this function. Perhaps I can find someone at my institutions next hackathon in April who would be interested in developing.
If either of you would like to implement, please be sure to model the function after add_ci.tbl_summary()
: perhaps, the two functions have be entirely integrated. Please know if advance, I may not have time in the next couple of months to review a pull request.
thanks @ddsjoberg - no worries, in that case we might just work on a wrapper over on our repo for the meantime and then touch base with you in april to see about helping out with add_ci.tbl_summary()
or starting a pr for review when you have more time.
@amygimma fyi
@trevinoj @aspina7 did you ever write wrappers for this type of functionality? If so, can you link to it? If not, would you be able to provide the following minimal code examples? It will be helpful when implementing.
- Code for calculating the CI for a continuous variable's mean
- Code for calculating the CI for a continuous variable's mean when there is
tbl_svysummary(by=)
present
If you can use the example data set from the tbl_svysummary()
help file, that would be easiest for me to review. Also, please keep the example as minimal as possible
hi @ddsjoberg sorry for delay in replying - piled under books again. Haven't managed getting a functioning wrapper going, but see minimal examples below. This wrapper used in the {srvyr} functions does a lot of heavy lifting for calculating confidence intervals. Having all those params in {gtsummary} would be awesome.
You already know this but doing props with CI for categorical vars is a bit of a pain in {survey} - and not entirely sure that {srvyr} has a solution for it either... see notes here
See reprex below for CIs with means comparing {survey} and {srvyr}.
library(survey)
library(srvyr)
# A dataset with a complex design
data(api, package = "survey")
## create a {survey} design object
surveyobj <- survey::svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
## create a {srvyr} design object
srvyrobj <- apiclus1 %>%
srvyr::as_survey_design(id = dnum, weights = pw, fpc = fpc)
######## CI and design effect for a continuous vars mean
## calculate the mean academic performance index (and design effect)
# using {survey}
stat <- svymean(~api00, surveyobj, na.rm = TRUE, deff = TRUE)
stats::confint(stat, level = 0.95)
#> 2.5 % 97.5 %
#> api00 598.0275 690.3113
deff(stat)
#> api00
#> 9.345869
# using {srvyr}
srvyrobj %>%
summarise(
srvyr::survey_mean(api00, vartype = "ci", deff = TRUE))
#> # A tibble: 1 x 4
#> coef `_low` `_upp` `_deff`
#> <dbl> <dbl> <dbl> <dbl>
#> 1 644. 594. 695. 9.35
######## Stratified CI and design effect for a continuous vars mean
stat <- svyby(~api00, ~stype, surveyobj, svymean, deff = TRUE)
stats::confint(stat, level = 0.95)
#> 2.5 % 97.5 %
#> E 605.0385 692.6976
#> H 544.0531 693.0897
#> M 569.4866 693.3934
deff(stat)
#> [1] 6.583674 2.228259 2.163900
srvyrobj %>%
group_by(stype) %>%
summarise(
srvyr::survey_mean(api00, vartype = "ci", deff = TRUE)
)
#> # A tibble: 3 x 5
#> stype coef `_low` `_upp` `_deff`
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 E 649. 601. 697. 6.58
#> 2 H 619. 537. 700. 2.23
#> 3 M 631. 564. 699. 2.16
Created on 2022-04-10 by the reprex package (v2.0.1)
Thanks @aspina7 . I'm just looking at the code on my phone at the moment, but do you have code to calculate the CIs using survey pkg? We dont use svry internally and I don't want to add another dependency for the CIs. Thanks!
yep - the above shows both {survey} and {srvyr}. The {survey} examples are the best I can muster without going in to some very dense iteration. If you take a look at the {srvyr} wrapper I linked to though it uses {survey} functions.
The challenge is the computation of CI for proportions: you need to transform the variable into a binary variable and to call svyciprop()
=> which means that computation has to be done cell per cell. You also need to know if we are displaying row proportion or col proportions to apply the relevant subset()
before doing the computation.
library(survey)
#> Le chargement a nécessité le package : grid
#> Le chargement a nécessité le package : Matrix
#> Le chargement a nécessité le package : survival
#>
#> Attachement du package : 'survey'
#> L'objet suivant est masqué depuis 'package:graphics':
#>
#> dotchart
data(api)
d <- svydesign(id=~dnum, fpc=~fpc, data=apiclus1)
library(gtsummary)
d %>%
tbl_svysummary(
include = stype,
by = both
) %>%
add_overall(last = TRUE) %>%
as_kable()
Characteristic | No, N = 2,523 | Yes, N = 6,712 | Overall, N = 9,235 |
---|---|---|---|
stype | |||
E | 1,615 (64%) | 5,652 (84%) | 7,267 (79%) |
H | 353 (14%) | 353 (5.3%) | 707 (7.7%) |
M | 555 (22%) | 707 (11%) | 1,262 (14%) |
# CI for first cell
tmp <- subset(d, both == "No")
svyciprop(~I(stype == "E"), tmp)
#> 2.5% 97.5%
#> I(stype == "E") 0.640 0.424 0.81
d %>%
tbl_svysummary(
include = stype,
by = both,
percent = "row"
) %>%
add_overall(last = TRUE) %>%
as_kable()
Characteristic | No, N = 2,523 | Yes, N = 6,712 | Overall, N = 9,235 |
---|---|---|---|
stype | |||
E | 1,615 (22%) | 5,652 (78%) | 7,267 (100%) |
H | 353 (50%) | 353 (50%) | 707 (100%) |
M | 555 (44%) | 707 (56%) | 1,262 (100%) |
# CI for first cell
tmp <- subset(d, stype == "E")
svyciprop(~I(both == "No"), tmp)
#> 2.5% 97.5%
#> I(both == "No") 0.222 0.165 0.29
d %>%
tbl_svysummary(
include = stype,
by = both,
percent = "cell"
) %>%
add_overall(last = TRUE) %>%
as_kable()
Characteristic | No, N = 2,523 | Yes, N = 6,712 | Overall, N = 9,235 |
---|---|---|---|
stype | |||
E | 1,615 (17%) | 5,652 (61%) | 7,267 (79%) |
H | 353 (3.8%) | 353 (3.8%) | 707 (7.7%) |
M | 555 (6.0%) | 707 (7.7%) | 1,262 (14%) |
# CI for first cell
svyciprop(~I(stype == "E" & both == "No"), d)
#> 2.5% 97.5%
#> I(stype == "E" & both == "No") 0.175 0.126 0.24
Created on 2022-04-11 by the reprex package (v2.0.1)
yeap exactly - the binary bit is the real pain.
below is an example of a very hideously outdated basic function used for a tutorial a while back
# Function used to calculate weighted proportions, CI and design effect
svy_prop <- function(x, design) {
p1 <- round(svyciprop(as.formula(paste0( "~" , x)), design, na.rm = T) * 100, digits = 2)
p2 <- round(confint(p1) * 100, digits = 2)
p3 <- deff(round(svymean(as.formula(paste0( "~" , x)), design, na.rm = T, deff = T) * 100, digits = 2))
p4 <- cbind("Proportion" = p1, p2, "Design effect" = p3)
}
# Function used to calculate weighted proportions, CI and design effect for categorical variables
svy_prop_cat <- function(x, dataset, designids, designweight, designstrata){
#create a list to save outputs in
outer <- list()
#change each of the variables of interest to factor
dataset[,x] <- as.factor(dataset[,x])
#For each level of the categorical variable of interest
for (i in levels(dataset[,x])){
#create a new binary variable called var temp for each level
dataset[,paste0(x, "temp")] <- ifelse(dataset[,x] == i, 1, 0)
#define your design
design <- svydesign(ids = as.formula(paste0("~",designids)),
weights = as.formula(paste0("~",designweight)),
strata = as.formula(paste0("~",designstrata)),
data = dataset)
#use the svy_prop function on binary variable
overall <- svy_prop(paste0(var ,"temp"), design = design)
#label the rows of your output with the variable and category level
rownames(overall) <- paste0(var,": ", i)
#save within the list
outer[[var]][[i]] <- overall
}
#bind the output from the list in to a dataframe
do.call(rbind, outer[[1]][1:length(outer[[1]])])
}
Moving discussion from #1187 here. Yes, agreed this goes into add_ci - the difference in svyciprop is whether the interval is calculated on the log transform, the Wald approximation, and some other options but the SE is the same. I have some time to look at this and will look at add_ci as it currently stands as well as svyciprop and confint (for CI of means).
Another opinionated piece is the degrees of freedom - I don't agree with the defaults of the survey package (Infinity) but we need to stick with them and add option of the better option which uses the degf function.
I agree that it is better to follow the defaults of the survey
package (otherwise users get lost) and to add an option to change the behaviour.
I am insterested to dig in this issue. {gtsummary} looks great for my use cases, but I need CIs to adopt it.
@larmarange, here is how I do this with a custom function up to now (with svyby()
and not svyciprop()
. It works fine for percentages by column and row , but I have no clue about how it could work for percentages by cell.
tbl_svysummary_test <- function(design, include, by) {
var_name <- deparse(substitute(include))
var_form <- as.formula(paste("~", var_name))
# var_form <- as.formula(paste("~", include)) # for debugging purposes
by_name <- deparse(substitute(by))
by_form <- as.formula(paste("~", by_name))
# by_form <- as.formula(paste("~", by)) # for debugging purposes
byout <- svyby(formula = var_form,
by = by_form,
design = design,
FUN=svymean, na.rm=TRUE, vartype="ci")
nchar_var <- as.numeric(nchar(x)[2]) # to remove name concatenation by {survey}
byout <- byout[,-1]
byout <- round(byout*100, 1)
col_est <- ncol(byout)/3
estimates <- byout[,1:col_est]
low_ci <- byout[,(col_est+1):(col_est*2)]
high_ci <- byout[,(col_est*2+1):(col_est*3)]
out <- estimates
for (i in 1:nrow(out)) {
out[i,] <- paste(paste(paste(estimates[i,], low_ci[i,], sep = " ["),
high_ci[i,], sep="-"),"]", sep = "")
}
out <- t(out)
characteristics <- substring(rownames(out), nchar_var+1)
out %>%
as.tibble() %>%
mutate(Characteristic = characteristics, .before = everything())
}
Which gives for proportions by row:
d %>%
tbl_svysummary_test(include = stype, by = both)
Characteristic | No | Yes |
---|---|---|
E | 64 [45.9-82.1] | 84.2 [76.3-92.2] |
H | 14 [4.8-23.2] | 5.3 [0.9-9.7] |
M | 22 [5-39] | 10.5 [4.7-16.3] |
For porportion by row I use the opposite tabulation
d %>%
tbl_svysummary_test(include = both, by = stype)
Characteristic | E | H | M |
---|---|---|---|
o | 22.2 [16.4-28] | 50 [27.3-72.7] | 44 [18-70] |
es | 77.8 [72-83.6] | 50 [27.3-72.7] | 56 [30-82] |
Note: row labels habe been cropped in this table, it doesn't seem important at this stage.
Is there something in this approach that might help developping that feature? I'd be happy to go for a try if you think it is compattible with the current tbl_svysummary()
design.
Sorry. This issue has been opened for more than a year now.
I just drafted a PR implemented support for proportions, means and medians CIs with tbl_svysummary()
You can have a look at it: #1402
Sorry. This issue has been opened for more than a year now.
I just drafted a PR implemented support for proportions, means and medians CIs with
tbl_svysummary()
You can have a look at it: #1402
This is great! I'll make some tests by the beginning of next week. Many thanks. From where I sit, that was very fast.
I am testing with the survey data I am analysing (which I am not allowed to share) and it works great! I'm still struggling with the add_ci() funciton to correctly format the output, asthe syntax for the number of decimal digits to display seems quite different from the tb_summary/tb_svysummary(), but it is a different topic. MANY thanks @larmarange for the update!