Do not coerce to factor in `tbl_svysummary()`

Open ddsjoberg opened this issue 1 year ago • 9 comments

@larmarange see this SO post: https://stackoverflow.com/questions/77957551

Is this something we should address? It seems that the survey method for subset() doesn't remove rows but sets their weights to 0, so users can't remove unobserved levels from `by` variables in tbl_svysummary().

ddsjoberg avatar Feb 09 '24 16:02 ddsjoberg

Because the default (t.test) is not implemented for tbl_svysummary(). You should use "smd", cf. https://www.danieldsjoberg.com/gtsummary/reference/tests.html#tbl-svysummary-add-difference-

Currently, add_difference() does not change the default tests when applied to a tbl_svysummary()

larmarange avatar Feb 09 '24 17:02 larmarange

I was thinking more about the tbl_svysummary() table itself. The unobserved columns appear in the table, even if we make the underlying column character.

library(gtsummary)
library(PNSIBGE)

pns <- get_pns(year = 2019, labels = TRUE)
pns.2 <- subset(pns, C009  %in% c("Branca", "Preta")) 
pns.2$variables$C009 <- as.character(pns.2$variables$C009)

pns.2 |> 
  gtsummary::tbl_svysummary(by = C009, include = c(C006)) |> 
  gtsummary::as_kable()
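
| Characteristic | Amarela, N = 0 | Branca, N = 91,037,722 | Ignorado, N = 0 | Indígena, N = 0 | Parda, N = 0 | Preta, N = 21,786,515 |
|:---------------|:--------------:|:----------------------:|:---------------:|:---------------:|:------------:|:---------------------:|
| C006           |                |                        |                 |                 |              |                       |
| Homem          | 0 (NA%)        | 42,682,905 (47%)       | 0 (NA%)         | 0 (NA%)         | 0 (NA%)      | 10,691,164 (49%)      |
| Mulher         | 0 (NA%)        | 48,354,817 (53%)       | 0 (NA%)         | 0 (NA%)         | 0 (NA%)      | 11,095,351 (51%)      |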

But I just tried to tabulate directly with the survey package, and it still shows all levels, even when the column has previously been converted to a character.

image

So what they are dealing with is a non-standard situation, and they'd just need to write their own method in add_stat() for this, and hide the unobserved columns themselves.

ddsjoberg avatar Feb 09 '24 18:02 ddsjoberg

Probably because somewhere the levels are still declared. pns.2$variables$C009 <- as.character(pns.2$variables$C009) did not change metadata stored within the survey object.

It is much safer to use fct_drop() through srvyr::mutate().

larmarange avatar Feb 09 '24 18:02 larmarange

But a question remains open: if this is a tbl_svysummary table, should we apply, by default, a relevant test?

larmarange avatar Feb 09 '24 18:02 larmarange

Even when dropping the levels with srvyr, the unobserved levels still appear in the output of the survey function.

pns.2 <- 
  srvyr::as_survey_design(pns) |> 
  srvyr::filter(C009 %in% c("Branca", "Preta")) |> 
  srvyr::mutate(C009 = as.character(C009))

survey::svytable(~C009,pns.2)
#> C009
#>  Amarela   Branca Ignorado Indígena    Parda    Preta 
#>        0 91037722        0        0        0 21786515 

ddsjoberg avatar Feb 09 '24 18:02 ddsjoberg

But, yes, a better default is warranted!

ddsjoberg avatar Feb 09 '24 18:02 ddsjoberg

If I remember correctly, as.character() keeps the levels attribute, while forcats::fct_drop() removes unobserved levels.

larmarange avatar Feb 09 '24 18:02 larmarange

Same issue with forcats::fct_drop() unfortunately
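
For comparison, on a plain factor outside any survey object, both conversions do discard the unused levels. A minimal base R sketch, assuming a hypothetical toy vector in place of C009 (not the PNS data) and using droplevels() as the base R counterpart of forcats::fct_drop():

```r
# Hypothetical stand-in for C009: four declared levels, two observed
race <- factor(
  c("Branca", "Preta", "Branca"),
  levels = c("Amarela", "Branca", "Parda", "Preta")
)

# A character vector carries no levels attribute at all
race_chr <- as.character(race)
levels(race_chr)

# droplevels() keeps the factor class but removes the unobserved levels
race_dropped <- droplevels(race)
levels(race_dropped)
```

So the levels that persist in the survey tabulation are presumably retained somewhere in the survey object rather than in the column itself.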

ddsjoberg avatar Feb 09 '24 19:02 ddsjoberg

Hi @larmarange, I am reading through this issue, and I am unclear on what the next steps are.

FYI, I have added another difference method based on the survey t-test in the new version.

image

But as far as this issue is concerned, we don't remove stratifying levels with zero weights, and making that kind of change (if that is the suggestion?) would require a larger conversation about the approach.

ddsjoberg avatar Jul 01 '24 23:07 ddsjoberg

Thanks @ddsjoberg for having added svy.t.test.

Regarding the second point, it seems that forcats::fct_drop() (used through srvyr::mutate()) fails in that specific case, but this is maybe a question for the srvyr package.

If I force a new factor with just two levels, it works.

pns.2 <- 
  srvyr::as_survey_design(pns) |> 
  srvyr::mutate(test = factor(C009, levels = c("Branca", "Preta"))) |> 
  srvyr::filter(C009 %in% c("Branca", "Preta"))

survey::svytable(~ test, pns.2)
#> test
#>   Branca    Preta 
#> 91037722 21786515 

larmarange avatar Jul 03 '24 19:07 larmarange

I think it works when it's forced to a factor, because the unspecified levels are coerced to NA, and the "unobserved" levels are lost?

image
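
That reading matches how factor() behaves on a plain vector. A minimal base R sketch, using a hypothetical toy vector rather than the PNS data:

```r
# Hypothetical stand-in for C009 (toy data, not the PNS survey)
race <- factor(c("Branca", "Preta", "Parda", "Branca", "Amarela"))
levels(race)

# Re-declaring the factor keeps only the requested levels;
# values outside that set are coerced to NA
race2 <- factor(race, levels = c("Branca", "Preta"))
levels(race2)

# Tabulate, showing the NA values created by the coercion
table(race2, useNA = "ifany")
```

In the pipeline above, the rows that would become NA are the same ones the subsequent filter excludes, so only the two declared levels are left to tabulate.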

ddsjoberg avatar Jul 03 '24 19:07 ddsjoberg

Anyway, it seems that any perceived issue is unrelated to our implementation. I think we can close this one.

ddsjoberg avatar Jul 03 '24 19:07 ddsjoberg