janitor
weights for tabyl
Feature requests: weights
It would be great if one could specify weights when using tabyl. For survey data, weights are very common; without them, using tabyl does not make much sense in this case.
If the implementation is more or less straightforward and within the scope of your package, I would be happy to assist with a pull request. I looked at the source for tabyl and it seems to me that passing down an argument to dplyr::count for wt like the following should be enough?
library(dplyr)
test_df <- tribble(
  ~x,  ~wt,
  "a", 1,
  "a", 1,
  "b", .5,
  "b", .5
)
test_df %>%
  count(x)
#> # A tibble: 2 x 2
#> x n
#> <chr> <int>
#> 1 a 2
#> 2 b 2
test_df %>%
  count(x, wt = wt)
#> # A tibble: 2 x 2
#> x n
#> <chr> <dbl>
#> 1 a 2.00
#> 2 b 1.00
Or is there something I am overlooking?
Thanks for this clearly stated feature request! I have been thinking about it and feel unsure. I see questions of fit and implementation.
Fit
Someone else has mentioned wanting this before and it seems like a useful feature to some users. Janitor's boundaries aren't very crisp, but while tabyl could be seen as a data cleaning tool (e.g., exploring a variable), this starts to be more of a purely data analysis feature. I don't know of a tidy tools package for survey data specifically, which might be the perfect home for this.
Implementation
You're right about modifying dplyr::count to use the wt argument. This change would be simple in janitor 0.3.1 but more complex in the new version, because tabyl now includes the former function crosstab and a 3-way version as well. Do/would you use weighting in those contexts, that is, counts of 2-3 variables with weighting by another one? So say with dplyr, count(mtcars, cyl, am, wt = mpg). I ask genuinely and would love to hear from anyone on this, as I don't use weighting like this in my own analyses.
If it makes sense and would be used in the 2- and 3- variable contexts, then I think it's implementable. It's yet another argument to tabyl but that's an acceptable trade-off if enough people would use it.
Would love to hear from other users and janitor contributors/stakeholders!
I have to admit, I am not an expert when it comes to survey weights, but from what I gather, the point is as follows:
In surveys you might have at least two variants of weights: design weights to counter over-sampling of different sub-populations, and post-stratification weights to counter nonresponse among other things. Especially when counting and cross-tabulating, those weights need to be incorporated, because percentages would otherwise not be representative for the population.
For weighting this would mean that you sum the weights per group, as dplyr::count does, instead of simply counting categories. To answer your question, then: yes, for counts of 2-3 variables you would want to weight by another variable. But it seems not to be as straightforward as I thought before. The total n of a crosstable, for example, should not be the sum of the weights, but still the unweighted n, since the sample size stays the same whether you weight the cases or not. This, however, would probably be difficult to implement in janitor, since the adorn_totals function would need to be changed as well.
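To make the distinction concrete, here is a minimal sketch (not janitor code) that keeps the unweighted n next to the weighted sum for the toy data from the first post. The column names n_unweighted, n_weighted and pct_weighted are made up for illustration.

```r
library(dplyr)

# Toy data from the first post: two groups with unequal weights.
test_df <- tribble(
  ~x,  ~wt,
  "a", 1,
  "a", 1,
  "b", 0.5,
  "b", 0.5
)

weighted_summary <- test_df %>%
  group_by(x) %>%
  summarise(
    n_unweighted = n(),     # number of sampled cases per group
    n_weighted   = sum(wt)  # sum of weights, i.e. what count(x, wt = wt) returns
  ) %>%
  # Percentages are based on the weights, not on the raw case counts.
  mutate(pct_weighted = n_weighted / sum(n_weighted))

weighted_summary

# The sample size is still the number of rows, regardless of weighting.
sum(weighted_summary$n_unweighted)
```

The percentages come from the weighted sums, while the reported sample size comes from the unweighted counts, which is the part a weighted adorn_totals would have to handle.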
All in all, it seems to me that a separate package for tidy survey analysis, or at least for survey crosstabs, would be a better fit than your package. It just feels a pity to start within one package (janitor) to explore variables (with a syntax which works great) and then need to move to something else for correct counts. From the user perspective, weighting should be a simple task (in SPSS you simply "throw a switch" and "it works", although as usual it is tough to know what SPSS really does). But the implementation is probably not as straightforward.
I was just looking at the questionr package, for survey analysis, and it has a function similar to tabyl but supporting weighting. Perhaps the answer here is to tackle the survey analysis with questionr, which is specifically survey-oriented?
The problem with the questionr package is that it's not tidyverse friendly, so it doesn't really answer the need for a version of tabyl() that allows weighting. I've started using the janitor package, and find it very useful, but for my final analysis, which requires weights, I'm unfortunately back to having to create my own ad hoc functions.
I'd second that weighting would be useful. You've already taken a step into analysis in implementing the chisq.test() and fisher.test() functions. Why not allow for weighting too?
Is srvyr an acceptable, tidyverse alternative? https://cran.r-project.org/web/packages/srvyr/vignettes/srvyr-vs-survey.html
I've lately been using the wt argument in dplyr::count() to set up my own version of tabyl(). I've found it more flexible than using survey/srvyr.
For what it's worth, here's something I wrote that uses count() and then some adorn_ functions later, to get around the weighting issue:
library(dplyr)  # count(), mutate_at(), group_by(), %>%
library(tidyr)  # spread()

# dv, iv and weight are column names passed as strings.
xtab_1v <- function(data, dv, iv, weight = NULL) {
  data %>%
    # make missing values an explicit factor level
    mutate_at(vars(!!dv, !!iv), forcats::fct_explicit_na) %>%
    {
      if (is.null(weight)) count(., !!sym(dv), !!sym(iv))
      else count(., !!sym(dv), !!sym(iv), wt = !!sym(weight))
    } %>%
    # row percentages within each level of dv
    group_by(!!sym(dv)) %>%
    mutate(pct = n / sum(n)) %>%
    ungroup() %>%
    select(-n) %>%
    spread(!!sym(iv), pct) %>%
    janitor::adorn_pct_formatting() %>%
    # attach the (weighted) counts next to the percentages
    janitor::adorn_ns(ns = {
      data %>%
        mutate_at(vars(!!dv, !!iv), forcats::fct_explicit_na) %>%
        {
          if (is.null(weight)) count(., !!sym(dv), !!sym(iv))
          else count(., !!sym(dv), !!sym(iv), wt = !!sym(weight))
        } %>%
        mutate(n = round(n, 1)) %>%
        spread(!!sym(iv), n, fill = 0)
    }) %>%
    mutate(variable = dv) %>%
    rename("value" = dv) %>%
    # move the last column (variable) to the front
    `[`(TRUE, c(length(.), 1:(length(.) - 1)))
}
We can try with:
set.seed(1839)
data <- data.frame(
  x = sample(letters[1:2], 200, TRUE),
  y = sample(letters[3:4], 200, TRUE),
  weight = runif(200)
)
xtab_1v(data, "x", "y", "weight")
Which returns:
# A tibble: 2 x 4
variable value c d
<chr> <fct> <chr> <chr>
1 x a 48.7% (27.3) 51.3% (28.8)
2 x b 48.8% (22.1) 51.2% (23.2)
As always, be very careful when using weights in R. Make sure you know what you're doing, as there are many different types of weights out there. This function should be fine, since where the real issues come up are when calculating standard errors.
I do agree. This is a good function and a good package, tidyverse compatible, and the adorn stuff is perfect. But as I work with data coming from the French public statistics system, everything is weighted and I really cannot use it! Weights for tabyl would be great!
Perhaps the srvyr package could help with this?
library(tidyverse)
library(srvyr)
set.seed(1839)
dat <- tibble(
  x = factor(rbinom(200, 1, .5)),
  y = factor(rbinom(200, 1, .5)),
  w = runif(200)
)
dat %>%
  as_survey() %>%
  group_by(x, y) %>%
  summarise(pct = survey_mean())
# A tibble: 4 x 4
x y pct pct_se
<fct> <fct> <dbl> <dbl>
1 0 0 0.491 0.0478
2 0 1 0.509 0.0478
3 1 0 0.5 0.0528
4 1 1 0.5 0.0528
Hi all,
I would like to second the requests for adding a weights option for tabyl.
I'm a novice user of R trying to convince my colleagues to pick up R for its many advantages, including reproducibility.
However, something as simple as making a 2 x 2 table with weighted percentages (dplyr compatible) isn't quite so straightforward in R for a rookie.
Thanks for considering.
I am open to adding this feature, if people are confident it's not already implemented in another package (e.g., @markhwhiteii gave the example from srvyr last week) or if it would be worthwhile to add it here even if it exists elsewhere. In short, it would add an argument "wt" to tabyl() and weight the 1-, 2-, or 3-way tabyls accordingly. (Yes?)
I don't have the time to implement this feature, though, so someone would need to own that. Design what users want & what will be able to be implemented, then create the code and tests on a fork and submit a pull request. I can advise and give some feedback, especially where it relates to the internal code of tabyl (which is approachable, I don't use anything ultra advanced, but it makes more sense after you get oriented to it, especially if you're not familiar with S3 methods).
I also don't know much about weighting in analysis so would need that perspective represented as well.
Hi everybody,
I may be wrong because I haven't tested it in a long time, but survey/srvyr doesn't seem to add much over dplyr for weighted tables. We can simply do:
tab <- dat %>% group_by(var1, var2) %>% summarize(n=sum(wt)) %>% spread(var2, n)
I would be fine with writing my own functions and then using janitor's adorn_ functions, which I really love, on the result. But I also think it is better when things are straightforward and an absolute beginner can easily read the code, especially for something as basic as a crosstab. To teach statistics to social science students who know nothing about programming, I currently use SAS and would love to switch to R, because it is free and works better; but I hesitate when I see how much you need to code to produce, format and export a simple table. You have created an efficient and easy function to make formatted crosstabs in a tidy way, and it would be a pity if half the world could not use it because they need weights! We also cannot pipe already-weighted data into the tabyl function: the weighting has to happen when the frequencies are counted. A simple argument "wt" for all the 1-, 2- and 3-way tabyls would indeed be enough. Since a weight variable is simply numeric, generally has no NAs, and only has to be applied once, it should not create many problems. Something a bit more complicated would be an option for normalized weights, which ensure that the total number of individuals does not change, i.e. that the sum of the weights equals n; but I think many of us do not really need this, at least with representative samples weighted to obtain the overall population number.
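For the normalized-weights idea, a minimal sketch in plain dplyr could look like the following. It assumes "normalized" means rescaling the weights so that they sum to the number of rows; the data and the column name wt_norm are invented for illustration.

```r
library(dplyr)

# Hypothetical data: weights that expand the sample to a population total.
dat <- tibble(
  var1 = c("a", "a", "b", "b"),
  wt   = c(100, 300, 200, 200)
)

# Rescale so the weights sum to the sample size instead of the population size.
dat_norm <- dat %>%
  mutate(wt_norm = wt * n() / sum(wt))

# Weighted counts on the normalized scale add up to nrow(dat).
norm_counts <- count(dat_norm, var1, wt = wt_norm)
norm_counts
```

The proportions are the same as with the raw weights; only the scale of the counts changes.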
Unfortunately I don't think I can implement the feature myself: I have only been programming in R for two weeks, I still often produce quite "ugly" code, and I don't know what S3 methods are! But I can test it with different data and give feedback.
I want to check my understanding here. Would the wt column be a numeric vector present in the input data.frame, and at the stage of making the counts (of either 1 variable or combinations of 2 variables), we would multiply the counts by the wt variable? Then everything else would proceed like it already does: the adorn_ functions, etc.?
For a minimal weight argument, that's it: the counts just have to be multiplied once, and all other functions can stay the same, provided the user knows they can't run chi-squared tests or calculate confidence intervals on a weighted table. Storing both weighted and unweighted results in attributes, and modifying functions to access the base counts, isn't needed, because that's not the point of janitor.
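As a rough sketch of that workflow with what already exists, one can build the weighted counts with dplyr::count(), reshape them, and hand the result to janitor::as_tabyl() so that the adorn_ functions apply unchanged. The data and the name weighted_tab are invented for illustration, and this is not how a wt argument inside tabyl() would necessarily be implemented.

```r
library(dplyr)
library(tidyr)
library(janitor)

# Hypothetical data with a numeric weight column.
dat <- tibble(
  var1 = c("a", "a", "b", "b", "b"),
  var2 = c("c", "d", "c", "d", "d"),
  wt   = c(1, 2, 0.5, 1.5, 1)
)

weighted_tab <- dat %>%
  count(var1, var2, wt = wt) %>%  # sum of weights per cell
  pivot_wider(names_from = var2, values_from = n, values_fill = 0) %>%
  as.data.frame() %>%
  as_tabyl()                      # lets the adorn_ functions treat it as a tabyl

# From here on, the usual adorn_ pipeline works on the weighted counts.
weighted_tab %>%
  adorn_totals("row") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting()
```

Note that the totals produced this way are sums of weights, not the unweighted sample size.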
The easiest solution I came up with was just to use the wt argument of the dplyr::count() function (as mentioned above by @tklebel and others). I had to duplicate parts of the code to call count() with and without the wt argument (checking with missing()). Duplicating the code feels a bit inelegant, but at least it seems to work: I cross-checked the results against the table command with fweight in Stata and the results are identical, even for the 3-way tables.
I am sure my solution needs some testing and can be improved (checking for NAs, for example), but I think that's pretty straightforward. Implementing normalized weights (as suggested by @jcthrawn) should be pretty easy as well. And since I basically just added another call of count(), I hopefully did not break anything else...
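To illustrate the branching described above, here is a minimal sketch of calling count() with or without wt depending on missing(). It is not the code from my fork; count_maybe_weighted is a hypothetical helper for a single variable, assuming a dplyr version that supports the {{ }} operator.

```r
library(dplyr)

# Hypothetical helper: unweighted counts if wt is not supplied,
# sum of weights per group otherwise.
count_maybe_weighted <- function(dat, var, wt) {
  if (missing(wt)) {
    count(dat, {{ var }})
  } else {
    count(dat, {{ var }}, wt = {{ wt }})
  }
}

# Toy data from the first post.
test_df <- tribble(
  ~x,  ~wt,
  "a", 1,
  "a", 1,
  "b", 0.5,
  "b", 0.5
)

unweighted <- count_maybe_weighted(test_df, x)
weighted   <- count_maybe_weighted(test_df, x, wt)
```

The two branches differ only in the wt argument, which is where the duplication comes from.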
I am a bit hesitant to just open a pull request since I did only a very limited amount of testing. You could perhaps look at the commit in my fork and decide if there should be any changes beforehand: gdutz/janitor@61540bf44fcf506e1d12a36c525e7b38ffc6409a
Just wanted to chime in that I would love if this package could replace my haphazard personal functions I use to generate weighted tables and crosstabs. When I work with data, it's almost always weighted. (I get the feeling that those who use weights, use them a lot.) I am able to generate my weights, but having them come through in a one-line call is important. I would like to jump on the bandwagon for this to be implemented, if there is a wagon to get on.
I'd like to add another request that a weight option be added to tabyl. We wrote our own function to do similar crosstabs with weights but would love to just recommend that folks use the otherwise excellent tabyl.
The example by @gdutz in https://github.com/sfirke/janitor/issues/183#issuecomment-909129910 seems to work well.
I support the request to add support for weights. I am a university statistics professor and with this feature I could fully recommend the use of tabyl.
Thank you! Great package. Only wish it had weights.
I would like to reiterate the request to add a weight feature to this package :) It would avoid a lot of duplicated code using counts. Thank you
Just thought I would add my support for this as well. Tabyl is one of the best options I've found for making tabs and crosstabs. It would be much more useful if we could use weights.