weights for tabyl

Open tklebel opened this issue 6 years ago • 22 comments

Feature requests: weights

It would be great if one could specify weights when using tabyl. Weights are very common in survey data, and without them tabyl is of little use in that setting.

If the implementation is more or less straightforward and within the scope of your package, I would be happy to assist with a pull request. I looked at the source for tabyl, and it seems to me that passing an argument down to dplyr::count as wt, like the following, should be enough?

library(dplyr)

test_df <- tribble(~x, ~wt,
                   "a", 1,
                   "a", 1,
                   "b", .5,
                   "b", .5)

test_df %>% 
  count(x)
#> # A tibble: 2 x 2
#>   x         n
#>   <chr> <int>
#> 1 a         2
#> 2 b         2

test_df %>% 
  count(x, wt = wt)
#> # A tibble: 2 x 2
#>   x         n
#>   <chr> <dbl>
#> 1 a      2.00
#> 2 b      1.00

Or is there something I am overlooking?

tklebel avatar Feb 20 '18 11:02 tklebel

Thanks for this clearly stated feature request! I have been thinking about it and feel unsure. I see questions of fit and implementation.

Fit: Someone else has mentioned wanting this before and it seems like a useful feature to some users. Janitor's boundaries aren't very crisp, but while tabyl could be seen as a data cleaning tool (e.g., exploring a variable), this starts to be more of a purely data analysis feature. I don't know of a tidy tools package for survey data specifically, which might be the perfect home for this.

Implementation: You're right that the change amounts to passing the wt argument to dplyr::count. This would be simple in janitor 0.3.1 but more complex in the new version, because tabyl now includes the former function crosstab and a 3-way version as well. Do/would you use weighting in those contexts, that is, counts of 2-3 variables with weighting by another one? So say with dplyr, count(mtcars, cyl, am, wt = mpg). I ask genuinely and would love to hear from anyone on this, as I don't use weighting like this in my own analyses.

If it makes sense and would be used in the 2- and 3- variable contexts, then I think it's implementable. It's yet another argument to tabyl but that's an acceptable trade-off if enough people would use it.

Would love to hear from other users and janitor contributors/stakeholders!

sfirke avatar Feb 22 '18 03:02 sfirke

I have to admit, I am not an expert when it comes to survey weights, but from what I gather, the point is as follows:

In surveys you might have at least two variants of weights: design weights to counter over-sampling of different sub-populations, and post-stratification weights to counter nonresponse among other things. Especially when counting and cross-tabulating, those weights need to be incorporated, because percentages would otherwise not be representative for the population.

For weighting, this would mean that you sum the weights per group, as dplyr::count does, instead of simply counting categories. So to answer your question: yes, for counts of 2-3 variables you would want to weight by another variable. But it does not seem to be as straightforward as I thought before. The total n of a crosstable, for example, should not be the sum of the weights but still the unweighted n, since the sample size stays the same whether you weight the cases or not. This, however, would probably be difficult to implement in janitor, since the adorn_totals function would need to be changed as well.
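To make the distinction concrete, here is a minimal sketch using the test_df from the first post. It only illustrates the point above and is not janitor code; the column names n_unweighted, n_weighted and pct_weighted are made up for this example.

```r
library(dplyr)

test_df <- tribble(~x, ~wt,
                   "a", 1,
                   "a", 1,
                   "b", .5,
                   "b", .5)

weighted_tab <- test_df %>%
  group_by(x) %>%
  summarise(
    n_unweighted = n(),     # sample size per group, unaffected by weights
    n_weighted   = sum(wt)  # sum of weights per group, as count(wt = ) does
  ) %>%
  # percentages are based on the weighted counts
  mutate(pct_weighted = n_weighted / sum(n_weighted))

weighted_tab
```

The percentages come from the weighted counts, while the unweighted n stays available for reporting the sample size.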

All in all, it seems to me that a separate package for tidy survey analysis, or at least for survey crosstabs, would be a better fit than your package. It just feels like a pity to start within one package (janitor) to explore variables (with a syntax which works great) and then need to move to something else for correct counts. From the user's perspective, weighting should be a simple task (in SPSS you simply "throw a switch" and "it works", although as usual it is tough to know what SPSS really does). But the implementation is probably not as straightforward.

tklebel avatar Feb 22 '18 11:02 tklebel

I was just looking at the questionr package, for survey analysis, and it has a function similar to tabyl but supporting weighting. Perhaps the answer here is to tackle the survey analysis with questionr which is specifically survey-oriented?

sfirke avatar Mar 08 '18 03:03 sfirke

The problem with the questionr package is that it's not tidyverse friendly, so it doesn't really answer the need for a version of tabyl() that allows weighting. I've started using the janitor package and find it very useful, but for my final analysis, which requires weights, I'm unfortunately back to having to create my own ad hoc functions.

ghost avatar Jul 18 '18 17:07 ghost

I'd second that weighting would be useful. You've already taken a step into analysis in implementing the chisq.test() and fisher.test() functions. Why not allow for weighting too?

jackobailey avatar Apr 30 '19 09:04 jackobailey

Is srvyr an acceptable, tidyverse alternative? https://cran.r-project.org/web/packages/srvyr/vignettes/srvyr-vs-survey.html

sfirke avatar Apr 30 '19 13:04 sfirke

I've lately been using the wt argument in dplyr::count() to set up my own version of tabyl(). I've found it more flexible than using survey/srvyr.

ghost avatar Apr 30 '19 21:04 ghost

For what it's worth, here's something I wrote that uses count() and then some adorn_ functions later, to get around the weighting issue:

library(dplyr)  # %>%, count(), mutate_at(), sym()
library(tidyr)  # spread()

xtab_1v <- function(data, dv, iv, weight = NULL) {
  data %>% 
    mutate_at(vars(!!dv, !!iv), forcats::fct_explicit_na) %>% 
    {
      if (is.null(weight)) count(., !!sym(dv), !!sym(iv)) 
      else count(., !!sym(dv), !!sym(iv), wt = !!sym(weight))
    } %>% 
    group_by(!!sym(dv)) %>% 
    mutate(pct = n / sum(n)) %>% 
    ungroup() %>% 
    select(-n) %>% 
    spread(!!sym(iv), pct) %>% 
    janitor::adorn_pct_formatting() %>% 
    janitor::adorn_ns(ns = {
      data %>% 
        mutate_at(vars(!!dv, !!iv), forcats::fct_explicit_na) %>% 
        {
          if (is.null(weight)) count(., !!sym(dv), !!sym(iv)) 
          else count(., !!sym(dv), !!sym(iv), wt = !!sym(weight))
        } %>% 
        mutate(n = round(n, 1)) %>% 
        spread(!!sym(iv), n, fill = 0)
    }) %>% 
    mutate(variable = dv) %>% 
    rename("value" = dv) %>% 
    `[`(TRUE, c(length(.), 1:(length(.) - 1)))
}

We can try with:

set.seed(1839)
data <- data.frame(
  x = sample(letters[1:2], 200, TRUE), 
  y = sample(letters[3:4], 200, TRUE), 
  weight = runif(200)
)
xtab_1v(data, "x", "y", "weight")

Which returns:

# A tibble: 2 x 4
  variable value c            d           
  <chr>    <fct> <chr>        <chr>       
1 x        a     48.7% (27.3) 51.3% (28.8)
2 x        b     48.8% (22.1) 51.2% (23.2)

As always, be very careful when using weights in R. Make sure you know what you're doing, as there are many different types of weights out there. This function should be fine, since the real issues come up when calculating standard errors.

markhwhiteii avatar Jun 10 '19 15:06 markhwhiteii

I do agree. This is a good function and a good package, tidyverse-compatible, and the adorn stuff is perfect. But as I work with data coming from the French public statistics system, everything is weighted and I really cannot use it! Weights for tabyl would be great!

jcthrawn avatar Feb 26 '20 17:02 jcthrawn

Perhaps the srvyr package could help with this?

library(tidyverse)
library(srvyr)

set.seed(1839)
dat <- tibble(
  x = factor(rbinom(200, 1, .5)), 
  y = factor(rbinom(200, 1, .5)), 
  w = runif(200)
)

dat %>% 
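  # note: w is not passed here, so the estimates below are unweighted;
  # as_survey_design(weights = w) would apply the weights
  as_survey() %>% 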
  group_by(x, y) %>% 
  summarise(pct = survey_mean())
# A tibble: 4 x 4
  x     y       pct pct_se
  <fct> <fct> <dbl>  <dbl>
1 0     0     0.491 0.0478
2 0     1     0.509 0.0478
3 1     0     0.5   0.0528
4 1     1     0.5   0.0528

markhwhiteii avatar Feb 27 '20 03:02 markhwhiteii

Hi all, I would like to second the requests for adding a weights option for tabyl. I'm a novice user of R trying to convince my colleagues to pick up R for its many advantages, including reproducibility.
However, something as simple as making a 2 x 2 table of percentages with weights (dplyr-compatible) isn't quite so straightforward in R for a rookie.

Thanks for considering.

olanderb avatar Feb 27 '20 09:02 olanderb

I am open to adding this feature, if people are confident it's not already implemented in another package (e.g., @markhwhiteii gives the example from srvyr last week) or if it would be worthwhile to add it here even if it exists elsewhere. In short, it would add an argument "wt" to tabyl() and weight the 1-, 2-, or 3-way tabyls accordingly. (Yes?)

I don't have the time to implement this feature, though, so someone would need to own that. Design what users want & what will be able to be implemented, then create the code and tests on a fork and submit a pull request. I can advise and give some feedback, especially where it relates to the internal code of tabyl (which is approachable, I don't use anything ultra advanced, but it makes more sense after you get oriented to it, especially if you're not familiar with S3 methods).

I also don't know much about weighting in analysis so would need that perspective represented as well.

sfirke avatar Mar 03 '20 03:03 sfirke

Hi everybody,

I may be wrong because I haven't tested it for long, but survey/srvyr doesn't seem to offer much more for weighted tables than dplyr does. We can simply do:

tab <- dat %>% group_by(var1, var2) %>% summarize(n=sum(wt)) %>% spread(var2, n)

It would be OK for me to write my own functions and to use janitor's adorn_ functions, which I really love, on the result. But I also think it's better when it's straightforward and an absolute beginner can easily read the code, especially when we are talking about something as basic as a crosstab. To teach statistics to social science students who know nothing about programming, I currently use SAS and would love to switch to R, because it's free and works better, but I hesitate when I see how much you need to code to produce, format and export a simple table. You have created an efficient and easy function to make formatted crosstabs in a tidy way; it would be a pity if half the world cannot use it because they need weights! And we cannot pipe weighted data into the tabyl function: the weighting has to be done when the frequencies are counted. A simple "wt" argument for all the 1-, 2- and 3-way tabyls would effectively be enough. Since a weight variable is simply numeric, generally doesn't have any NAs, and the multiplication only has to be done once, it shouldn't create many problems. (Something a bit more complicated would be an option for normalized weights, which ensures that the total number of individuals doesn't change, i.e. that the sum of the weights equals n; but I think many of us don't really need this, at least with representative samples weighted to obtain the overall population number.)
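For readers unfamiliar with normalized weights, a minimal sketch of the rescaling described above follows. The data and the column names (dat, var1, wt, wt_norm) are invented for illustration; this is not an existing janitor option.

```r
library(dplyr)

set.seed(1)
dat <- tibble(
  var1 = sample(c("a", "b"), 100, replace = TRUE),
  wt   = runif(100, 0.5, 2)   # made-up weights
)

# Rescale the weights so that they sum to the number of rows (n)
dat_norm <- dat %>%
  mutate(wt_norm = wt * n() / sum(wt))

# Weighted counts whose total equals the sample size
dat_norm %>%
  count(var1, wt = wt_norm)
```

The proportions are the same as with the raw weights; only the totals change.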

Unfortunately, I don't think I can implement the feature myself, because I have only been programming in R for two weeks, I still often produce quite "ugly" code, and I don't know what S3 methods are! But I can test it with different data and give feedback.

jcthrawn avatar Mar 04 '20 08:03 jcthrawn

I want to check my understanding here. Would the wt column be a numeric vector present in the input data.frame, and at the stage of making the counts (of either 1 variable or combinations of 2 variables), we would multiply the counts by the wt variable? Then everything else would proceed like it already does - the adorn_ functions, etc?

sfirke avatar Aug 30 '21 15:08 sfirke

For a minimal weight argument, that's it: the counts just have to be multiplied once, and all other functions can stay the same (provided the user knows they can't run chi-squared tests or calculate confidence intervals on a weighted table; storing both weighted and unweighted results in attributes, and modifying functions to access the base counts, won't be needed because that's not the point of janitor).
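Until tabyl() has such an argument, the same idea can be sketched outside the package. The helper below, weighted_tabyl_2way, is hypothetical and not part of janitor: it sums the weights with dplyr::count(), reshapes the result, and converts it with janitor::as_tabyl() so that the adorn_ functions work as usual.

```r
library(dplyr)
library(tidyr)
library(janitor)

# Hypothetical helper (not a janitor function): weighted two-way table
weighted_tabyl_2way <- function(dat, row_var, col_var, wt) {
  dat %>%
    # sum the weights for each combination of the two variables
    count({{ row_var }}, {{ col_var }}, wt = {{ wt }}) %>%
    # one column per level of col_var; combinations never seen become 0
    pivot_wider(names_from = {{ col_var }}, values_from = n, values_fill = 0) %>%
    as.data.frame() %>%
    as_tabyl()
}

# Same variables as the count(mtcars, cyl, am, wt = mpg) example earlier in the thread
tab <- weighted_tabyl_2way(mtcars, cyl, am, wt = mpg)

tab %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting()
```

As noted above, the counts in such a table are sums of weights, so tests and confidence intervals computed from them would not be valid.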

BriceNocenti avatar Aug 30 '21 16:08 BriceNocenti

The easiest solution I came up with was just to use the wt argument of the dplyr::count() function (as mentioned above by @tklebel and others). I had to duplicate parts of the code for calling count() with and without the wt argument (checking with missing()). Duplicating the code feels a bit inelegant, but at least it seems to work: I cross-checked the results with the table command and fweight in Stata and the results are identical, even for the 3-way tables.

I am sure my solution needs some testing and can be improved (checking for NAs, for example), but I think that's pretty straightforward. Implementing normalized weights (as suggested by @jcthrawn) should be pretty easy as well. And since I basically just added another call of count(), I hopefully did not break anything else...
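One possible way to avoid the duplication, sketched here under the assumption that forwarding a NULL default to count() is acceptable: dplyr::count() documents wt = NULL as plain row counting, so a single call can serve both cases. The helper name count_maybe_weighted is made up for this example and is not taken from the fork.

```r
library(dplyr)

# Hypothetical helper: a single code path for weighted and unweighted counts
count_maybe_weighted <- function(dat, var, wt = NULL) {
  # when wt is not supplied, NULL is forwarded and count() tallies rows
  count(dat, {{ var }}, wt = {{ wt }})
}

test_df <- tribble(~x, ~wt,
                   "a", 1,
                   "a", 1,
                   "b", .5,
                   "b", .5)

unweighted <- count_maybe_weighted(test_df, x)      # row counts
weighted   <- count_maybe_weighted(test_df, x, wt)  # sums of wt
```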

I am a bit hesitant to just open a pull request since I did only a very limited amount of testing. You could perhaps look at the commit in my fork and decide if there should be any changes beforehand: gdutz/janitor@61540bf44fcf506e1d12a36c525e7b38ffc6409a

gdutz avatar Aug 31 '21 11:08 gdutz

Just wanted to chime in that I would love if this package could replace my haphazard personal functions I use to generate weighted tables and crosstabs. When I work with data, it's almost always weighted. (I get the feeling that those who use weights, use them a lot.) I am able to generate my weights, but having them come through in a one-line call is important. I would like to jump on the bandwagon for this to be implemented, if there is a wagon to get on.

Sopwith avatar Apr 05 '22 18:04 Sopwith

I'd like to add another request that a weight option be added to tabyl. We wrote our own function to do similar crosstabs with weights but would love to just recommend that folks use the otherwise excellent tabyl.

The example by @gdutz in https://github.com/sfirke/janitor/issues/183#issuecomment-909129910 seems to work well.

benzipperer avatar Dec 05 '22 19:12 benzipperer

I support the request to add support for weights. I am a university statistics professor and with this feature I could fully recommend the use of tabyl.

sirojasv avatar Sep 12 '23 17:09 sirojasv

Thank you! Great package. Only wish it had weights.

jschmittwdc avatar Nov 06 '23 12:11 jschmittwdc

I would like to reiterate the request to add a weight feature to this package :) It would avoid a lot of duplicated code using counts. Thank you

SimonEdscer avatar Jan 25 '24 14:01 SimonEdscer

Just thought I would add my support for this as well. Tabyl is one of the best options I've found for making tabs and crosstabs. It would be much more useful if we could use weights.

dakotainstitute avatar Feb 08 '24 15:02 dakotainstitute