tidybayes icon indicating copy to clipboard operation
tidybayes copied to clipboard

Decide what to do about dropping implicit variables in compose_data

Open aornugent opened this issue 7 years ago • 7 comments

Following #58, I wonder if there is scope for a general syntax for adding derived quantities in compose_data().

An easy example is censoring: often it is necessary to pass in a count of the number of censored values, and in JAGS an indicator of whether a value is censored or not. This isn't too hard to do at present:

df %>%
  compose_data(is_censored = ifelse(.$y == 0, 1, 0),
               n_censored = sum(.$y == 0))

The only ugly part is the need for .$y as opposed to y. I guess an alternative is to create an is_censored column in df, but it is usually something I do when passing in my data.

Interested to see what you think.

Thanks, A

aornugent avatar Aug 17 '17 10:08 aornugent

I'm not sure if this is the same thing, but another example would be declaring covariate matrices in Stan. Currently something like this works:

df %>% 
  compose_data(X = model.matrix(~ 1 + covariate_1 + covariate_2, data = .))

but is a little clunky and again leaves duplicated covariate columns.

I might have missed it in the docs, but I'm not sure how multivariate (matrix) data work either:

df %>%
  compose_data(y = as.matrix(.$species_1 : .$species_10)) #doesn't work

Cheers, A

aornugent avatar Aug 17 '17 13:08 aornugent

These are great, thanks!

1. Censoring

Along with #58 I changed compose_data to execute each argument in an environment that includes the variables generated by the previous arguments. This is similar to how dplyr::mutate works. So this works:

compose_data(
  a = 1,
  b = a + 1,
  c = b + 1
)
$a
[1] 1

$b
[1] 2

$c
[1] 3

Similarly the censoring example now works without the .$ prefix (also auto-conversion to int from logical in compose_data means you don't need the ifelse, and the new derived-value setup of compose_data means we might as well use is_censored to compute n_censored to avoid duplicating the censoring condition in case it changes):

df = data_frame(y = 0:10)

df %>%
  compose_data(
    is_censored = y == 0,
    n_censored = sum(is_censored)
  )
$y
 [1]  0  1  2  3  4  5  6  7  8  9 10

$n
[1] 11

$is_censored
 [1] 1 0 0 0 0 0 0 0 0 0 0

$n_censored
[1] 1

2. Model matrices

I'm not sure exactly what to do for the model matrix example that would improve it. Is there something in the output from that you would want that is not there already (besides not having the columns from the data frame)? Matrix dimensions come to my mind as a possibility.

3. Matrices in general

I'm not sure exactly what output you were looking for in the last example, could you give me an example of the ideal output here? (Maybe for #2 as well).

So far I've punted on doing anything special with matrices in compose_data, and just passed them along as-is on the assumption that people would want to use them for pre-processed data (like model matrices), so they would not require additional transformation. However I'm willing to be convinced otherwise :).

mjskay avatar Aug 17 '17 15:08 mjskay

Censoring looks great, thanks! Converting to logical is a nice touch.

Matrices:

These examples come up together because I have data declarations for Stan that often look like this:

data {
  int<lower=1> N; // # obs
  int<lower=1> S; // # species
  int<lower=1> K; // # covariates
  vector[K] X[N]; // covariates
  vector<lower=0>[S] y[N]; // observed data
}
...

where X is the model matrix example and y is what I was trying to achieve with the multivariate data example. You're right that the model.matrix setup works more or less as desired, but dimensions would sure help.

As for the multivariate data, I currently (without tidybayes) have to convert a dataset from long to wide, then to a matrix.

y <- select(cover, id, species, cover) %>%
  group_by(id, species) %>%
  summarise(cover = sum(cover)) %>%
  spread(species, cover, fill = 0)

data_list <- list(
  y = as.matrix(y[ . -1]), # drop index
  N = nrow(y),
  S = ncol(y) - 1)
}

This is a nuisance because it involves separating the observation data from the covariates, which in turn need reducing to the same dimensions as the wide format.

Where I imagined tidybayes streamlining this was if I had the full dataframe in wide format (covariates and all), then could index the columns that formed my matrices:

df %>%
  compose_data(y = select(column2 : column10),
  X = model.matrix(~ 1 + column11 + column12),
  n = nrow(y),  # better still to get the dimensions automatically.
  s = ncol(y),
  k = ncol(X)
)

Perhaps this is a pedantic use case, but it would cut my setup from 17 lines to 3 (ideally) and simplify post-processing.

Thanks for the support! A

aornugent avatar Aug 25 '17 13:08 aornugent

Okay cool, this helps clarify! I need to look at this in more detail, but initially looking at this sort of thing:

df %>% compose_data(
  y = select(column2 : column10),
  X = model.matrix(~ 1 + column11 + column12),
  n = nrow(y),
  s = ncol(y),
  k = ncol(X)
)

Setting aside the auto-generation of dimensions for the moment, assuming you had the data in wide format already, if you run the following on the latest version of tidybayes does it give the output you want?

df %>% compose_data(
  y = as.matrix(select(., column2 : column10)),
  X = model.matrix(~ 1 + column11 + column12, data = .),
  s = ncol(y),
  k = ncol(X)
)

I'm not saying this is the best syntax, I just wanted to confirm I understand the desired output.

Some points:

  • n should be auto-generated from nrow(df) already, which should be equal to nrow(y) in the above, so I omitted it.

  • The . is required to refer specifically to df in the definitions of y and X because compose_data executes the subsequent definitions in an environment containing the elements of the data list you are generating, not in the context of the first argument passed to it (df in this case). That approach is what allows the censoring example to work, and (I think) is more consistent with what compose_data is meant to do --- the alternative, of having subsequent parameters execute in the environment of the first data frame, would be useful for tasks that dplyr::mutate (and other dplyr functions) are already designed for, and I'm not sure it would make sense to duplicate that functionality here. One could always put a mutate call in the pipeline if needed.

The low-hanging fruit to pull out from the above is auto-generating dimensions. To set that up generically I would have it add columns named something like "nrow_X", "ncol_X", "nrow_y", "ncol_y", which would at least allow you to do (assuming you are fine with having to rename the corresponding indices in the model):

df %>% compose_data(
  y = as.matrix(select(., column2 : column10)),
  X = model.matrix(~ 1 + column11 + column12, data = .)
)

I think the definition for X is about as concise as it can be without redefining how compose_data does lazy evaluation in a way that will make other use cases more cumbersome (like censoring). As is, the definition of X is down to a single---pretty clear in intent---function call. So I might punt on that for now.

Let me think a little more about the "generating a matrix from specific columns" task for y. I think the code there is reasonably concise as well, but there could a more generic class of tasks related to it that is worth specialized syntax. The only thing I can think of off the top of my head is something like matrix_from_columns(., column2 : column10), but it is not clear to me that that is enough clearer or more concise than as.matrix(select(., column2 : column10)) to warrant specialized syntax.

Thoughts/feedback on any of this is appreciated!

mjskay avatar Aug 25 '17 18:08 mjskay

I might have missed it in the docs, but I'm not sure how multivariate (matrix) data work either:

I was close! Must mean you've got an intuitive interface.

df %>% compose_data(
  y = as.matrix(select(., column2 : column10)),
  X = model.matrix(~ 1 + column11 + column12, data = .),
  s = ncol(y),
  k = ncol(X)
)

This does what I want. Manually generating column dimensions is no extra overhead, and naming them is often helpful to remember what they represent.

Two minor points:

  • keeping all the remaining dataframe columns clutters things up (think mutate vs. summarise), but I think this only matters when assigning and checking the output of compose_data.
  • if df is grouped, it drags the grouping variable through to y which (in this case) ends up as a factor, but it does fail loudly and isn't hard to diagnose/debug.

Cheers, A

aornugent avatar Aug 26 '17 09:08 aornugent

Awesome, glad it's working! And thanks so much for the feedback and interest --- it has helped push towards a better API for sure.

keeping all the remaining dataframe columns clutters things up (think mutate vs. summarise), but I think this only matters when assigning and checking the output of compose_data.

This in particular has been something I go back and forth on. I think the latest versions of stan no longer complain when you pass in data lists that have extra variables not present in the model, so I think this is less of an issue than it used to be. I think stan used to complain about extra variables, either that or I might be thinking of one of the JAGS interfaces. Does one of the interfaces you use complain about extra unused variables in the data list? That would help me know what to test on.

Two other solutions come to mind:

  1. (Does not require new syntax) Using [ or magrittr::extract to subset the list to just the elements you want:

    df %>% compose_data(
        y = as.matrix(select(., column2 : column10)),
        X = model.matrix(~ 1 + column11 + column12, data = .),
        s = ncol(y),
        k = ncol(X)
      ) %>%
      `[`(c("y", "X", "s", "k", "n"))
    

    That feels less than ideal, because it makes you list all the names twice.

  2. (Would require new syntax) Having an option to drop implicit/unnamed elements from the result. You would have to add the auto-generated columns back in to keep them, but you wouldn't have to redefine them. This is possibly nice if you are basically just using compose_data for the data type conversion but not the column generation:

    df %>% compose_data(
        y = as.matrix(select(., column2 : column10)),
        X = model.matrix(~ 1 + column11 + column12, data = .),
        s = ncol(y),
        k = ncol(X),
        n,         # need to make `n` explicit so it is kept, but no need to fully
                   # define it (as in n = nrow(df)) since it is still auto-generated
        .drop_implicit = TRUE
      )
    

    I think adding such an option would not require large changes to compose_data, so if that is something you would find useful I would consider it.

if df is grouped, it drags the grouping variable through to y which (in this case) ends up as a factor, but it does fail loudly and isn't hard to diagnose/debug.

This also comes up in some other parts of the tidybayes API, because spread_samples and other functions return pre-grouped data. My current solution is to accept that this sort of thing happens (in spread_samples, the alternative of forcing people to specify obvious groupings is more repetitive and tedious) and hope the loud failures are enough to prompt people to add ungroup() to their pipelines when needed.

mjskay avatar Aug 26 '17 18:08 mjskay

no longer complain when you pass in data lists that have extra variables not present in the model.

I've not had a problem with Stan or JAGS, so this isn't pressing by any means. It just looks ugly if you stop to inspect the output of compose_data.

That said, I do like the .drop_implicit flag. The magrittr magic is ok too, but the implicit columns are only really a problem when there's large differences from the default behaviour anyway (ie. lots to name).

aornugent avatar Aug 27 '17 13:08 aornugent