tidybayes
tidybayes copied to clipboard
Decide what to do about dropping implicit variables in compose_data
Following #58, I wonder if there is scope for a general syntax for adding derived quantities in compose_data()
.
An easy example is censoring: often it is necessary to pass in a count of the number of censored values, and in JAGS an indicator of whether a value is censored or not. This isn't too hard to do at present:
df %>%
compose_data(is_censored = ifelse(.$y == 0, 1, 0),
n_censored = sum(.$y == 0))
The only ugly part is the need for .$y
as opposed to y
. I guess an alternative is to create an is_censored
column in df
, but it is usually something I do when passing in my data.
Interested to see what you think.
Thanks, A
I'm not sure if this is the same thing, but another example would be declaring covariate matrices in Stan. Currently something like this works:
df %>%
compose_data(X = model.matrix(~ 1 + covariate_1 + covariate_2, data = .))
but is a little clunky and again leaves duplicated covariate
columns.
I might have missed it in the docs, but I'm not sure how multivariate (matrix) data work either:
df %>%
compose_data(y = as.matrix(.$species_1 : .$species_10)) #doesn't work
Cheers, A
These are great, thanks!
1. Censoring
Along with #58 I changed compose_data to execute each argument in an environment that includes the variables generated by the previous arguments. This is similar to how dplyr::mutate
works. So this works:
compose_data(
a = 1,
b = a + 1,
c = b + 1
)
$a
[1] 1
$b
[1] 2
$c
[1] 3
Similarly the censoring example now works without the .$
prefix (also auto-conversion to int from logical in compose_data
means you don't need the ifelse
, and the new derived-value setup of compose_data
means we might as well use is_censored
to compute n_censored
to avoid duplicating the censoring condition in case it changes):
df = data_frame(y = 0:10)
df %>%
compose_data(
is_censored = y == 0,
n_censored = sum(is_censored)
)
$y
[1] 0 1 2 3 4 5 6 7 8 9 10
$n
[1] 11
$is_censored
[1] 1 0 0 0 0 0 0 0 0 0 0
$n_censored
[1] 1
2. Model matrices
I'm not sure exactly what to do for the model matrix example that would improve it. Is there something in the output from that you would want that is not there already (besides not having the columns from the data frame)? Matrix dimensions come to my mind as a possibility.
3. Matrices in general
I'm not sure exactly what output you were looking for in the last example, could you give me an example of the ideal output here? (Maybe for #2 as well).
So far I've punted on doing anything special with matrices in compose_data
, and just passed them along as-is on the assumption that people would want to use them for pre-processed data (like model matrices), so they would not require additional transformation. However I'm willing to be convinced otherwise :).
Censoring looks great, thanks! Converting to logical is a nice touch.
Matrices:
These examples come up together because I have data declarations for Stan that often look like this:
data {
int<lower=1> N; // # obs
int<lower=1> S; // # species
int<lower=1> K; // # covariates
vector[K] X[N]; // covariates
vector<lower=0>[S] y[N]; // observed data
}
...
where X
is the model matrix example and y
is what I was trying to achieve with the multivariate data example. You're right that the model.matrix
setup works more or less as desired, but dimensions would sure help.
As for the multivariate data, I currently (without tidybayes
) have to convert a dataset from long to wide, then to a matrix.
y <- select(cover, id, species, cover) %>%
group_by(id, species) %>%
summarise(cover = sum(cover)) %>%
spread(species, cover, fill = 0)
data_list <- list(
y = as.matrix(y[ . -1]), # drop index
N = nrow(y),
S = ncol(y) - 1)
}
This is a nuisance because it involves separating the observation data from the covariates, which in turn need reducing to the same dimensions as the wide format.
Where I imagined tidybayes
streamlining this was if I had the full dataframe in wide format (covariates and all), then could index the columns that formed my matrices:
df %>%
compose_data(y = select(column2 : column10),
X = model.matrix(~ 1 + column11 + column12),
n = nrow(y), # better still to get the dimensions automatically.
s = ncol(y),
k = ncol(X)
)
Perhaps this is a pedantic use case, but it would cut my setup from 17 lines to 3 (ideally) and simplify post-processing.
Thanks for the support! A
Okay cool, this helps clarify! I need to look at this in more detail, but initially looking at this sort of thing:
df %>% compose_data(
y = select(column2 : column10),
X = model.matrix(~ 1 + column11 + column12),
n = nrow(y),
s = ncol(y),
k = ncol(X)
)
Setting aside the auto-generation of dimensions for the moment, assuming you had the data in wide format already, if you run the following on the latest version of tidybayes does it give the output you want?
df %>% compose_data(
y = as.matrix(select(., column2 : column10)),
X = model.matrix(~ 1 + column11 + column12, data = .),
s = ncol(y),
k = ncol(X)
)
I'm not saying this is the best syntax, I just wanted to confirm I understand the desired output.
Some points:
-
n should be auto-generated from nrow(df) already, which should be equal to nrow(y) in the above, so I omitted it.
-
The
.
is required to refer specifically todf
in the definitions ofy
andX
becausecompose_data
executes the subsequent definitions in an environment containing the elements of the data list you are generating, not in the context of the first argument passed to it (df
in this case). That approach is what allows the censoring example to work, and (I think) is more consistent with whatcompose_data
is meant to do --- the alternative, of having subsequent parameters execute in the environment of the first data frame, would be useful for tasks thatdplyr::mutate
(and other dplyr functions) are already designed for, and I'm not sure it would make sense to duplicate that functionality here. One could always put amutate
call in the pipeline if needed.
The low-hanging fruit to pull out from the above is auto-generating dimensions. To set that up generically I would have it add columns named something like "nrow_X", "ncol_X", "nrow_y", "ncol_y", which would at least allow you to do (assuming you are fine with having to rename the corresponding indices in the model):
df %>% compose_data(
y = as.matrix(select(., column2 : column10)),
X = model.matrix(~ 1 + column11 + column12, data = .)
)
I think the definition for X
is about as concise as it can be without redefining how compose_data
does lazy evaluation in a way that will make other use cases more cumbersome (like censoring). As is, the definition of X
is down to a single---pretty clear in intent---function call. So I might punt on that for now.
Let me think a little more about the "generating a matrix from specific columns" task for y
. I think the code there is reasonably concise as well, but there could a more generic class of tasks related to it that is worth specialized syntax. The only thing I can think of off the top of my head is something like matrix_from_columns(., column2 : column10)
, but it is not clear to me that that is enough clearer or more concise than as.matrix(select(., column2 : column10))
to warrant specialized syntax.
Thoughts/feedback on any of this is appreciated!
I might have missed it in the docs, but I'm not sure how multivariate (matrix) data work either:
I was close! Must mean you've got an intuitive interface.
df %>% compose_data(
y = as.matrix(select(., column2 : column10)),
X = model.matrix(~ 1 + column11 + column12, data = .),
s = ncol(y),
k = ncol(X)
)
This does what I want. Manually generating column dimensions is no extra overhead, and naming them is often helpful to remember what they represent.
Two minor points:
- keeping all the remaining dataframe columns clutters things up (think
mutate
vs.summarise
), but I think this only matters when assigning and checking the output ofcompose_data
. - if
df
is grouped, it drags the grouping variable through toy
which (in this case) ends up as a factor, but it does fail loudly and isn't hard to diagnose/debug.
Cheers, A
Awesome, glad it's working! And thanks so much for the feedback and interest --- it has helped push towards a better API for sure.
keeping all the remaining dataframe columns clutters things up (think mutate vs. summarise), but I think this only matters when assigning and checking the output of compose_data.
This in particular has been something I go back and forth on. I think the latest versions of stan no longer complain when you pass in data lists that have extra variables not present in the model, so I think this is less of an issue than it used to be. I think stan used to complain about extra variables, either that or I might be thinking of one of the JAGS interfaces. Does one of the interfaces you use complain about extra unused variables in the data list? That would help me know what to test on.
Two other solutions come to mind:
-
(Does not require new syntax) Using
[
ormagrittr::extract
to subset the list to just the elements you want:df %>% compose_data( y = as.matrix(select(., column2 : column10)), X = model.matrix(~ 1 + column11 + column12, data = .), s = ncol(y), k = ncol(X) ) %>% `[`(c("y", "X", "s", "k", "n"))
That feels less than ideal, because it makes you list all the names twice.
-
(Would require new syntax) Having an option to drop implicit/unnamed elements from the result. You would have to add the auto-generated columns back in to keep them, but you wouldn't have to redefine them. This is possibly nice if you are basically just using compose_data for the data type conversion but not the column generation:
df %>% compose_data( y = as.matrix(select(., column2 : column10)), X = model.matrix(~ 1 + column11 + column12, data = .), s = ncol(y), k = ncol(X), n, # need to make `n` explicit so it is kept, but no need to fully # define it (as in n = nrow(df)) since it is still auto-generated .drop_implicit = TRUE )
I think adding such an option would not require large changes to
compose_data
, so if that is something you would find useful I would consider it.
if df is grouped, it drags the grouping variable through to y which (in this case) ends up as a factor, but it does fail loudly and isn't hard to diagnose/debug.
This also comes up in some other parts of the tidybayes
API, because spread_samples
and other functions return pre-grouped data. My current solution is to accept that this sort of thing happens (in spread_samples
, the alternative of forcing people to specify obvious groupings is more repetitive and tedious) and hope the loud failures are enough to prompt people to add ungroup()
to their pipelines when needed.
no longer complain when you pass in data lists that have extra variables not present in the model.
I've not had a problem with Stan or JAGS, so this isn't pressing by any means. It just looks ugly if you stop to inspect the output of compose_data
.
That said, I do like the .drop_implicit
flag. The magrittr magic is ok too, but the implicit columns are only really a problem when there's large differences from the default behaviour anyway (ie. lots to name).