tidybayes
Correlation matrices in gather_draws
It's difficult to use `gather_draws` for correlation matrices, because the same variable name has to be used twice. E.g., `correlations_hat <- samples %>% gather_draws(correlation_matrix[series, series])` returns an error. It would be good to have a solution for this.
This is good timing --- I am currently working on some changes to make it easier to deal with matrices in spread_draws / gather_draws. Something I am currently implementing is nested columns for n-dimensional arrays (#154), though I can't quite tell from your question if that's what you're looking for.
To make sure I understand the use case so I can better support it, is `series` a factor in this case, and you would like to recover the factor level names in the output based on the name `series` (i.e. using `recover_types()`) but assign the results to different columns? I.e., do you get the desired format (except with incorrect names) by doing something like `gather_draws(correlation_matrix[series1, series2])`, or do you want a different format altogether (e.g. something like a list column with matrices inside)?
Thanks!
Your description of the use case and proposal are correct. This is a correlation matrix, so both rows and columns are the same factor. What I’d like to be able to do is specify the factor twice in the call to gather_draws, so the level names are properly recovered, but have it appear as two distinct columns in the resulting data frame. (Even better would be if tidybayes could recognize that it only needs to pull one triangle from the matrix, since it’s symmetric, but if both triangles are pulled that would be just peachy.)
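On the triangle point: once the row and column indices land in two distinct columns, one triangle can be kept after the fact by filtering the long-format output. Below is a minimal base R sketch on a hand-built data frame. The column names `row_series` and `col_series`, the series labels, and the values are all made up for illustration; this is not output from tidybayes.

```r
# Toy long-format table standing in for gathered correlation draws.
# Column names, labels, and values are illustrative assumptions only.
series_levels <- c("a", "b", "c")
corr_long <- expand.grid(
  row_series = factor(series_levels, levels = series_levels),
  col_series = factor(series_levels, levels = series_levels)
)
corr_long$.value <- ifelse(corr_long$row_series == corr_long$col_series, 1, 0.5)

# Keep only the strict upper triangle: row position below column position.
keep <- as.integer(corr_long$row_series) < as.integer(corr_long$col_series)
upper <- corr_long[keep, ]

nrow(upper)  # 3 pairs for 3 series: (a, b), (a, c), (b, c)
```

Comparing the integer codes works here only because both columns share the same factor levels in the same order.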
Okay, cool. Since `gather_draws` is using the names you provide in the call to determine the column names in the output (and also to automatically do joins if you ask for multiple parameters using the same index variable), I am a little reluctant to make `gather_draws(correlation_matrix[series, series])` work directly by renaming the `series` variable in some way. It feels a little bit like too much of a special case for this particular instance. I need to think more about whether or not there is a consistent rule for allowing multiple uses of the same name for an index variable that would not have other knock-on effects with other things that `spread_draws` and `gather_draws` do.
In the meantime, since the matching to back-translate the variable types (i.e., convert `series` back into a factor) is just done by name, one solution might be to make a call to `recover_types` that associates some additional name with the same factor levels as are in `series`, so that that name can also be used to back-convert values.
For example, I assume somewhere above the code snippet you had a call like this, where `df` is a data frame containing the column `series`:
samples %<>%
  recover_types(df)
`recover_types` can take any number of lists or data frames containing named variables to use in `spread_draws` or `gather_draws` to automatically back-convert types. So you could alias another variable (say `series2`) to `series` in order to have that variable back-translated into the same factor levels. You could do that along with the original `recover_types` call:
samples %<>%
  recover_types(
    df,
    select(df, series2 = series)
  )
Then use `series2` as follows:
correlations_hat = samples %>%
  gather_draws(correlation_matrix[series, series2])
Or if it is a one-off just for the calculation of `correlations_hat` and not used anywhere else, you could do it inline with your call to `gather_draws` instead of placing it with your first call to `recover_types`:
correlations_hat = samples %>%
  recover_types(select(df, series2 = series)) %>%
  gather_draws(correlation_matrix[series, series2])
One argument for the above might be clarity, in that the definition of the alias is made directly next to its use, which could make the intent clearer to a reader.
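To make the name-matching idea concrete, here is a rough base R sketch of why an alias is enough. The helper `back_convert()`, the `prototypes` list, and the toy data are hypothetical and written only for illustration; this is not how tidybayes implements `recover_types` internally.

```r
# Hypothetical helper: turn integer index columns back into factors by
# matching column names against a named list of prototype factors.
# Illustrates name-based type recovery; not tidybayes internals.
back_convert <- function(draws, prototypes) {
  for (name in intersect(names(draws), names(prototypes))) {
    lv <- levels(prototypes[[name]])
    draws[[name]] <- factor(lv[draws[[name]]], levels = lv)
  }
  draws
}

# Original data with the factor of interest (toy labels).
df <- data.frame(series = factor(c("gdp", "cpi", "rate")))

# Raw draws refer to matrix rows and columns by integer position.
raw <- data.frame(series = c(1L, 2L), series2 = c(3L, 1L), .value = c(0.4, -0.2))

# Registering series2 as an alias of series gives both columns the same levels.
prototypes <- c(as.list(df), list(series2 = df$series))
recovered <- back_convert(raw, prototypes)

levels(recovered$series2)  # same levels as df$series
```

Without the `series2` entry in `prototypes`, the second column would simply stay an integer, which is the situation the alias avoids.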
Does that help?
That would work, and it does help!
I would suggest, however, that this is a special case that occurs often. Really, it's any time you want to recover a correlation or covariance matrix. So it's something you may want to consider supporting specifically in the future.
Great, glad that helps!
W.r.t. coming up with a solution that doesn't require the extra call to `recover_types`, one option might be to make it so that duplicated column names (after the first one) are automatically renamed, e.g. to have `gather_draws(correlation_matrix[series, series])` back-convert using the provided name but automatically rename the second one to `series.1` or something like that (then `series.2`, etc, if multiple indices with the same name are provided). That would allow joins to continue working correctly, so that something like `gather_draws(x[series], correlation_matrix[series,series])` would give reasonable output (draws in the same row as `x[i]` would be paired with `correlation_matrix[i, ...]`). The limitation there is that there's then no obvious way to specify the opposite behavior (pairing `x[i]` with `correlation_matrix[..., i]`, for example) without going back to the manual `recover_types` version, which makes this solution feel clumsy to me.
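For reference, the "rename all but the first use" rule happens to match what base R's `make.unique()` does with its default separator. The snippet below only illustrates the naming scheme on plain character vectors; it is a sketch of the rule, not the tidybayes implementation.

```r
# Index names as they would appear in correlation_matrix[series, series].
index_names <- c("series", "series")

# make.unique() leaves the first occurrence alone and suffixes later ones.
renamed <- make.unique(index_names)
renamed
# "series" "series.1"

# With three repeated indices the suffixes keep counting up.
renamed_three <- make.unique(c("i", "i", "i"))
renamed_three
# "i" "i.1" "i.2"
```

Because the first name is left untouched, a column called `series` still exists to join against other variables indexed by `series`.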
Another approach would be to make the automatic renaming assign the first `series` to `series.1` and the second to `series.2`, but then `gather_draws(x[series], correlation_matrix[series,series])` would break in a way that might be confusing to users because `correlation_matrix[series,series]` no longer generates a column with the name `series`. So you would have to do something like `gather_draws(x[series.1], correlation_matrix[series,series])`, which feels neither elegant nor transparent.
Another option might be an inline renaming mechanism within gather/spread_draws, but I think this would end up not being much more compact than the method using `recover_types` I suggested above, so I am a little reluctant to add that.
A final option would be to tweak `recover_types` to make inline renaming easier with it, e.g. by allowing syntax like this:
correlations_hat = samples %>%
  recover_types(series2 = series) %>%
  gather_draws(correlation_matrix[series, series2])
That would be a relatively straightforward change, and would match up with how variables can be re-used in context in `compose_data`, so it has some desirable properties from the perspective of API consistency.
Would be curious what you think about any of the above proposals.
I think you only want to rename the second usage. So if someone calls gather_draws(alpha[series], beta[series], Sigma[series, series]), what you get is a data frame of chain, iteration, etc., series, series.1.
Makes sense. Okay, perhaps a sensible solution is approach 1 + approach 4 to make more flexible cases possible through explicit renaming.
So TODO:
- [ ] make repeat uses of the same index within a single variable auto-rename all but the first use. `x[i,i,i,...]` -> `x`, `i`, `i.1`, `i.2`, ...
- [ ] allow `compose_data`-like re-use of already-defined aliases within `recover_types`.
- [ ] might need to look at creating a more explicit class for type constructors to make this work