tidybayes

Correlation matrices in gather_draws

Open elbamos opened this issue 6 years ago • 7 comments

It's difficult to use gather_draws for correlation matrices, because the same index variable name has to be used twice.

E.g., correlations_hat <- samples %>% gather_draws(correlation_matrix[series, series]) returns an error. It would be good to have a solution for this.

elbamos avatar Feb 09 '19 10:02 elbamos

This is good timing --- I am currently working on some changes to make it easier to deal with matrices in spread_draws / gather_draws. Something I am currently implementing is nested columns for n-dimensional arrays (#154), though I can't quite tell from your question if that's what you're looking for.

To make sure I understand the use case so I can better support it, is series a factor in this case, and you would like to recover the factor level names in the output based on the name series (i.e. using recover_types()) but assign the results to different columns? I.e., do you get the desired format (except with incorrect names) by doing something like gather_draws(correlation_matrix[series1, series2]), or do you want a different format altogether (e.g. something like a list column with matrices inside)?

mjskay avatar Mar 03 '19 05:03 mjskay

Thanks!

Your description of the use case and proposal are correct. This is a correlation matrix, so both rows and columns are the same factor. What I’d like to be able to do is specify the factor twice in the call to gather_draws, so the level names are properly recovered, but have it appear as two distinct columns in the resulting data frame. (Even better would be if tidybayes could recognize that it only needs to pull one triangle from the matrix, since it’s symmetric, but if both triangles are pulled that would be just peachy.)

elbamos avatar Mar 03 '19 06:03 elbamos

Okay, cool. Since gather_draws uses the names you provide in the call to determine the column names in the output (and also to automatically do joins if you ask for multiple parameters using the same index variable), I am a little reluctant to make gather_draws(correlation_matrix[series, series]) work directly by renaming the series variable in some way. It feels a little bit like too much of a special case for this particular instance. I need to think more about whether or not there is a consistent rule for allowing multiple uses of the same name for an index variable that would not have knock-on effects on other things that spread_draws and gather_draws do.

In the meantime, since the matching used to back-translate the variable types (i.e., convert series back into a factor) is just done by name, one solution might be to make a call to recover_types that associates an additional name with the same factor levels as series, so that the new name can also be used to back-convert values.

For example, I assume somewhere above the code snippet you had a call like this, where df is a dataframe containing the column series:

samples %<>%
  recover_types(df)

recover_types can take any number of lists or data frames containing named variables to use in spread_draws or gather_draws to automatically back-convert types. So you could alias another variable (say series2) to series in order to have that variable back-translated into the same factor levels. You could do that along with the original recover_types call:

samples %<>%
  recover_types(
    df,
    select(df, series2 = series)
  )

Then use series2 as follows:

correlations_hat = samples %>%
  gather_draws(correlation_matrix[series, series2])

Or if it is a one-off just for the calculation of correlations_hat and not used anywhere else, you could do it inline with your call to gather_draws instead of placing it with your first call to recover_types:

correlations_hat = samples %>%
  recover_types(select(df, series2 = series)) %>%
  gather_draws(correlation_matrix[series, series2])

One argument for the inline version might be clarity: the alias is defined directly next to its use, which could make the intent clearer to a reader.
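Regarding the earlier wish to pull only one triangle of the symmetric matrix: once the two indices sit in separate columns, that can be done as a post-processing step. The following is only a sketch in base R. The data frame draws is a made-up stand-in for the output of gather_draws(correlation_matrix[series, series2]), with invented levels and values, and it assumes both index columns come back as factors with the same level order.

```r
# Toy stand-in for gather_draws(correlation_matrix[series, series2]) output.
# Levels and values are invented for illustration only.
lv <- c("a", "b", "c")
draws <- expand.grid(
  series  = factor(lv, levels = lv),
  series2 = factor(lv, levels = lv),
  .draw   = 1:2
)
draws$.value <- ifelse(
  as.character(draws$series) == as.character(draws$series2), 1, 0.5
)

# Keep the strict upper triangle: row level index < column level index.
upper <- draws[as.integer(draws$series) < as.integer(draws$series2), ]
```

Comparing the integer codes of the two factors only makes sense because both columns share one set of levels, which is what the recover_types alias is meant to provide.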

Does that help?

mjskay avatar Mar 09 '19 01:03 mjskay

That would work, and it does help!

I would suggest, however, that this is a special case that occurs often. Really, it's any time you want to recover a correlation or covariance matrix. So it's something you may want to consider supporting specifically in the future.

elbamos avatar Mar 09 '19 01:03 elbamos

Great, glad that helps!

W.r.t. coming up with a solution that doesn't require the extra call to recover_types, one option might be to make it so that duplicated column names (after the first one) are automatically renamed, e.g. to have gather_draws(correlation_matrix[series, series]) back-convert using the provided name but automatically rename the second one to series.1 or something like that (then series.2, etc., if multiple indices with the same name are provided). That would allow joins to continue working correctly, so that something like gather_draws(x[series], correlation_matrix[series,series]) would give reasonable output (draws in the same row as x[i] would be paired with correlation_matrix[i, ...]). The limitation there is that there's then no obvious way to specify the opposite behavior (pairing x[i] with correlation_matrix[..., i], for example) without going back to the manual recover_types version, which makes this solution feel clumsy to me.
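To make the pairing concrete, here is a rough base-R sketch of the join this option would imply. Everything in it is hypothetical: series.1 is only the proposed column name, the two data frames are wide toy tables (one column per variable, closer to spread_draws output than gather_draws output), and merge() stands in for whatever join tidybayes would do internally.

```r
# Hypothetical toy tables; none of this is current tidybayes output.
lv <- c("a", "b")
x_draws <- data.frame(series = factor(lv, levels = lv), x = c(10, 20))

corr_draws <- expand.grid(
  series   = factor(lv, levels = lv),  # first index keeps its name
  series.1 = factor(lv, levels = lv)   # proposed auto-renamed second index
)
corr_draws$correlation_matrix <- c(1, 0.3, 0.3, 1)

# Joining on the shared name pairs x[i] with correlation_matrix[i, ...].
joined <- merge(x_draws, corr_draws, by = "series")
```

Because only series is shared between the two tables, the join can only pair on the first index; getting the correlation_matrix[..., i] pairing would need an explicit rename.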

Another approach would be to make the automatic renaming assign the first series to series.1 and the second to series.2, but then gather_draws(x[series], correlation_matrix[series,series]) would break in a way that might be confusing to users because correlation_matrix[series,series] no longer generates a column with the name series. So you would have to do something like gather_draws(x[series.1], correlation_matrix[series,series]), which feels neither elegant nor transparent.

Another option might be an inline renaming mechanism within gather/spread_draws, but I think this would end up not being much more compact than the method using recover_types I suggested above, so I am a little reluctant to add that.

A final option would be to tweak recover_types to make inline renaming easier with it, e.g. by allowing syntax like this:

correlations_hat = samples %>%
  recover_types(series2 = series) %>%
  gather_draws(correlation_matrix[series, series2])

That would be a relatively straightforward change, and would match up with how variables can be re-used in context in compose_data, so it has some desirable properties from the perspective of API consistency.

Would be curious what you think about any of the above proposals.

mjskay avatar Mar 09 '19 02:03 mjskay

I think you only want to rename the second usage. So if someone calls gather_draws(alpha[series], beta[series], Sigma[series, series]), what you get is a data frame of chain, iteration, etc., series, series.1.

elbamos avatar Mar 09 '19 02:03 elbamos

Makes sense. Okay, perhaps a sensible solution is approach 1 + approach 4 to make more flexible cases possible through explicit renaming.

So TODO:

  • [ ] make repeat uses of the same index within a single variable auto-rename all but the first use. x[i,i,i,...] -> x, i, i.1, i.2, ...
  • [ ] allow compose_data-like re-use of already-defined aliases within recover_types.
    • [ ] might need to look at creating a more explicit class for type constructors to make this work
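As a note on the first item, the naming rule x[i,i,i,...] -> i, i.1, i.2 happens to match what base R's make.unique() produces, so a minimal sketch of the renaming step could look like the following. The helper name auto_rename_indices is made up for illustration and is not part of tidybayes.

```r
# Hypothetical helper: rename repeated index names, keeping the first as-is.
auto_rename_indices <- function(indices) {
  make.unique(indices, sep = ".")
}

auto_rename_indices(c("series", "series"))
auto_rename_indices(c("i", "i", "i"))
```

Names that are not repeated pass through unchanged, so an index list like c("series", "j", "series") would only have its last element renamed.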

mjskay avatar Mar 09 '19 03:03 mjskay