Factor rvars

mjskay opened this pull request 3 years ago

Summary

This is a draft PR adding factor-like rvars, for #149.

The basic idea is that the backing array can be a factor or ordered factor with a "levels" attribute. These are auto-detected when objects with a "levels" attribute are passed to rvar(), or can be created explicitly with rvar_factor(), rvar_ordered(), as_rvar_factor(), or as_rvar_ordered().

Simple example:

x = rvar(array(c("a","a","b","c","d","d","d","a"), dim = c(4,2)))
x
#> rvar_factor<4>[2] mode <entropy>:
#> [1] a <1.04>  d <0.56> 
#> 4 levels: a b c d

Currently the summary shown in the vector / array layouts is the modal category with entropy. Entropy was the best single-number summary of uncertainty in categorical distributions I could think of, but I welcome other suggestions. I'm also not sure what the best format is (the "±" format might make less sense for entropy; e.g. maybe for entropy we put it in ( ) or < > or something).
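For concreteness, here is a back-of-the-envelope version of that summary for x[1] above, whose draws are c("a", "a", "b", "c"). This is just to illustrate what the printed numbers mean; the summary functions used internally may differ in detail:

draws <- c("a", "a", "b", "c")   # the four draws of x[1]
p <- prop.table(table(draws))    # category proportions: a = 0.5, b = 0.25, c = 0.25
names(p)[which.max(p)]           # modal category: "a"
-sum(p * log(p))                 # Shannon entropy in nats: ~1.04, the <1.04> printed above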

As a quick check, the auto-detection works on the output of posterior_predict() for models from both rstanarm::stan_polr() (which uses strings) and brms::brm() (which uses an array with a "levels" attribute).

E.g.:

library(brms)

df <- esoph
m <- brm(tobgp ~ agegp, data = df, family = cumulative, seed = 12345, backend = "cmdstanr")

df$y_rep = rvar(posterior_predict(m))
head(df, n = 15)
#>    agegp     alcgp    tobgp ncases ncontrols          y_rep
#> 1  25-34 0-39g/day 0-9g/day      0        40    10-19 <1.4>
#> 2  25-34 0-39g/day    10-19      0        10    10-19 <1.4>
#> 3  25-34 0-39g/day    20-29      0         6    10-19 <1.4>
#> 4  25-34 0-39g/day      30+      0         5 0-9g/day <1.4>
#> 5  25-34     40-79 0-9g/day      0        27    10-19 <1.4>
#> 6  25-34     40-79    10-19      0         7    10-19 <1.4>
#> 7  25-34     40-79    20-29      0         4    10-19 <1.4>
#> 8  25-34     40-79      30+      0         7    10-19 <1.4>
#> 9  25-34    80-119 0-9g/day      0         2    10-19 <1.4>
#> 10 25-34    80-119    10-19      0         1    10-19 <1.4>
#> 11 25-34    80-119      30+      0         2    10-19 <1.4>
#> 12 25-34      120+ 0-9g/day      0         1    10-19 <1.4>
#> 13 25-34      120+    10-19      1         0    10-19 <1.4>
#> 14 25-34      120+    20-29      0         1    10-19 <1.4>
#> 15 25-34      120+      30+      0         2 0-9g/day <1.4>

As this is an early draft, there are still lots of corners to smooth and test, but I'd love feedback on a few things:

  • If anyone using categorical models can give this a test drive that'd be useful.
  • Thoughts on the printing output? (see discussion above)
  • What should we do about conversion to/from other draws formats?
  • Any amount of kicking the tires is much appreciated. (N.B. I haven't gotten to conversion functions between draws formats with factors, but I think everything else has a draft implementation).
  • Currently these things probably won't be that easy to visualize. {ggdist} will get some kind of support for them but I haven't figured that out yet.

Copyright and Licensing

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

  • Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
  • Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

mjskay avatar Oct 09 '22 05:10 mjskay

Thank you for working on this! I don't have much input right now but at least a few thoughts.

  • I like the mode <entropy> printing.
  • Do we want to support categorical variables more generally in all formats? If yes, it could make sense to design a consistent interface across all at this point. Said differently, could there be any problems we run into later in other formats if we only implement categorical variables in rvars now?

paul-buerkner avatar Oct 14 '22 09:10 paul-buerkner

I like the mode <entropy> printing.

works for me

Do we want to support categorical variables more generally in all formats? If yes, it could make sense to design a consistent interface across all at this point. Said differently, could there be any problems we run into later in other formats if we only implement categorical variables in rvars now?

Thinking about the different formats:

  • draws_list and draws_df should have no problems with categorical variables, as they could just use factor and ordered types for those variables.
  • draws_matrix and draws_array are harder, as every variable in an array/matrix must have the same type. One option is to make conversion to them lossy by changing factors to integers. Another option would be to store a list of levels for each variable as an attribute of the draws object (e.g. similar to the way that dimnames are stored). But we would also need to store whether they are ordered or not, and possibly handle modifications of the variables too... ultimately it might amount to reimplementing the factor type and not be worth it.

I think in the short term we could add support for factors to draws_list and draws_df easily, and go with the lossy-cast-to-integer approach for the other two formats. Then we can decide later if we want to deal with the complexity of adding support to them. I think this should be future-proof...?
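To make the lossy option concrete, here is a minimal base-R sketch of what the cast-to-integer idea does (the variable here is made up for illustration); the level labels, and whether the factor was ordered, are exactly what gets dropped:

y <- factor(c("low", "low", "high", "mid"), levels = c("low", "mid", "high"))
as.integer(y)   # the labels "low"/"mid"/"high" are gone after the cast
#> [1] 1 1 3 2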

mjskay avatar Oct 15 '22 02:10 mjskay

Matthew, we chatted on Twitter about possible ways to summarize information on posterior distributions of categorical variables. In particular, I wondered if the entropy reported for the posterior distribution of a categorical variable was computed differently depending on whether the categorical variable was nominal or ordinal. (It didn't seem like it was.)

In doing some more digging about this, I found an R package called agrmt (agrmt: Calculate Concentration and Dispersion in Ordered Rating Scales) which purports to do the following:

Calculates concentration and dispersion in ordered rating scales. It implements various measures of concentration and dispersion to describe what researchers variably call agreement, concentration, consensus, dispersion, or polarization among respondents in ordered data. It also implements other related measures to classify distributions. In addition to a generic city-block based concentration measure and a generic dispersion measure, the package implements various measures, including van der Eijk's (2001) <doi:10.1023/A:1010374114305> measure of agreement A, measures of concentration by Leik, Tastle and Wierman, Blair and Lacy, Kvalseth, Berry and Mielke, Reardon, and Garcia-Montalvo and Reynal-Querol. Furthermore, the package provides an implementation of Galtung's AJUS-system to classify distributions, as well as a function to identify the position of multiple modes.

On the surface of it, it sounds like this package should give you some more options in terms of describing posterior distributions of ordinal categorical variables.

Here's some R code I put together to get you started - it is based on the package vignette available at https://cran.r-project.org/web/packages/agrmt/vignettes/agrmt.pdf and on the help files available for this package. In the code, I used a frequency vector mentioned at the beginning of the vignette, frequencies <- c(10, 20, 30, 15, 4), which corresponds to the following 5-category bar plot:

[bar plot of the 5-category frequency vector]


library(agrmt)

categories <- c(1, 2, 3, 4, 5)
frequencies <- c(10, 20, 30, 15, 4)

freqtable <- data.frame(categories, frequencies)

freqtable$categories <- factor(freqtable$categories, ordered = TRUE)

library(ggplot2) 
library(magrittr)
theme_set(theme_bw())

plot_freqtable <- 
freqtable %>% 
  ggplot() + 
  aes(x = categories, y = frequencies, fill = categories) + 
  geom_col() +
  labs(x = "Category", y = "Frequency") + 
  theme(legend.position="none") 

plot_freqtable

IMPORTANT NOTE:

Some functions in the agrmt package require frequency vectors, while others require standardized frequency vectors, that is, the frequencies expressed as proportions.

Where standardized frequency vectors are required, frequency vectors that do not sum to 1 (i.e. that are not standardized) are automatically standardized by dividing each element of the frequency vector by the sum of the vector. This assumes complete data, a general assumption of the measures in this package.
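In other words, the standardization step is just dividing by the total:

frequencies <- c(10, 20, 30, 15, 4)
round(frequencies / sum(frequencies), 3)
#> [1] 0.127 0.253 0.380 0.190 0.051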

#==============================================================
# 1) Agreement: Levels of agreement range from -1 to 1.
# Agreement is only defined if there are at least three response categories,
# and it does not tell you which of the categories is the most common one.
# For this reason, it is also advisable to look at an appropriate measure of
# central tendency, such as the interpolated median.
#==============================================================

# To calculate agreement, we simply call agreement() with the frequency vector as
# the argument. There is normally no reason to use the old algorithm (the one
# originally described by Van der Eijk), which can be requested with old = TRUE.

help(agreement, package = "agrmt") 

library(dplyr)

agreement_freqtable <- 
freqtable %>% 
  pull(frequencies) %>% 
  agreement() %>% 
  round(2)

agreement_freqtable  # 0.39

plot_freqtable + 
  ggtitle(paste("Agreement:", agreement_freqtable))
  

#==============================================================
# 2) Polarization
#
# The agreement scores are suitable to express polarization, but the numbers are
# not intuitive to interpret. The function polarization offers easily interpretable
# values by re-scaling agreement A. More precisely, it gives (1 − agreement)/2.
# This means that a polarization value of 1 means perfect polarization, 
# and a value of 0 means perfect agreement. A value of 0.5 corresponds to  
# ‘no agreement’.
#==============================================================

polarization_freqtable <- 
  freqtable %>% 
  pull(frequencies) %>% 
  polarization() %>% 
  round(2)

polarization_freqtable  # 0.30

#==============================================================
# 3) Leik’s Consensus (measure of ordinal dispersion):
# 
# If all observations are in the same category, ordinal dispersion is 0. 
# With half the observations in one extreme category, and half the observations 
# in the other extreme, Leik’s measure gives a value of 1.
#
# Leik’s ordinal dispersion measure is a percentage, and can be interpreted
# accordingly. [Important Note: R reports it as a proportion, not as a percentage!]
# 
# Ordinal dispersion can be used to express consensus or agreement, 
# simply by taking: 
#    1 - ordinal dispersion.
#==============================================================

help(Leik, package = "agrmt")

leik_freqtable <- 
  freqtable %>% 
  pull(frequencies) %>% 
  Leik() %>% 
  round(2)

leik_freqtable  # 0.40

#==============================================================
# 4) Tastle and Wierman's Consensus (computed using the Shannon entropy as the basis)
#
#  It is 1 if there is perfect unanimity (all observations in the same category);
#  it is 0 if there is perfect bimodality (= lack of agreement).
#==============================================================

help(consensus, package = "agrmt")

consensus_freqtable <- 
  freqtable %>% 
  pull(frequencies) %>% 
  consensus() %>% 
  round(2)

consensus_freqtable  # 0.62

#==============================================================
# 5) Entropy
# 
# The Shannon entropy can be calculated using the entropy() function. 
# This function follows TW (i.e., Tastle and Wierman) and ignores categories
# with zero observations. 
#
# The Shannon entropy ignores the order of categories; use the consensus() 
# function to consider the order of categories.
#==============================================================

help(entropy, package = "agrmt")

entropy_freqtable <- 
  freqtable %>% 
  pull(frequencies) %>% 
  entropy() %>% 
  round(2)

entropy_freqtable  # 2.08


#==============================================================
# Aside:   We can also classify a distribution using the AJUS-system introduced by Galtung (1969), 
# with the aim of reducing complexity. 
# 
# Distributions are classified as A if they are unimodal with a peak in the centre, 
# as J if they are unimodal with a peak at either end, 
# as U if they are bimodal with a peak at both ends, and as S if they are multimodal. 
# 
# In addition to Galtung's classification, the function classifies distributions 
# as F if there is no peak and all values are more or less the same (flat). 
# 
# Furthermore, a distinction is drawn between J and L distributions, 
# depending on whether they increase or decrease: J types have a peak on the right, 
# L types have the peak on the left. The skew is given as +1 for a positive skew, 
# as 0 for no skew, and -1 for a negative skew.
# 
# The skew is identified by comparing the sum of values left and right of the midpoint 
# respectively. For J-type of distributions, the skew is identified on the basis of the 
# changes between values. This way, long tails cannot influence the skew, 
# and a single peak at the left and right-hand end can be differentiated in all cases.
# 
#==============================================================

help(ajus, package = "agrmt")

freqtable %>% 
  pull(frequencies) %>% 
  ajus() 

My thinking is that the user could be given a menu of possibilities for describing the posterior distribution corresponding to an ordinal categorical variable via some of the functions in the agrmt package: agreement(), polarization(), Leik(), consensus(), entropy(), etc. (Note that the package includes other functions - I haven't explored them all.)

Thanks, Matthew - I hope some of the above will be helpful to you.

Isabella

isabellaghement avatar Oct 19 '22 17:10 isabellaghement

Thank you @isabellaghement for checking out all these options!

paul-buerkner avatar Oct 20 '22 10:10 paul-buerkner

Thanks @isabellaghement, this is very useful!

I think we can avoid a dep on {agrmt} at least for the default printing since the implementations of these are very straightforward, and I don't think we need a bunch of different measures of dispersion just for the rvar printing function (for continuous dists we only provide SD and MAD, for example). For print.rvar() I'd rather choose one or two reasonable defaults, then if people want more dispersion measures they can calculate them using summarise_draws(), not by setting the summary argument of print.rvar() (which is always awkward to use anyway, since people rarely call print() directly).

After reading through these examples and looking at Tastle and Wierman, their "dissention" measure seems pretty reasonable to me: it's based on entropy, and ranges from 0 (all probability in one category) through roughly 0.5 (uniform) to 1 (all probability split evenly between the first and last categories). So we could use that as a default for ordinal factors and entropy for categorical ones, and if people want something else they can calculate it with summarise_draws().
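For reference, here is a rough sketch of the dissention calculation as I read it from Tastle and Wierman; the eventual implementation in this PR may differ in details such as how empty categories or non-integer category scores are handled:

# Tastle & Wierman: consensus = 1 + sum(p * log2(1 - abs(x - mu) / width)),
# where mu is the probability-weighted mean of the category scores x and
# width = max(x) - min(x); dissention = 1 - consensus.
dissention <- function(p, x = seq_along(p)) {
  p <- p / sum(p)
  mu <- sum(p * x)
  width <- diff(range(x))
  terms <- ifelse(p > 0, p * log2(1 - abs(x - mu) / width), 0)  # skip empty categories
  1 - (1 + sum(terms))
}

dissention(c(1, 0, 0, 0, 0))      # 0: all probability in one category
dissention(rep(1, 5) / 5)         # ~0.57: roughly 0.5 for a uniform distribution
dissention(c(0.5, 0, 0, 0, 0.5))  # 1: probability split evenly between the extremes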

mjskay avatar Oct 23 '22 01:10 mjskay

Thinking about summary functions raises another question for @paul-buerkner: do we want the entropy() and dissent() functions to be exported so that people can use them in summarise_draws()? I suppose we may also want to export a version of .mode() (perhaps one that errors if applied to numeric data, and with a different name... something like modal_category() or modal_level() but maybe less wordy).
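If they do end up exported, I imagine usage would look roughly like the sketch below (the names are just the ones proposed here, not a settled API, and exact signatures may differ):

# Hypothetical usage with summarise_draws(), assuming modal_category(),
# entropy(), and dissent() get exported:
x <- ordered(sample(c("low", "mid", "high"), 4000, replace = TRUE),
             levels = c("low", "mid", "high"))
d <- as_draws_rvars(list(rating = rvar(x)))   # rvar() auto-detects the "levels" attribute
summarise_draws(d, modal_category, entropy, dissent)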

mjskay avatar Oct 23 '22 20:10 mjskay

FYI this now has pretty comprehensive tests. All that remains is for me to update the rvar vignette and to make any adjustments based on folks' feedback.

Re: conversions to other formats, the current code supports conversion to/from draws_df and draws_list and will keep discrete variables as factors or ordereds in those formats. When converting to draws_matrix or draws_array, the conversion is "lossy" and will result in numeric variables. Not sure if there's a better solution than that for now. Could raise a warning in those cases if desired?

mjskay avatar Oct 23 '22 20:10 mjskay

I think exporting these functions sounds good. I don't mind wordy names, so I am happy with both modal_category and modal_level, perhaps with a slight preference for the former. What would you prefer?

With regard to conversion, it should be possible to support categorical-only draws_array and draws_matrix, right? Only if we mixed categorical and numerical variables would the conversion be lossy (and then trigger a warning).

paul-buerkner avatar Oct 24 '22 22:10 paul-buerkner

I think exporting these functions sounds good. I don't mind wordy names, so I am happy with both modal_category and modal_level, perhaps with a slight preference for the former. What would you prefer?

modal_category works. TODO for me:

  • [x] export and document modal_category, entropy, dissent

With regard to conversion, it should be possible to support categorical-only draws_array and draws_matrix, right? Only if we mixed categorical and numerical variables would the conversion be lossy (and then trigger a warning).

Hmm, fair point. It could get a little hairy since we'd probably have to apply a class (or at least an attribute) to remember that the arrays are factors or ordereds... and base-R factor() and ordered() typically do not work well as arrays (they tend to lose their dimensions). I can think about it and/or try something out to see what might work.
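As a quick base-R illustration of the dimension problem (nothing posterior-specific):

x <- array(c("a", "b", "a", "c"), dim = c(2, 2))
f <- factor(x)   # factor() flattens the array
dim(f)
#> NULL           # the dim attribute is gone
levels(f)
#> [1] "a" "b" "c"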

mjskay avatar Oct 25 '22 01:10 mjskay

With regard to conversion, it should be possible to support categorical-only draws_array and draws_matrix, right? Only if we mixed categorical and numerical variables would the conversion be lossy (and then trigger a warning).

Hmm, fair point. It could get a little hairy since we'd probably have to apply a class (or at least an attribute) to remember that the arrays are factors or ordereds... and base-R factor() and ordered() typically do not work well as arrays (they tend to lose their dimensions). I can think about it and/or try something out to see what might work.

I've come to the conclusion that this isn't really worth it (I added rationale to the top comment). For now, I have gone with lossy conversion + a warning. Long term, we may want a way of storing type information for each variable in a draws_array/matrix, which would allow us to address this issue.

I believe this PR has now reached a "ready" state. I have updated the summary comment to be more comprehensive, as well as NEWS.md. Feedback welcome!

mjskay avatar Nov 24 '22 04:11 mjskay

The factor_rvar branch of {ggdist} now has experimental support for rvar_factors as well. Simple example of brms + posterior + ggdist with an ordinal model:

library(brms)
library(posterior)
library(ggdist)

df <- esoph
m <- brm(tobgp ~ agegp, data = df, family = cumulative, seed = 12345, backend = "cmdstanr")

grid <- unique(df[,"agegp", drop = FALSE])
grid$pred <- rvar_ordered(posterior_predict(m, newdata = grid))

grid |>
  ggplot(aes(x = agegp, ydist = pred)) +
  stat_slab()

[plot: predicted tobgp distributions by agegp]

Relevant issue over there is: https://github.com/mjskay/ggdist/issues/108

mjskay avatar Nov 25 '22 01:11 mjskay

This PR looks already really good, up to a few minor things, mostly style related.

Thanks for the really thorough review! I've addressed everything except two comments (see those replies).

Some of the rvar internals remain a bit of black magic to me, so I don't understand every detail of it, but I trust you and the tests you put in place on that one.

Heh, yeah, I agree it is pretty arcane. Getting rvar to mimic everything it does so far has been quite a journey (base R datatypes are... certainly something!), and I should probably clean up and document the rvar internals a bit more with everything I've had to figure out to get it to work (plus I sometimes forget when I've been away from it a bit).

mjskay avatar Nov 29 '22 04:11 mjskay

Thank you for your changes and explanations of what you want to keep the way it is. It all makes sense to me. I want to give people a bit more time to check out the new features and perhaps add a review. So I will put a reminder in my calendar for the start of next week to merge this PR if nothing new has come up by then. Does that sound good to you?

paul-buerkner avatar Nov 29 '22 12:11 paul-buerkner

Sounds great, thanks!

Should we ping some folks who might be able to provide feedback? Off the top of my head: @isabellaghement, @bwiernik, @jgabry, @avehtari? If any of you have a chance to "kick the tires" of rvar_factor() / rvar_ordered() on this branch sometime in the next week, it would be much appreciated (of course, no pressure!).

mjskay avatar Nov 29 '22 18:11 mjskay

pinging @TeemuSailynoja as he has recently been running some categorical and ordinal models

avehtari avatar Nov 29 '22 18:11 avehtari

I'll try to run some tests with it this weekend.

bwiernik avatar Nov 29 '22 18:11 bwiernik

@mjskay From my side, this is ready to merge. Anything you want to change before I merge?

paul-buerkner avatar Dec 07 '22 09:12 paul-buerkner

Sorry for the delayed response - yes, I think this is good for now. I've built some stuff in ggdist on top of it and haven't run into anything new that needs changing, so that's the best I can say at the moment without other folks testing it. Thanks!

mjskay avatar Dec 20 '22 17:12 mjskay

Lovely, merging now. Thank you again for your enormous effort to make this feature possible!

paul-buerkner avatar Dec 20 '22 18:12 paul-buerkner

Thanks!

mjskay avatar Dec 20 '22 23:12 mjskay