posterior
Add convergence warnings
The current convergence warnings in our packages (e.g., in rstan and brms) are inconsistent and partly confusing to users. Examples are given in https://discourse.mc-stan.org/t/improve-warnings-for-low-ess/15355. Other examples include warnings about rhat or ess being NA, which is most likely caused by constant parameters; that is not a problem if the parameters are supposed to be constant (e.g., the correlation of a variable with itself).
We should move all of these warning messages to posterior, make them consistent, and improve their informativeness by mentioning the different reasons for the potential problems and pointing to more detailed documentation.
Yeah, at least for the rhat and ess warnings this makes sense. But presumably we wouldn't include the warnings about divergences, treedepth, and e-bfmi given that they are Stan-specific? Or did you want to broaden the scope of posterior to summarize sampler diagnostics?
I didn't plan on including algorithm-specific diagnostics. @avehtari initially suggested moving the general warnings to posterior (which I agree with). I would like to hear his take on the algorithm-specific diagnostics as well.
But presumably we wouldn't include the warnings about divergences, treedepth, and e-bfmi given that they are Stan-specific?
They are not Stan-specific, but NUTS / dynamic HMC algorithm-specific. It's a good question whether the posterior package would be the place for algorithm-specific diagnostics; there can be other inference packages using NUTS / dynamic HMC, too.
How is bayesplot going forward with posterior? bayesplot has algorithm-specific diagnostic plots, but is bayesplot going to use posterior inside? If so, would it then be natural to allow (some) posterior object to carry algorithm-specific information, too? Or would it be a Stan-specific object inherited from posterior with additional Stan-specific parts?
The algorithm-specific diagnostics should be improved, too. Again, there is the question of where the best place for advice would be, e.g., on what to do if any divergences are observed.
They are not Stan-specific, but NUTS / dynamic HMC algorithm-specific.
Yeah good point.
How is bayesplot going forward with posterior? bayesplot has algorithm-specific diagnostic plots, but is bayesplot going to use posterior inside? If so, would it then be natural to allow (some) posterior object to carry algorithm-specific information, too? Or would it be a Stan-specific object inherited from posterior with additional Stan-specific parts?
Yeah I'd like to use posterior inside bayesplot, so I agree it would definitely be nice if posterior handled some algorithm-specific information. I think it can do this without needing to know about Stan. Taking CmdStanR as an example, both fit$draws() and fit$sampler_diagnostics() return posterior draws_array objects:
```r
library(cmdstanr)
fit <- cmdstanr_example("schools")
str(fit$draws())
'draws_array' num [1:1000, 1:4, 1:11] -45.5 -45.5 -44.8 -50.1 -46.9 ...
 - attr(*, "dimnames")=List of 3
  ..$ iteration: chr [1:1000] "1" "2" "3" "4" ...
  ..$ chain    : chr [1:4] "1" "2" "3" "4"
  ..$ variable : chr [1:11] "lp__" "mu" "tau" "theta[1]" ...
str(fit$sampler_diagnostics())
'draws_array' num [1:1000, 1:4, 1:6] 0.6617 0.1201 0.3489 0.0429 1 ...
 - attr(*, "dimnames")=List of 3
  ..$ iteration: chr [1:1000] "1" "2" "3" "4" ...
  ..$ chain    : chr [1:4] "1" "2" "3" "4"
  ..$ variable : chr [1:6] "accept_stat__" "stepsize__" "treedepth__" "n_leapfrog__" ...
```
so posterior would just need to know how to handle draws objects containing variables like divergent__ and treedepth__, which other software besides Stan could also provide to posterior in draws format.
So maybe posterior just needs separate functions for diagnosing general MCMC and diagnosing particular algorithms, e.g.
```r
fit$draws() %>% posterior::diagnose_mcmc()
fit$sampler_diagnostics() %>% posterior::diagnose_nuts()
```
which then could be wrapped up into a single method inside interfaces if desired:
```r
fit$diagnose() # internally does the two lines above
```
Anyway, just brainstorming here. What do you all think?
The algorithm-specific diagnostics should be improved, too. Again, there is the question of where the best place for advice would be, e.g., on what to do if any divergences are observed.
I agree!
I agree. We should indeed handle those in posterior. One question I think we should discuss is how to name those columns. I know Stan uses `*__` for protected variables, whereas posterior uses `.*`. My question is whether we should align these naming conventions, at least within posterior itself.
Yeah, what if posterior allows both as input, but always uses .* internally?
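A minimal sketch of that normalization, assuming a hypothetical helper name (`normalize_reserved_names()` does not exist in posterior):

```r
# Hypothetical helper: accept Stan-style trailing "__" reserved names
# as input but normalize to posterior-style leading "." names internally.
normalize_reserved_names <- function(names) {
  is_stan_style <- grepl("__$", names)
  names[is_stan_style] <- paste0(".", sub("__$", "", names[is_stan_style]))
  names
}

normalize_reserved_names(c("divergent__", "treedepth__", "mu"))
# -> ".divergent" ".treedepth" "mu"
```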
@paul-buerkner @avehtari Any further thoughts on this topic? I've been meaning to think more about this but haven't really had the time yet unfortunately. Should we try to find a time to discuss this in more detail? For better or worse, this is the kind of functionality we should try to get (close to) right the first time because the diagnostic warnings will potentially affect a lot of users/packages.
We had a very good discussion about this with @paul-buerkner last week, and I hope he made notes afterwards. We could have a video call after Wednesday.
Ok great, sounds good
I made notes and will post them here in some better written form tomorrow.
Aki and I discussed how we can make the warning system modular to achieve the following things:
- Activate and deactivate warnings selectively and globally (via options)
- Change the thresholds after which warnings are issued, selectively and globally
This should allow both users and package developers building on top of posterior to adjust the warning system to their needs. We could think of a function creating a special S3 object class which controls all the warnings and thresholds. This could work a little like theme_set(theme(...)) in ggplot2, where we adjust warning info and then set it globally.
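For illustration, such a warning 'theme' could be a plain S3 object stored in a global option; all names here (`warning_theme()`, `warning_theme_set()`, the option name, and the threshold defaults) are hypothetical sketches, not existing posterior functions:

```r
# Hypothetical sketch of a warning 'theme' object, loosely analogous
# to ggplot2's theme()/theme_set(); nothing here exists in posterior.
warning_theme <- function(rhat_threshold = 1.01,
                          ess_threshold = 400,
                          warn_rhat = TRUE,
                          warn_ess = TRUE) {
  structure(
    list(
      rhat_threshold = rhat_threshold,
      ess_threshold = ess_threshold,
      warn_rhat = warn_rhat,
      warn_ess = warn_ess
    ),
    class = "warning_theme"
  )
}

# set the theme globally, mimicking theme_set(); returns the old theme
warning_theme_set <- function(theme) {
  old <- getOption("posterior.warning_theme")
  options(posterior.warning_theme = theme)
  invisible(old)
}

# e.g. raise the rhat threshold and disable ESS warnings globally
warning_theme_set(warning_theme(rhat_threshold = 1.05, warn_ess = FALSE))
```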
There are a few questions remaining:
- Where should the warnings themselves live? I see two options, but there may be more.
  - They can be in the convergence functions themselves (`rhat`, `bulk_ess`, etc.), but then it becomes hard to control warnings so that they are not printed for every parameter when many parameters are checked.
  - They can be in `summarize_draws`, but then we have to develop a framework to detect convergence functions among all passed summary functions and handle their output correctly. I feel that this is probably the better option, but I may be overlooking something.
- What level of control should we give users and developers over the content of the warnings?
- Should there only be an option to adjust warnings globally, or should the warning 'theme' object also be passable to, say, `summarize_draws`? From a development perspective, explicitly passing the warning object, at least as one option, seems like the preferable approach to me.
@avehtari feel free to add more if you feel I have missed anything.
Cool, thanks for sharing the notes.
Did you talk at all about whether or not to include HMC/NUTS-specific diagnostics? I think it would be good to include those so that all convergence warnings can be handled the same. But if we include those then how does that change the question of where the warnings should live? Right now summarize_draws() doesn't know anything about those other diagnostics like divergences and treedepth. Would we expand the input to summarize_draws() to also take in these extra sampler diagnostics? Should there be two separate functions summarize_draws() and diagnose_draws(), where the former doesn't do diagnostics at all and the latter handles all diagnostics? I don't know, just throwing out some ideas.
- What level of control should we be giving users and developers over the content of the warnings?
By control over content, do you mean whether they should be able to turn off certain warnings but keep others? Or something different?
- Should there only be an option to adjust warnings globally, or should the warning 'theme' object also be passable to, say, summarize_draws? From a development perspective, explicitly passing the warning object, at least as one option, seems like the preferable approach to me.
Yeah I like the idea of explicitly passing the warning object.
I think that having two separate methods, summarize_draws() and diagnose_draws(), would be good. The main reason is that for HMC/NUTS diagnostics we have per-draw info, such as divergences, that is not related to any particular variable, while the evaluation in summarize_draws() is per variable. In other words, a natural interface would differ, I think, which favours using different methods.
The per-variable diagnostics like rhat etc. can nicely be handled in summarize_draws() and I would argue it should stay that way. The main question is how to represent the HMC/NUTS diagnostics. I think that (a) we should store them in some (to-be) protected meta-columns, as we do with .log_weights, and (b) we need a specific method to evaluate them. This could be called diagnose_draws, but I think that name would be too general, given that other diagnostics would still be evaluated in summarize_draws(). Something like diagnose_hmc_draws() could be specific enough, I think.
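A rough sketch of what such a method could look like, assuming the per-draw diagnostics come in as an iteration x chain x variable array with Stan's `divergent__` / `treedepth__` column names (the function name `diagnose_hmc_draws()` and its return value are only proposals from this thread, not an existing API):

```r
# Hypothetical sketch: summarize per-draw HMC/NUTS diagnostics stored
# in columns named divergent__ and treedepth__ (Stan's convention).
# diagnose_hmc_draws() is a proposed name, not an existing function.
diagnose_hmc_draws <- function(diagnostics, max_treedepth = 10) {
  divergent <- as.vector(diagnostics[, , "divergent__"])
  treedepth <- as.vector(diagnostics[, , "treedepth__"])
  list(
    num_divergent     = sum(divergent),
    prop_divergent    = mean(divergent),
    num_max_treedepth = sum(treedepth >= max_treedepth)
  )
}
```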
In theory, we can control 3 things about the convergence warnings:
- which warnings to show at all in principle?
- what thresholds to use for the warnings which are in principle shown?
- what the content of the warning messages should be?
I think we all agree that points 1 and 2 should be controlled via the warning object. The question is whether we should give users or developers control over point 3 (warning message content). Right now, I think we should start with some fixed warning text, but I at least wanted to mention it so that we can discuss it if necessary.
@paul-buerkner summarized the discussion well. A couple of additional comments:
The main reason is that for HMC/NUTS diagnostics, we have per-draw info such as divergences etc. not related to any particular variable

This also holds for R*, which is a generic diagnostic.
- which warnings to show at all in principle?
- what thresholds to use for the warnings which are in principle shown?
- what the content of the warning messages should be?
1. I was also thinking that we should have a separate function for each diagnostic check (even if the diagnostic is pre-computed), so that package developers can pick which functions to call at which point. In addition, it's useful to have a function that runs a set of checks controlled via the warning object.

2+3. I was thinking that in the code we use variables for the warning thresholds and warning texts, and that all thresholds and texts are defined in one configuration file. That way it is easy to see all thresholds and texts at a glance, and editing (or translating) them does not require searching through all the code files. This would also be useful because we might want both a very short warning text and a more verbose one.
In addition to short and longer warning texts, we might also consider, for example for Rhat, that a short warning just says some Rhat is too high, while the additional information would give a loo khat-style table of how many variables exceed different thresholds.
Also, for each diagnostic check / warning we would have a link to a web page with more information on how to interpret the diagnostic warnings, what the next steps might be, etc.
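For example, such a configuration could live in a single named list; all names, thresholds, placeholders (glue-style `{n}` etc.), and texts below are hypothetical illustrations, not an agreed design:

```r
# Hypothetical sketch: all thresholds, warning texts, and doc links in
# one place, so editing or translating them does not require searching
# the code base. The "{n}"/"{threshold}" placeholders assume glue-style
# interpolation at the point where the warning is issued.
diagnostic_config <- list(
  rhat = list(
    threshold  = 1.01,
    text_short = "Some Rhat values are too high ({n} variables > {threshold}).",
    text_long  = paste(
      "{n} variables have Rhat > {threshold}, indicating that the chains",
      "have not mixed well. See {url} for details and recommendations."
    ),
    url = "https://mc-stan.org/misc/warnings.html"
  ),
  ess_bulk = list(
    threshold  = 400,
    text_short = "Some bulk-ESS values are too low ({n} variables < {threshold}).",
    url = "https://mc-stan.org/misc/warnings.html"
  )
)
```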
Just a quick note from our (@jgabry; @andrewgelman; @avehtari) call today. We decided that it would be good to have all detailed recommendations following warnings (e.g., about divergent transitions, rhat, etc.) on web pages rather than in the warning messages themselves. The latter would then just point to the web pages. This way, the warning interface becomes much simpler and we can update our recommendations more easily without going through the CRAN submission process.
Landed here after looking for existing issues requesting a diagnose draws feature, so commenting to lend support to diagnose_hmc_draws()
In thinking about the content of the output of diagnose_hmc_draws(), the broad outlines of what to report about divergences and e-FMI seem pretty straightforward, but treedepth less so.
- Stan's treedepth warnings are specific to however the user set `max_treedepth`.
- If a Stan model passes all other diagnostics but hits the `max_treedepth`, current Stan docs suggest that this isn't necessarily a concern. And this will be especially true if the user specifies a small `max_treedepth`.
- If the user specifies a huge `max_treedepth` and a huge treedepth is reached, maybe it would be a good idea to warn even if the largest realized treedepth is lower than `max_treedepth`. Say the user specifies `max_treedepth = 20` and reaches treedepths of 19.
- I'm not sure if it's even possible to tell from Stan CSV output whether `max_treedepth` was exceeded (i.e. the integration was halted prematurely) versus whether `max_treedepth` was merely reached (and the stopping criterion was triggered during the `max_treedepth` portion of the trajectory). Anybody wanna set me straight about this?
I'm not sure if it's even possible to tell from Stan CSV output whether max_treedepth was exceeded (i.e. the integration was halted prematurely) versus whether max_treedepth was merely reached (and the stopping criterion was triggered during the max_treedepth portion of the trajectory). Anybody wanna set me straight about this?
I don't think that this info is in the standard CSV output, but maybe in the diagnostic files if those are produced?
maybe in the diagnostic files if those are produced
But surely the idea here isn't to parse existing diagnostic files in order to... run diagnostics :P
But surely the idea here isn't to parse existing diagnostic files in order to... run diagnostics
I meant the diagnostic CSV files that are generated by cmdstan during sampling if you ask for them. They do not compute any summary diagnostics, but instead give you the full information about the trajectories, including rejected proposals, if I recall correctly.
I'm actually working on a workflow package that takes a different approach than is currently employed in cmdstanr, and I'd already been considering generating and parsing the diagnostic CSVs by default, which would make the pertinent info available to post-sampling packages like posterior. (Though n.b. the eventual goal is to run checks during sampling to enable termination iff passed.)
But another thought is: how does cmdstan's diagnose binary produce treedepth warnings without the diagnostic CSVs? Or am I misremembering that it does so?
But another thought is: how does cmdstan's diagnose binary produce treedepth warnings without the diagnostic CSVs?
CmdStan's bin/diagnose doesn't need the diagnostic files, just the standard CSV output files. The CSV comments contain the max depth option used (either specified by the user or the default).
I'd already been considering generating and parsing the diagnostic CSVs by default
I think it's good to have that option but it shouldn't be the default (of course it's up to you since you're doing it in your own package!). The extra diagnostic files basically double the size of what you're writing to disk, which is fine for small models but risky as a default. The information in them is also only useful if you need info about the hmc trajectories, which isn't never but it's not the case for a typical user (it's not even accessible from most interfaces).
which isn't never but it's not the case for a typical user
Although perhaps you're not intending your aria package for the "typical" user, more for power users like yourself?
The CSV comments contain the max depth option used
But as @jsocolar notes, couldn't you reach max treedepth without actually exceeding it? Or I guess the diagnostic is about % reaching not % exceeding?
Yeah I think it's % reaching, but it's been a while since I checked so I could be wrong.
The extra diagnostic files basically double the size of what you're writing to disk
Yup, I know; I'm also capturing all info and writing to a proper compressed binary format (netcdf4 with the InferenceData spec), so the files will be deleted when cmdstan is done with them. I'm also looking into hacks that either redirect the writing or straight delete content between cmdstan's writes, but that's longer term.
Interesting. I'll be curious to follow the progress!
Yeah I think it's % reaching, but it's been a while since I checked so I could be wrong.
Just confirmed this is true. Never realized this! One implication is that if a fraction even moderately less than 1 of transitions hit the default max_treedepth of 10, it is likely that very few or none of the transitions actually got terminated prematurely, contrary to what I think most users would infer from the existing warnings.
Wonder if we could model the step sizes with truncation to estimate the CDF and thereby % >= max? Anyone know if there's any theory on the expected distribution of step sizes for well-behaved chains?
For some of my models with a lot of parameters, treedepth has almost no variability, even if the step-size is right near the border between two treedepths. Check this out: https://discourse.mc-stan.org/t/issue-with-dual-averaging/5995/48