ss3sim icon indicating copy to clipboard operation
ss3sim copied to clipboard

Guidance on troubleshooting poor convergence

Open Cole-Monnahan-NOAA opened this issue 4 years ago • 2 comments

Currently the advice for what to do when certain iterations of a scenario don't converge (defined by either big gradients or non-positive definite Hessian) is to simply drop them and run more. I think we need some more formal guidance and practical tips for troubleshooting and fixing these issues. This is based on some conversations I recently had with @kellijohnson-NOAA and @Claudio-Castillo

So I thought this thread could serve as a place to have an open conversation about ideas and tricks people have used.

My understanding is that there are two main reasons for a non-positive-definite (NPD; i.e., non-invertible) Hessian. First, the MLE has a bad geometry like estimates on bounds or a ridge in the likelihood. The gradients will (can) be small but the Hessian NPD. This case may be indicated by having parameters estimated on bounds, which are reported in the scalar output file. A user can check this by manually turning off estimation of these parameters and rerunning to see if the Hessian is invertible. Another option is the function adnuts::check_identifiable which reads in the admodel.hes file and attempts to identify which parameters are causing issues. It's essentially in an alpha stage (buggy, inconsistent) but I hope to improve it and make it more useful for the ADMB community moving forward, so that is worth noting. Jittering or restarting without jittering may also help here.

A second is that the optimizer gets lost in the parameter space and cannot find the MLE (large gradients) and so the Hessian is meaningless, despite a valid MLE existing. One potential cause is bad initial values. SS is not robust to starting values, particularly if they lead to a crashed state. There are crash penalties and such but we should never expect any MLE to be reliable in this case. I think ss3sim is particularly susceptible to this case because of how the initial values are defaulted in the EM. All fixed effects, including the lnR0 parameter, are typically copied from the OM and thus initialized at the true values. In contrast, the recdevs are initialized at 0. This can easily lead to a situation where the EM is initialized at a crashed population because the large cohorts in the OM are not there so the catches extirpate the stock.

The easiest solution is to increase the initial value of lnR0 in the EM (e.g., +2) so the model starts with a much higher biomass and isn't crashed. One trick to check this is to run your scenario normally, which produces a column in the scalar file of "Hessian" of TRUE or FALSE indicating convergence. Then rerun the scenario with a different name but add run_ss3sim(..., admb_options=' -maxfn 0') which tells ADMB not to take any steps during optimization, causing the SS output files to be the stock at the initial values. You can then read the results in and plot the SSB time series to check which are crashed, possible mapping the Hessian field to split them. Here is an example with a stock with a high SigmaR which exacerbates the issue:

ts <- read.csv('ss3sim_ts.csv') %>% filter(year>25)
plot_ts_lines(ts, y='SpawnBio_em') + ggtitle('Init R0=True R0')

image

Note how all iterations are initialized at a crashed population. If we bump up the initial value for lnR0 it improves:

ts <- read.csv('ss3sim_ts_largeR0_inits.csv') %>% filter(year>25)
plot_ts_lines(ts, y='SpawnBio_em') + ggtitle('Init R0=large R0')

image

For those that are crashed, you can manually enter the OM recdevs into the em.ctl file to use as initial values and reoptimize the model from the command line (i.e., go into em folder of iteration N and modify/run things manually). If it convergences nicely then you know that the issue is the in the initial values, not case (1) above. Historically we intentionally avoided using the true recdevs because that is unreasonable in real assessment ("cheating"). I don't know what the solution is but this is an easy, fast way to check whether convergence issues are related to initial values or not which is very valuable information when troubleshooting.

One last thought is that this issue is exacerbated for larger sigmaR values because of the large swings in biomass. Essentially, these models are harder to initialize, especially in the context of a simulation where you cannot play with it manually into it works which is what I suspect authors do in real assessments.

I'm curious to see what others have found is useful for doing this. I also think we developers should provide more guidance and interpretation of how to define and assess convergence.

Cole-Monnahan-NOAA avatar May 12 '20 17:05 Cole-Monnahan-NOAA