HUB for better error messages discussion

Open jeremyleung521 opened this issue 5 years ago • 7 comments

This is part of an initiative to improve error messages so end users can better know how to debug their simulations.

This post will act as a hub where you can introduce new error messages you'd like improved. It will also link to other issues on specific error messages for further discussion and present potential fixes. You may present additional situations where those errors might appear in those issues.

Simulation Setup ("Compile Time")

  • Pcoord shape error: #164
  • assert abs(1 - norm) < EPS*(len(segments)+n_active_bins): https://github.com/westpa/westpa/issues/163#issuecomment-1126798421
  • Failure when paths listed in west.cfg do not exist: https://github.com/westpa/westpa/issues/163#issuecomment-1126798648

Running a Simulation ("Run Time")

  • Divide by zero/out of list error: #166
  • Propagation fail: https://github.com/westpa/westpa/issues/163#issuecomment-1126797210

Simulation Analysis

  • KeyError/Hash Error: #168
  • Value out of bin boundaries: https://github.com/westpa/westpa/issues/163#issuecomment-1126799431

The other/older thread

  • #26

jeremyleung521 avatar Apr 14 '21 19:04 jeremyleung521

I've noticed there is an older issue thread, #26, on improving error messages. Its branch was never merged (https://github.com/westpa/westpa/tree/error_handling).

I haven't had the chance to go through everything but I think it's definitely worth implementing some of those changes (that I've read) into WESTPA 2.0, especially if they fix some of the issues listed here.

jeremyleung521 avatar Apr 26 '21 13:04 jeremyleung521

I have been running into this vague error message when trying to analyze simulations with w_ipa (or even with w_assign by itself):

(westpa-2.0-test) ~/Documents/odld$ w_ipa
Welcome to w_ipa (WESTPA Interactive Python Analysis) v. 1.0B!
Run w.introduction for a more thorough introduction, or w.help to see a list of options.
Running analysis & loading files.
Reanalyzing file assign.h5 for scheme DEFAULT.
Traceback (most recent call last):
  File "/home/atb43/apps/anaconda3/envs/westpa-2.0-test/bin/w_ipa", line 33, in <module>
    sys.exit(load_entry_point('westpa', 'console_scripts', 'w_ipa')())
  File "/home/atb43/Documents/westpa/src/westpa/cli/tools/w_ipa.py", line 815, in entry_point
    w.main()    11%  [===========                                                                                             ]
  File "/home/atb43/Documents/westpa/src/westpa/tools/core.py", line 171, in main
    self.go()
  File "/home/atb43/Documents/westpa/src/westpa/cli/tools/w_ipa.py", line 733, in go
    self.analysis_structure()
  File "/home/atb43/Documents/westpa/src/westpa/cli/tools/w_ipa.py", line 324, in analysis_structure
    assign.go()
  File "/home/atb43/Documents/westpa/src/westpa/cli/tools/w_assign.py", line 574, in go
    assignments, trajlabels, pops, statelabels = self.assign_iteration(n_iter, nstates, nbins, state_map, last_labels)
  File "/home/atb43/Documents/westpa/src/westpa/cli/tools/w_assign.py", line 443, in assign_iteration
    assign_slice, traj_slice, slice_pops, lb, ub, state_slice = future.get_result(discard=True)
  File "/home/atb43/Documents/westpa/src/westpa/work_managers/core.py", line 334, in get_result
    raise self._exception.with_traceback(self._traceback)
TypeError: __traceback__ must be a traceback or None

I know I've seen this before somewhere. The issue turns out to be that I have pcoord values that fall outside the analysis bins I've created. For instance, if I set my analysis bins to [0, 2, 9, 10] and run w_ipa or w_assign, I get that error. Setting my analysis bins to [0, 2, 9, 10, 'inf'] fixes the issue, which suggests that I have walkers with pcoord values greater than 10. It would be nice to have an error message here that tells users why the crash happened and maybe how to fix it. I'm posting this here since it may be part of a bigger-picture error message we want to enhance in the analysis tools, and I don't know enough about it to open a separate issue.
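To make the failure mode concrete, here is a minimal sketch of the kind of pre-check that could produce a clearer message. It is plain NumPy, not WESTPA code: `find_out_of_range` is a hypothetical helper, the pcoord is assumed to be one-dimensional, and the bins are assumed to be half-open (lower edge included, upper edge excluded).

```python
import numpy as np


def find_out_of_range(pcoord, boundaries):
    """Return the indices of pcoord values that fall outside the bin boundaries.

    Hypothetical helper, not part of WESTPA. Assumes a 1-D pcoord and
    half-open bins, i.e. a value is valid if first_edge <= value < last_edge.
    """
    values = np.asarray(pcoord, dtype=float)
    # float('inf') accepts the string 'inf', matching the boundary list above.
    edges = np.array([float(b) for b in boundaries])
    outside = (values < edges[0]) | (values >= edges[-1]) | np.isnan(values)
    return np.flatnonzero(outside)


# Made-up pcoord values for illustration only.
pcoord = [0.5, 3.2, 9.9, 10.4, 12.0]

bad_closed = find_out_of_range(pcoord, [0, 2, 9, 10]).tolist()        # [3, 4]
bad_open = find_out_of_range(pcoord, [0, 2, 9, 10, 'inf']).tolist()   # []

if bad_closed:
    print(f"{len(bad_closed)} pcoord value(s) fall outside the analysis bins "
          f"(indices {bad_closed}); consider adding 'inf' as the last boundary.")
```

With the closed boundary list, the two values above 10 are flagged. With 'inf' appended, nothing is flagged, which matches the behaviour described above.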

AnthonyBogetti avatar Oct 13 '21 20:10 AnthonyBogetti

Repost of #167

Propagation Failed

Sample Error:

  sim_manager.py", line 696, in check_propagation
    "propagation failed for {:d} segments".format(len(failed_segments))
westpa.sim_manager.PropagationError: propagation failed for 14 segments

Relevant Code: https://github.com/westpa/westpa/blob/main/src/west/sim_manager.py#L524

Potential Causes:

  • An error in the propagator or in the compute resources caused the propagation to fail
  • Something wrong with Sim manager?

Potential Fixes:

  1. ??

jeremyleung521 avatar May 14 '22 19:05 jeremyleung521

Repost of #165

Weights are not normalized assertion error

See that issue for some of Matt and John's discussion on it.

Sample Error: assert abs(1 - norm) < EPS*(len(segments)+n_active_bins)

Relevant Code: https://github.com/westpa/westpa/blob/main/src/west/sim_manager.py#L143

Potential Causes:

  • Tolerance in norm is not the most rigorous (https://groups.google.com/g/westpa-users/c/hu8UL6Ig2e4/m/7Bp1Y-556U4J)
  • Custom system.py is causing problems where walkers are being removed
  • bstate and tstate overlap, so every walker is immediately recycled, causing weights to double
  • "Loss" of probability just from accumulated floating-point error (https://github.com/westpa/westpa/issues/165#issuecomment-947066531)

Potential Fixes:

  1. The last cause (accumulated floating-point error) is fixed by explicit re-normalization in w_init (#214)

jeremyleung521 avatar May 14 '22 19:05 jeremyleung521

Repost of #185

More informative error message when paths in west.cfg do not exist

by jdrusso

Sample Error:

System is being built only off of the system driver
Restart plugin initialized
Maximum wallclock time: 14 days, 0:00:00

Mon Aug  9 16:34:09 2021
Iteration 1 (250 requested)
Beginning iteration 1
4 segments remain in iteration 1 (4 total)
1 of 52 (1.923077%) active bins are populated
per-bin minimum non-zero probability:       1
per-bin maximum probability:                1
per-bin probability dynamic range (kT):     0
per-segment minimum non-zero probability:   0.25
per-segment maximum non-zero probability:   0.25
per-segment probability dynamic range (kT): 0
norm = 1, error in norm = 0 (0*epsilon)
exception caught; shutting down
-- ERROR    [w_run] -- Traceback (most recent call last):
  File "/home/groups/ZuckermanLab/russojd/westpa/westpa/src/westpa/cli/core/w_run.py", line 62, in run_simulation
    sim_manager.run()
  File "/home/groups/ZuckermanLab/russojd/westpa/westpa/src/westpa/core/sim_manager.py", line 695, in run
    self.propagate()
  File "/home/groups/ZuckermanLab/russojd/westpa/westpa/src/westpa/core/sim_manager.py", line 540, in propagate
    incoming = future.get_result()
  File "/home/groups/ZuckermanLab/russojd/westpa/westpa/src/westpa/work_managers/core.py", line 334, in get_result
    raise self._exception.with_traceback(self._traceback)
TypeError: __traceback__ must be a traceback or None

Relevant code:

Potential causes:

When the paths provided in west.cfg don't exist, it can cause all sorts of arcane error messages that completely depend on the propagator.

The above is an example of using the executable propagator with AMBER when seg_logs and traj_segs don't exist. w_init even runs fine, but when you run w_run you get the above.

Potential fixes:

Ideally, a more meaningful traceback should get passed back (not sure why it doesn't get passed back up here).
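For reference, the TypeError at the bottom of the traceback can be reproduced in plain Python whenever with_traceback() is given something that is not a traceback object, such as a formatted string. Whether that is what actually happens in work_managers/core.py is an assumption and has not been confirmed in this thread. The sketch below uses hypothetical names throughout (`run_task`, `capture_failure`, `reraise_masked`, `reraise_chained`) and shows one possible way to keep the original error visible.

```python
import traceback


def run_task():
    """Stand-in for a worker task that fails (hypothetical)."""
    raise FileNotFoundError("traj_segs does not exist")


def capture_failure():
    """Capture the exception plus a *string* rendering of its traceback."""
    try:
        run_task()
    except Exception as exc:
        return exc, traceback.format_exc()


def reraise_masked(exc, tb_text):
    # with_traceback() only accepts a traceback object or None, so passing
    # the formatted string raises TypeError and hides the original error.
    raise exc.with_traceback(tb_text)


def reraise_chained(exc, tb_text):
    # One alternative: put the formatted traceback in the message and chain
    # the original exception so the real cause stays visible.
    raise RuntimeError("task failed in worker:\n" + tb_text) from exc


exc, tb_text = capture_failure()

try:
    reraise_masked(exc, tb_text)
except TypeError as err:
    masked_message = str(err)

try:
    reraise_chained(exc, tb_text)
except RuntimeError as err:
    chained_error = err

print(masked_message)
```

In the masked case the user only sees the TypeError. In the chained case the message contains the worker's traceback text and `__cause__` still points at the original FileNotFoundError.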

A really nice check would be making sure anything that's supposed to be a path in west.cfg is in fact a valid, existing path.

self.segment_ref_template, self.basis_state_ref_template, and self.initial_state_ref_template should be validated as paths whenever they're set in westpa.rc. I'm not sure where this happens.

Similarly, when loading in stdout/stderr/cwd at https://github.com/westpa/westpa/blob/westpa-2.0-restruct/src/westpa/core/propagators/executable.py#L142, we should use yamlcfg.YAMLConfig.get_path instead of get, which I think will handle validating the paths.
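A rough sketch of what such a path check could look like is below. It is deliberately independent of WESTPA's configuration classes: `find_missing_paths` and `require_paths` are hypothetical helpers, and the dictionary of named paths stands in for whatever west.cfg entries are supposed to be paths.

```python
import tempfile
from pathlib import Path


def find_missing_paths(named_paths):
    """Return {name: path} for every configured path that does not exist."""
    return {
        name: str(path)
        for name, path in named_paths.items()
        if not Path(path).expanduser().exists()
    }


def require_paths(named_paths):
    """Raise one readable error listing all missing paths (hypothetical helper)."""
    missing = find_missing_paths(named_paths)
    if missing:
        details = ", ".join(f"{name}={path!r}" for name, path in sorted(missing.items()))
        raise FileNotFoundError(
            f"paths referenced by the configuration do not exist: {details}"
        )


# Demo in a throwaway directory: seg_logs exists, traj_segs does not.
with tempfile.TemporaryDirectory() as sim_root:
    (Path(sim_root) / "seg_logs").mkdir()
    configured = {
        "seg_logs": Path(sim_root) / "seg_logs",
        "traj_segs": Path(sim_root) / "traj_segs",
    }
    missing = find_missing_paths(configured)

print(sorted(missing))
```

Running a check like this before the first iteration would let w_run stop with a message that names the missing directory instead of failing inside the propagator.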

jeremyleung521 avatar May 14 '22 19:05 jeremyleung521

Repost of #170

w_pdist ValueError: Value out of bin boundaries

by jdrusso

Sample Error:

Traceback (most recent call last):
  ....
  File "/Users/russojd/anaconda3/envs/westpa-2020/westpa-2020.03/lib/west_tools/w_pdist.py", line 53, in _remote_bin_iter
    histnd(dset[:,ipt,:], binbounds, weights, out=iter_hist, binbound_check = False, ignore_out_of_range=ignore_out_of_range)
  File "fasthist/_fasthist.pyx", line 39, in fasthist._fasthist.histnd
  File "fasthist/_fasthist.pyx", line 101, in fasthist._fasthist.histnd
  File "fasthist/_fasthist.pyx", line 226, in fasthist._fasthist._histnd
ValueError: value nan at index 194 out of bin boundaries in dimension 0

Relevant code:

w_pdist

Potential causes:

Walker blowing up

This simulation was run for 3841 iterations. By tweaking --first-iter and --last-iter, I narrowed the problem down to iteration 2752. Here, row 191 of west.h5/iterations/iter_00002752 shows that after 4 steps, this particular walker blew up and started reporting an infinity and then NaNs. This walker continued to report NaNs in subsequent iterations.

Potential fixes:

This error should more explicitly indicate that a walker has blown up.

However, this does raise some additional questions -- w_fluxanl ran with no complaints on this data, and WESTPA apparently encountered these NaNs and silently kept running. Seems like we should at least raise a warning in the output of every iteration if there's a blown-up walker.

Additional Comments:

by jdrusso

After corresponding a bit with atbogetti, it seems that WESTPA indeed checks for explicit failure of the propagator (i.e. crashing outright), but not for sensible values of the progress coordinate. I propose a simple post-iteration check for NaN or inf values. I'd love to hear thoughts on whether this would make more sense as a warning (one of your walkers has failed, proceed with caution) or an error (your dynamics are blowing up, fix it and start over).
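The proposed check could look roughly like the following sketch. It is standalone NumPy rather than a WESTPA plugin: `find_nonfinite_walkers` and `check_iteration` are hypothetical names, and the pcoord array is assumed to have shape (n_segments, n_timepoints, n_dims), consistent with the dset[:,ipt,:] indexing in the traceback above. The `strict` flag covers both options (warning or error).

```python
import warnings

import numpy as np


def find_nonfinite_walkers(pcoords):
    """Return indices of segments whose pcoord contains NaN or +/-inf.

    Assumes pcoords has shape (n_segments, n_timepoints, n_dims).
    """
    pcoords = np.asarray(pcoords, dtype=float)
    bad = ~np.isfinite(pcoords)
    # Collapse the time and dimension axes: one flag per segment.
    return np.flatnonzero(bad.reshape(len(pcoords), -1).any(axis=1))


def check_iteration(pcoords, n_iter, strict=False):
    """Warn (or raise, if strict) when any walker reports a non-finite pcoord."""
    bad = find_nonfinite_walkers(pcoords).tolist()
    if not bad:
        return bad
    message = (f"iteration {n_iter}: {len(bad)} walker(s) reported NaN/inf "
               f"progress coordinates (segment indices {bad})")
    if strict:
        raise ValueError(message)
    warnings.warn(message)
    return bad


# Made-up data: 3 walkers, 6 time points, 1 dimension; walker 1 blows up.
pcoords = np.zeros((3, 6, 1))
pcoords[1, 4, 0] = np.inf
pcoords[1, 5, 0] = np.nan

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = check_iteration(pcoords, n_iter=2752)

print(flagged, len(caught))
```

A warning keeps long runs alive while still leaving a trace in the log for each affected iteration, whereas the strict mode stops the run at the first blown-up walker.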

jeremyleung521 avatar May 14 '22 19:05 jeremyleung521