HUB for better error messages discussion
This is part of an initiative to improve error messages so end users can better know how to debug their simulations.
This post will act as a hub where you can introduce new error messages you'd like improved. It will also link to the issues on specific error messages, where further discussion and potential fixes are collected. You may add other situations in which those errors appear to those issues.
Simulation Setup ("Compile Time")
- Pcoord shape error: #164
- assert abs(1 - norm) < EPS*(len(segments)+n_active_bins): https://github.com/westpa/westpa/issues/163#issuecomment-1126798421
- Failure when paths listed in west.cfg do not exist: https://github.com/westpa/westpa/issues/163#issuecomment-1126798648
Running a Simulation ("Run Time")
- Divide by zero/out of list error: #166
- Propagation fail: https://github.com/westpa/westpa/issues/163#issuecomment-1126797210
Simulation Analysis
- KeyError/Hash Error: #168
- Value out of bin boundaries: https://github.com/westpa/westpa/issues/163#issuecomment-1126799431
The other/older thread
- #26
I've noticed there is an older issue thread, #26, on improving error messages; its branch was never merged (https://github.com/westpa/westpa/tree/error_handling).
I haven't had the chance to go through everything, but I think it's definitely worth implementing some of those changes (the ones I've read) in WESTPA 2.0, especially if they fix some of the issues listed here.
I have been running into this vague error message when trying to analyze simulations with w_ipa (or even with w_assign by itself):
(westpa-2.0-test) ~/Documents/odld$ w_ipa
Welcome to w_ipa (WESTPA Interactive Python Analysis) v. 1.0B!
Run w.introduction for a more thorough introduction, or w.help to see a list of options.
Running analysis & loading files.
Reanalyzing file assign.h5 for scheme DEFAULT.
Traceback (most recent call last):
File "/home/atb43/apps/anaconda3/envs/westpa-2.0-test/bin/w_ipa", line 33, in <module>
sys.exit(load_entry_point('westpa', 'console_scripts', 'w_ipa')())
File "/home/atb43/Documents/westpa/src/westpa/cli/tools/w_ipa.py", line 815, in entry_point
w.main() 11% [=========== ]
File "/home/atb43/Documents/westpa/src/westpa/tools/core.py", line 171, in main
self.go()
File "/home/atb43/Documents/westpa/src/westpa/cli/tools/w_ipa.py", line 733, in go
self.analysis_structure()
File "/home/atb43/Documents/westpa/src/westpa/cli/tools/w_ipa.py", line 324, in analysis_structure
assign.go()
File "/home/atb43/Documents/westpa/src/westpa/cli/tools/w_assign.py", line 574, in go
assignments, trajlabels, pops, statelabels = self.assign_iteration(n_iter, nstates, nbins, state_map, last_labels)
File "/home/atb43/Documents/westpa/src/westpa/cli/tools/w_assign.py", line 443, in assign_iteration
assign_slice, traj_slice, slice_pops, lb, ub, state_slice = future.get_result(discard=True)
File "/home/atb43/Documents/westpa/src/westpa/work_managers/core.py", line 334, in get_result
raise self._exception.with_traceback(self._traceback)
TypeError: __traceback__ must be a traceback or None
I've seen this before: it happens when some pcoord values fall outside the analysis bins I've created. For instance, if I set my analysis bins to [0, 2, 9, 10] and run w_ipa or w_assign, I get that error. Setting my analysis bins to [0, 2, 9, 10, 'inf'] fixes the issue, which suggests that I have walkers with pcoord values greater than 10. It would be nice to have an error message here that tells users why the crash happened and maybe how to fix it. I'm posting this here since it may be part of a bigger-picture error message we want to improve in the analysis tools, and I don't know enough about it to open a separate issue.
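To make the request concrete, here is a minimal sketch of the kind of up-front check that could produce a readable message before assignment starts. Everything in it is hypothetical: `check_values_in_bins` is not a WESTPA function, it only handles one pcoord dimension, and it assumes bins are half-open intervals `[lower, upper)`. It uses plain Python so it runs without WESTPA installed.

```python
def check_values_in_bins(values, boundaries):
    """Raise a descriptive ValueError if any value lies outside the bin boundaries.

    Hypothetical helper for illustration only; not part of WESTPA.
    `boundaries` is a sorted list of bin edges, e.g. [0, 2, 9, 10]
    or [0, 2, 9, 10, 'inf'].
    """
    # float() accepts the strings 'inf' and '-inf', as used in the bin list above.
    edges = [float(b) for b in boundaries]
    lower, upper = edges[0], edges[-1]

    # Assumption: bins cover the half-open range [lower, upper).
    # NaN fails both comparisons, so it is reported as out of range too.
    bad = [(i, v) for i, v in enumerate(values) if not (lower <= v < upper)]
    if bad:
        index, value = bad[0]
        raise ValueError(
            f"{len(bad)} progress coordinate value(s) fall outside the analysis bins "
            f"[{lower}, {upper}); first offender is {value} at index {index}. "
            "Consider extending the outermost bin boundaries (for example with 'inf')."
        )


# Passes silently once the last bin is open-ended:
check_values_in_bins([0.5, 12.0], [0, 2, 9, 10, 'inf'])
```

With the bins set to [0, 2, 9, 10], the same call raises a ValueError that names the offending value and suggests the 'inf' fix, instead of the `__traceback__` TypeError.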
Repost of #167
Propagation Failed
Sample Error:
sim_manager.py", line 696, in check_propagation
"propagation failed for {:d} segments".format(len(failed_segments))
westpa.sim_manager.PropagationError: propagation failed for 14 segments
Relevant Code: https://github.com/westpa/westpa/blob/main/src/west/sim_manager.py#L524
Potential Causes:
- An error in the propagator or in the compute resources caused the propagation to fail
- Something wrong with the sim manager?
Potential Fixes:
- ??
repost of #165
Weights are not normalized assertion error
See that issue for some of Matt and John's discussion on it.
Sample Error:
assert abs(1 - norm) < EPS*(len(segments)+n_active_bins)
Relevant Code: https://github.com/westpa/westpa/blob/main/src/west/sim_manager.py#L143
Potential Causes:
- Tolerance in norm is not the most rigorous (https://groups.google.com/g/westpa-users/c/hu8UL6Ig2e4/m/7Bp1Y-556U4J)
- Custom system.py is causing problems where walkers are being removed
- bstate and tstate overlap, so any walker is immediately recycled, causing weights to double
- "Loss" of probability just from accumulated floating-point error (https://github.com/westpa/westpa/issues/165#issuecomment-947066531)
Potential Fixes:
- Last point is fixed with explicit re-normalization in w_init (#214)
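As a rough illustration of the two ideas above (a more informative failure than a bare assert, and explicit re-normalization), here is a sketch in plain Python. The function names are hypothetical and this is not the code from #214. It assumes float64 weights, so `EPS` is taken to be machine epsilon, and it reuses the tolerance from the quoted assertion.

```python
import math
import sys

# Assumption: weights are float64, so EPS is double-precision machine epsilon.
EPS = sys.float_info.epsilon


def check_norm(weights, n_active_bins):
    """Return the total weight, or raise ValueError with hints if it is not ~1.

    Hypothetical replacement for the bare assert; not WESTPA's implementation.
    """
    norm = math.fsum(weights)
    error = abs(1 - norm)
    # Same tolerance as the quoted assertion: EPS * (n_segments + n_active_bins).
    tolerance = EPS * (len(weights) + n_active_bins)
    if error >= tolerance:
        raise ValueError(
            f"weights sum to {norm!r}, off from 1 by {error:.3e} "
            f"(tolerance {tolerance:.3e}). Possible causes: overlapping basis and "
            "target states, walkers removed by a custom system.py, or accumulated "
            "floating-point error."
        )
    return norm


def renormalize(weights):
    """Rescale weights so they sum to 1 (explicit re-normalization)."""
    norm = math.fsum(weights)
    return [w / norm for w in weights]


# Doubled weights fail check_norm, but pass again after re-normalization.
doubled = [0.5, 0.5, 0.5, 0.5]
check_norm(renormalize(doubled), n_active_bins=1)
```

Re-normalizing is only appropriate for the floating-point drift case; for the other causes it would hide a real setup problem, which is why the message lists them.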
repost of #185
More informative error message when paths in west.cfg do not exist
by jdrusso
Sample Error:
System is being built only off of the system driver
Restart plugin initialized
Maximum wallclock time: 14 days, 0:00:00
Mon Aug 9 16:34:09 2021
Iteration 1 (250 requested)
Beginning iteration 1
4 segments remain in iteration 1 (4 total)
1 of 52 (1.923077%) active bins are populated
per-bin minimum non-zero probability: 1
per-bin maximum probability: 1
per-bin probability dynamic range (kT): 0
per-segment minimum non-zero probability: 0.25
per-segment maximum non-zero probability: 0.25
per-segment probability dynamic range (kT): 0
norm = 1, error in norm = 0 (0*epsilon)
exception caught; shutting down
-- ERROR [w_run] -- Traceback (most recent call last):
File "/home/groups/ZuckermanLab/russojd/westpa/westpa/src/westpa/cli/core/w_run.py", line 62, in run_simulation
sim_manager.run()
File "/home/groups/ZuckermanLab/russojd/westpa/westpa/src/westpa/core/sim_manager.py", line 695, in run
self.propagate()
File "/home/groups/ZuckermanLab/russojd/westpa/westpa/src/westpa/core/sim_manager.py", line 540, in propagate
incoming = future.get_result()
File "/home/groups/ZuckermanLab/russojd/westpa/westpa/src/westpa/work_managers/core.py", line 334, in get_result
raise self._exception.with_traceback(self._traceback)
TypeError: __traceback__ must be a traceback or None
Relevant code:
Potential causes:
When the paths provided in west.cfg don't exist, it can cause all sorts of arcane error messages that completely depend on the propagator.
The above is an example of using the executable propagator w/ AMBER, when seg_logs and traj_segs don't exist. w_init even runs fine, but when you w_run you get the above.
Potential fixes:
Ideally, a more meaningful traceback should get passed back (not sure why it doesn't get passed back up here).
A really nice check would be making sure anything that's supposed to be a path in west.cfg is in fact a valid, existing path.
self.segment_ref_template, self.basis_state_ref_template, and self.initial_state_ref_template should be validated as paths whenever they're set in westpa.rc. Not sure where this happens.
Similarly, when loading in stdout/stderr/cwd at https://github.com/westpa/westpa/blob/westpa-2.0-restruct/src/westpa/core/propagators/executable.py#L142, we should use yamlcfg.YAMLConfig.get_path instead of get, which I think will handle validating the paths.
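For the "check that every path exists" idea, a minimal sketch could look like the following. The helper names and the dictionary of paths are hypothetical, and nothing here touches westpa.rc or yamlcfg. Template-style entries such as segment_ref_template would first need their fixed leading directory extracted, which is not shown.

```python
from pathlib import Path


def find_missing_paths(named_paths):
    """Return {name: path} for every configured path that does not exist.

    Hypothetical helper; `named_paths` maps a config key to a path string.
    """
    return {
        name: path
        for name, path in named_paths.items()
        if not Path(path).expanduser().exists()
    }


def validate_paths(named_paths):
    """Raise FileNotFoundError listing every missing path, before propagation starts."""
    missing = find_missing_paths(named_paths)
    if missing:
        details = "\n".join(f"  {name}: {path}" for name, path in sorted(missing.items()))
        raise FileNotFoundError(
            "The following paths from west.cfg do not exist:\n" + details
        )
```

Calling something like `validate_paths({"seg_logs": "seg_logs", "traj_segs": "traj_segs"})` at startup would name the missing directories directly, rather than letting the propagator fail later with an unrelated-looking traceback.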
repost of #170
w_pdist ValueError: Value out of bin boundaries
by jdrusso
Sample Error:
Traceback (most recent call last):
....
File "/Users/russojd/anaconda3/envs/westpa-2020/westpa-2020.03/lib/west_tools/w_pdist.py", line 53, in _remote_bin_iter
histnd(dset[:,ipt,:], binbounds, weights, out=iter_hist, binbound_check = False, ignore_out_of_range=ignore_out_of_range)
File "fasthist/_fasthist.pyx", line 39, in fasthist._fasthist.histnd
File "fasthist/_fasthist.pyx", line 101, in fasthist._fasthist.histnd
File "fasthist/_fasthist.pyx", line 226, in fasthist._fasthist._histnd
ValueError: value nan at index 194 out of bin boundaries in dimension 0
Relevant code:
w_pdist
Potential causes:
Walker blowing up
This simulation was run for 3841 iterations. By tweaking --first-iter and --last-iter, I narrowed the problem down to iteration 2752. Here, row 191 of west.h5/iterations/iter_00002752 shows that after 4 steps, this particular walker blew up and started reporting an infinity and then NaNs. This walker continued to report NaNs in subsequent iterations.
Potential fixes:
This error should more explicitly indicate that a walker has blown up.
However, this does raise some additional questions -- w_fluxanl ran with no complaints on this data, and WESTPA apparently encountered these NaNs and silently kept running. Seems like we should at least raise a warning in the output of every iteration if there's a blown-up walker.
Additional Comments:
by jdrusso
After corresponding a bit with atbogetti, it seems that WESTPA does check for explicit failure of the propagator (i.e. crashing outright), but not for sensible values of the progress coordinate. I propose a simple post-iteration check for NaN or inf values. I'd love to hear thoughts on whether this would make more sense as a warning (one of your walkers has failed, proceed with caution) or an error (your dynamics are blowing up, fix it and start over).
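A sketch of what such a post-iteration check might look like, with the warning-versus-error choice left as a flag. This is illustrative only: the function names, the `strict` parameter, and the nested-list layout `pcoords[segment][time point][dimension]` are assumptions, not WESTPA's API or data layout.

```python
import math
import warnings


def find_nonfinite_walkers(pcoords):
    """Return (segment index, first bad time point) for each walker with NaN or inf.

    Assumed layout: pcoords[segment][time point][dimension], as nested lists.
    """
    failed = []
    for seg_id, trajectory in enumerate(pcoords):
        for t, point in enumerate(trajectory):
            if not all(math.isfinite(x) for x in point):
                failed.append((seg_id, t))
                break  # only report the first bad time point per walker
    return failed


def check_pcoords(pcoords, n_iter, strict=False):
    """Warn (default) or raise (strict=True) if any walker has blown up."""
    failed = find_nonfinite_walkers(pcoords)
    if not failed:
        return failed
    seg_id, t = failed[0]
    message = (
        f"iteration {n_iter}: {len(failed)} walker(s) reported NaN or inf progress "
        f"coordinates; first at segment {seg_id}, time point {t}"
    )
    if strict:
        raise ValueError(message)
    warnings.warn(message, RuntimeWarning)
    return failed
```

Defaulting to a warning keeps long runs alive while still putting the blown-up walker in the log for every iteration it persists; `strict=True` covers the "fix it and start over" option.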