Failure to set up a single molecule terminates entire MPI calculation

Open jchodera opened this issue 5 years ago • 1 comments

I'm working on adding an example to set up and run the entire FreeSolv set (in this branch), and noticed that this molecule takes out the entire calculation:

@<TRIPOS>MOLECULE
1,2,3,4,5-pentachloro-6-nitro-benzene
   14    14     0     0     0
SMALL
USER_CHARGES

@<TRIPOS>ATOM
      1 C1          1.8851   -1.0363   -0.1124 C.ar      1 B68        -0.1530
      2 C2          2.9210   -1.6309    0.6081 C.ar      1 B68         0.0694
      3 C3          2.8783   -1.6520    2.0022 C.ar      1 B68         0.0079
      4 C4          1.7997   -1.0786    2.6757 C.ar      1 B68         0.0540
      5 C5          0.7638   -0.4841    1.9551 C.ar      1 B68         0.0079
      6 C6          0.8066   -0.4630    0.5610 C.ar      1 B68         0.0694
      7 Cl1        -0.4785    0.2740   -0.3566 Cl        1 B68         0.0037
      8 Cl2        -0.5854    0.2338    2.7930 Cl        1 B68        -0.0124
      9 Cl3         1.7467   -1.1045    4.4182 Cl        1 B68        -0.0098
     10 Cl4         4.1749   -2.3959    2.8985 Cl        1 B68        -0.0124
     11 Cl5         4.2593   -2.3421   -0.2517 Cl        1 B68         0.0037
     12 N1          1.9282   -1.0154   -1.5413 N.pl3     1 B68         0.3184
     13 O1          0.9853   -0.4765   -2.1605 O.2       1 B68        -0.1734
     14 O2          2.9073   -1.5364   -2.1181 O.2       1 B68        -0.1734
@<TRIPOS>BOND
     1    1    6 ar
     2    1    2 ar
     3    2    3 ar
     4    3    4 ar
     5    4    5 ar
     6    5    6 ar
     7    6    7 1
     8    5    8 1
     9    4    9 1
    10    3   10 1
    11    2   11 1
    12    1   12 1
    13   12   13 2
    14   12   14 2

The antechamber error (with AmberTools 18) is:

Welcome to antechamber 17.3: molecular input file processor.

acdoctor mode is on: check and diagnosis problems in the input file.
-- Check Format for mol2 File --
   Status: pass
Info: Finished reading file (in.mol2).
-- Check Unusual Elements --
   Status: pass
-- Check Open Valences --
   Status: pass
-- Check Geometry --
      for those bonded   
      for those not bonded   
   Status: pass
-- Check Weird Bonds --
/lustre/atlas/proj-shared/chm126/yank/jchodera1/miniconda3/bin/to_be_dispatched/antechamber: Fatal Error!
Weird atomic valence (5) for atom (ID: 12, Name: N1).
       Please check atomic connectivity.
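For reference, the valence of 5 reported for N1 can be reproduced by summing the bond orders in the `@<TRIPOS>BOND` block above: atom 12 has one single bond (to C1) and two double bonds (to O1 and O2). The sketch below is only this arithmetic, not antechamber's actual check; the `BOND_ORDER` weights (including 1.5 for aromatic bonds) and the function name `valence_by_atom` are assumptions made for illustration.

```python
from collections import defaultdict

# Bond records copied from the @<TRIPOS>BOND block: (atom_i, atom_j, bond_type)
BONDS = [
    (1, 6, "ar"), (1, 2, "ar"), (2, 3, "ar"),
    (3, 4, "ar"), (4, 5, "ar"), (5, 6, "ar"),
    (6, 7, "1"), (5, 8, "1"), (4, 9, "1"),
    (3, 10, "1"), (2, 11, "1"), (1, 12, "1"),
    (12, 13, "2"), (12, 14, "2"),
]

# Assumed weights for illustration; antechamber's internal rules may differ
BOND_ORDER = {"1": 1.0, "2": 2.0, "3": 3.0, "ar": 1.5}


def valence_by_atom(bonds):
    """Sum the bond orders touching each atom ID."""
    totals = defaultdict(float)
    for atom_i, atom_j, kind in bonds:
        order = BOND_ORDER[kind]
        totals[atom_i] += order
        totals[atom_j] += order
    return dict(totals)


valences = valence_by_atom(BONDS)
print(valences[12])  # N1: 1 (C-N) + 2 (N=O) + 2 (N=O) = 5.0
```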

This terminates the entire MPI run with:

2018-09-08 23:31:32,496: ERROR - yank.mpi - MPI node 1190/1300 raised exception.
NoneType: None
2018-09-08 23:31:32,499: CRITICAL - yank.mpi - MPI node 1190/1300 called Abort()!
Rank 1189 [Sat Sep  8 23:31:32 2018] [c10-7c0s7n0] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1189
Traceback (most recent call last):
  File "/var//opt/cray/alps/spool/18765043/yank", line 11, in <module>
    load_entry_point('yank==0.23.4', 'console_scripts', 'yank')()
  File "/lustre/atlas/proj-shared/chm126/yank/jchodera1/miniconda3/lib/python3.6/site-packages/yank/cli.py", line 73, in main
    dispatched = getattr(commands, command).dispatch(command_args)
  File "/lustre/atlas/proj-shared/chm126/yank/jchodera1/miniconda3/lib/python3.6/site-packages/yank/commands/script.py", line 138, in dispatch
    yaml_builder.run_experiments()
  File "/lustre/atlas/proj-shared/chm126/yank/jchodera1/miniconda3/lib/python3.6/site-packages/yank/experiment.py", line 774, in run_experiments
    self._setup_experiments()
  File "/lustre/atlas/proj-shared/chm126/yank/jchodera1/miniconda3/lib/python3.6/site-packages/yank/experiment.py", line 2411, in _setup_experiments
    self._db.setup_all_systems()
  File "/lustre/atlas/proj-shared/chm126/yank/jchodera1/miniconda3/lib/python3.6/site-packages/yank/pipeline.py", line 1382, in setup_all_systems
    send_results_to=None, group_size=1, sync_nodes=True)
  File "/lustre/atlas/proj-shared/chm126/yank/jchodera1/miniconda3/lib/python3.6/site-packages/yank/mpi.py", line 511, in distribute
    *other_args, **kwargs)
  File "/lustre/atlas/proj-shared/chm126/yank/jchodera1/miniconda3/lib/python3.6/site-packages/yank/mpi.py", line 364, in exec_tasks
    raise error
  File "/lustre/atlas/proj-shared/chm126/yank/jchodera1/miniconda3/lib/python3.6/site-packages/yank/mpi.py", line 349, in exec_tasks
    results.append(task(distributed_arg, *other_args, **kwargs))
  File "/lustre/atlas/proj-shared/chm126/yank/jchodera1/miniconda3/lib/python3.6/site-packages/yank/pipeline.py", line 1689, in _setup_molecules
    gaff_version=gaff)
  File "/lustre/atlas/proj-shared/chm126/yank/jchodera1/miniconda3/lib/python3.6/site-packages/openmoltools/amber.py", line 372, in run_antechamber
_pmiu_daemon(SIGCHLD): [NID 04157] [c10-7c1s1n3] [Sat Sep  8 23:31:32 2018] PE RANK 1299 exit signal Aborted

By default, we should log these per-molecule setup failures and continue with the remaining molecules rather than taking out the whole calculation.
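A minimal sketch of the behavior being requested, assuming a generic per-item task runner: each failure is caught, logged, and recorded, and the loop moves on to the next item. This is not YANK's `mpi.distribute` or its actual API; `run_tasks_tolerantly` and `fake_setup` are hypothetical names used only to illustrate the idea in a single process.

```python
import logging

logger = logging.getLogger("setup_sketch")


def run_tasks_tolerantly(task, items):
    """Apply task to each item, logging failures instead of re-raising them."""
    results = {}
    failures = {}
    for item in items:
        try:
            results[item] = task(item)
        except Exception as error:
            # Record the failure and keep going with the remaining items
            logger.error("Setup failed for %r: %s", item, error)
            failures[item] = error
    return results, failures


def fake_setup(name):
    # Hypothetical stand-in for a per-molecule setup step
    if name == "bad":
        raise RuntimeError("Weird atomic valence")
    return name.upper()


results, failures = run_tasks_tolerantly(fake_setup, ["a", "bad", "b"])
```

Here `results` holds the two molecules that set up cleanly, and `failures` maps the failed one to its exception so it can be reported at the end of the run.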

jchodera, Sep 09 '18 14:09