ommprotocol

CUDA error (700)

Open bbrouwr opened this issue 6 years ago • 3 comments

Hi!

I'm having trouble running MD simulations of a metalloprotein (LmrR, 139k atoms in the system) with OMMprotocol. I keep randomly getting this CUDA error (700), and because of it I sometimes have to restart several times to get the simulation to finish. Other times I'm lucky and it finishes without problems. It's quite annoying; do you happen to know of any solutions for this error?

Thanks, BBrouwer

An error occurred: Error invoking kernel: CUDA error (700)
Saving state... FAILED :(
Traceback (most recent call last):
  File "/QFsoft/applic/python/conda/envs/openmm-7.1/bin/ommprotocol", line 11, in <module>
    load_entry_point('ommprotocol==0.1.10', 'console_scripts', 'ommprotocol')()
  File "/QFsoft/applic/python/conda/envs/openmm-7.1/lib/python3.6/site-packages/ommprotocol-0.1.10-py3.6.egg/ommprotocol/__init__.py", line 63, in run_protocol
  File "/QFsoft/applic/python/conda/envs/openmm-7.1/lib/python3.6/site-packages/ommprotocol-0.1.10-py3.6.egg/ommprotocol/md.py", line 85, in protocol
  File "/QFsoft/applic/python/conda/envs/openmm-7.1/lib/python3.6/site-packages/ommprotocol-0.1.10-py3.6.egg/ommprotocol/md.py", line 316, in run
  File "/QFsoft/applic/python/conda/envs/openmm-7.1/lib/python3.6/site-packages/ommprotocol-0.1.10-py3.6.egg/ommprotocol/md.py", line 345, in simulate
  File "/QFsoft/applic/python/conda/envs/openmm-7.1/lib/python3.6/site-packages/simtk/openmm/app/simulation.py", line 132, in step
    self._simulate(endStep=self.currentStep+steps)
  File "/QFsoft/applic/python/conda/envs/openmm-7.1/lib/python3.6/site-packages/simtk/openmm/app/simulation.py", line 194, in _simulate
    self.integrator.step(10) # Only take 10 steps at a time, to give Python more chances to respond to a control-c.
  File "/QFsoft/applic/python/conda/envs/openmm-7.1/lib/python3.6/site-packages/simtk/openmm/openmm.py", line 17086, in step
    return _openmm.LangevinIntegrator_step(self, steps)
Exception: Error invoking kernel: CUDA error (700)

bbrouwr • Jun 13 '19 09:06

This is related to OpenMM rather than to ommprotocol itself. We have seen this error several times and for different reasons, but it's normally due to driver/CUDA runtime incompatibilities, faulty GPUs, and similar hardware or setup problems.

See:

  • https://github.com/pandegroup/openmm/issues/1728
  • https://github.com/pandegroup/openmm/issues/820

I think I included a patch that sets DisablePmeStream by default, but if not, you can specify it manually in this part of your input:

platform_properties:
    Precision: mixed
    DisablePmeStream: 'true'

(The quotes are needed because OpenMM expects a str, not a bool.)
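For anyone driving OpenMM directly rather than through an ommprotocol input file, the same idea can be sketched in Python. This is only an illustration under stated assumptions: `stringify_platform_properties` and `build_simulation` are hypothetical helper names, not part of ommprotocol or OpenMM, and the property names are the ones from the YAML above.

```python
def stringify_platform_properties(props):
    """Return a copy of props in which every value is a str.

    Hypothetical helper: OpenMM platform properties are passed as
    strings, so booleans are mapped to 'true'/'false' here.
    """
    out = {}
    for key, value in props.items():
        if isinstance(value, bool):
            out[key] = 'true' if value else 'false'
        else:
            out[key] = str(value)
    return out


def build_simulation(topology, system, integrator, props):
    """Create a Simulation on the CUDA platform with the given properties.

    Assumes an OpenMM install with CUDA support; not called in this sketch.
    """
    try:
        from openmm import Platform
        from openmm.app import Simulation
    except ImportError:  # older releases such as 7.1 use the simtk namespace
        from simtk.openmm import Platform
        from simtk.openmm.app import Simulation
    platform = Platform.getPlatformByName('CUDA')
    return Simulation(topology, system, integrator, platform,
                      stringify_platform_properties(props))


properties = stringify_platform_properties(
    {'Precision': 'mixed', 'DisablePmeStream': True})
print(properties)
```

The conversion step is kept separate so it can be checked on a machine without a GPU.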

jaimergp • Jun 13 '19 15:06

Thank you!

To be sure, I will include DisablePmeStream: 'true' and see if it helps.

bbrouwr • Jun 17 '19 10:06

Hi,

Unfortunately, I am reproducing this old error while running MD simulations using OpenMM on Summit. Minimization finishes successfully, but the simulation does not start. I installed OpenMM using the instructions at https://github.com/inspiremd/conda-recipes-summit. My error message looks like this:

min step 0: 10.0
min end: 0.0
eq1 start: 1.0
Traceback (most recent call last):
  File "../sim.py", line 75, in <module>
    simulation.step(1500)
  File "/ccs/home/apbhati/miniconda/envs/openmm/lib/python3.7/site-packages/simtk/openmm/app/simulation.py", line 132, in step
    self._simulate(endStep=self.currentStep+steps)
  File "/ccs/home/apbhati/miniconda/envs/openmm/lib/python3.7/site-packages/simtk/openmm/app/simulation.py", line 197, in _simulate
    self.integrator.step(10) # Only take 10 steps at a time, to give Python more chances to respond to a control-c.
  File "/ccs/home/apbhati/miniconda/envs/openmm/lib/python3.7/site-packages/simtk/openmm/openmm.py", line 9475, in step
    return _openmm.LangevinIntegrator_step(self, steps)
Exception: Error invoking kernel: CUDA_ERROR_ILLEGAL_ADDRESS (700)
terminate called after throwing an instance of 'OpenMM::OpenMMException'
  what():  Error deleting array bondParams: CUDA_ERROR_ILLEGAL_ADDRESS (700)
Aborted (core dumped)

This is the first time I am getting such an error while trying to add harmonic restraints using CustomExternalForce. Interestingly, I do not get this error when I comment out all the lines related to harmonic restraints using the exact same input files. I am not sure what is causing it and how to get rid of it. I tried including DisablePmeStream: 'true' in my input script, but that does not help. Can anyone please help me with this?
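For context, harmonic position restraints with CustomExternalForce are commonly set up along the following lines. This is a generic sketch, not the attached script, and it is not claimed to fix the error: `check_restraints` and `add_harmonic_restraints` are hypothetical helper names, the force constant is an arbitrary placeholder, and positions are assumed to be plain (x, y, z) tuples in nm.

```python
import math


def check_restraints(indices, positions_nm, n_particles):
    """Raise ValueError if a restrained index or reference coordinate looks wrong."""
    for i in indices:
        if not 0 <= i < n_particles:
            raise ValueError(f"particle index {i} outside 0..{n_particles - 1}")
        if not all(math.isfinite(c) for c in positions_nm[i]):
            raise ValueError(f"non-finite reference position for particle {i}")


def add_harmonic_restraints(system, positions_nm, indices, k=1000.0):
    """Attach a harmonic position restraint to `system`.

    Assumes this is called BEFORE the Simulation/Context is created,
    so that the new force is part of the context.
    """
    try:
        from openmm import CustomExternalForce
    except ImportError:  # older releases use the simtk namespace
        from simtk.openmm import CustomExternalForce
    check_restraints(indices, positions_nm, system.getNumParticles())
    force = CustomExternalForce("0.5*k*((x-x0)^2+(y-y0)^2+(z-z0)^2)")
    force.addGlobalParameter("k", k)  # placeholder value, kJ/mol/nm^2
    for name in ("x0", "y0", "z0"):
        force.addPerParticleParameter(name)
    for i in indices:
        force.addParticle(i, list(positions_nm[i]))
    system.addForce(force)
    return force
```

The input check runs without a GPU, which makes it a cheap way to rule out bad indices or NaN reference coordinates before looking at drivers or hardware.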

I have attached my script. Thank you.

Best, Agastya

Attachment: openmm.zip

apbhati1 • May 02 '20 12:05