CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while running locally
Expected Behavior
Expected the model to run correctly locally.
Current Behavior
Yesterday I updated using the update script from the LocalColabFold GitHub repository, and since then I can't run any models to completion. After many hours of trying different things, I keep getting this error. I'm not sure what changed.
Steps to Reproduce (for bugs)
I'm running ColabFold on a dimer with fewer than 1200 AA in total. I have an RTX 2070 and cudatoolkit 11.7. The NVIDIA drivers should be up to date as well.
ColabFold Output (for bugs)
2022-06-21 21:56:13,838 Running model_3
2022-06-21 22:13:52,690 model_3 took 1058.9s (3 recycles) with pLDDT 60.7, ptmscore 0.405 and iptm 0.617
2022-06-21 22:18:56,407 Relaxation took 256.4s
2022-06-21 22:18:56,408 Running model_4
2022-06-21 22:19:18.538992: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2141] Execution of replica 0 failed: INTERNAL: Failed to launch CUDA kernel: fusion_99 with block dimensions: 256x1x1 and grid dimensions: 18907x1x1: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
File "/home/williamubuntu/colabfold_batch/colabfold-conda/bin/colabfold_batch", line 8, in
_PyModule_ClearDict PyImport_Cleanup Py_FinalizeEx
_Py_UnixMain
__libc_start_main
*** End stack trace ***
2022-06-21 22:19:19.339500: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_executable.cc:284] Check failed: pair.first->SynchronizeAllActivity() Aborted (core dumped)
Your Environment
Linux, Ubuntu 20.04.
Let me know if I'm missing any details you might need.
I have also encountered this issue, but in my case it is not reproducible. The Python packages and drivers you are using do not seem to be the problem, since the calculations work fine up to model_3. In my case, the calculation completed without any issues when I restarted it. Have you tried running the same prediction again?
Can you share the input sequence and command line parameters you used (if any)?
Here are the command line parameters I used. My error is very reproducible: I've run it many times now and always get the same result.
colabfold_batch --amber --templates --num-recycle 3 --num-models 3 --use-gpu-relax --model-type AlphaFold2-multimer-v2 fasta_files/EAAAK5.fasta ./EAAAK5
I've tried with and without --num-models as well.
Does my input sequence drastically affect the ability of the model to generate a result?
Sharing the input sequence would allow us to try to reproduce the issue.
Some time ago I encountered a sequence that crashed on an RTX 5000 but not on a V100.
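If the sequence itself can't be shared, the chain lengths alone may still help others pick a comparable test case. Below is a minimal sketch, assuming the FASTA follows the ColabFold convention of separating the chains of a complex with ':'. The function name `chain_lengths` is hypothetical and not part of ColabFold.

```python
from typing import Dict, List


def chain_lengths(fasta_text: str) -> Dict[str, List[int]]:
    """Map each FASTA record name to the lengths of its ':'-separated chains.

    Assumption: chains of a complex are joined with ':' inside one record.
    Records that share a name overwrite each other in this sketch.
    """
    records: Dict[str, List[str]] = {}
    name = None
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            tokens = line[1:].split()
            name = tokens[0] if tokens else "unnamed"
            records[name] = []
        elif name is not None:
            records[name].append(line)

    lengths: Dict[str, List[int]] = {}
    for rec_name, parts in records.items():
        sequence = "".join(parts)  # rejoin sequences wrapped over several lines
        lengths[rec_name] = [len(chain) for chain in sequence.split(":") if chain]
    return lengths
```

Calling `chain_lengths(open("fasta_files/EAAAK5.fasta").read())` on the file from the command above would give one list of chain lengths per record. Summing a list gives the total residue count, which can be posted in place of the sequence.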
I am getting the same error. Interestingly, everything works fine for monomer predictions, but as soon as I try multimer prediction the error occurs. Running on 8 cores, 16 gigs of memory per core and 1x P100. Error is also very reproducible.
Have you been able to fix the issue?