
CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered while running locally

Open wlawler45 opened this issue 3 years ago • 6 comments

Expected Behavior

Expected model to run correctly on lo

Current Behavior

After updating yesterday using the update script from the LocalColabFold install instructions on GitHub, I suddenly can't run any models to completion anymore. After many hours of trying different things, I keep getting this error. I'm not sure what changed.

Steps to Reproduce (for bugs)

I'm running ColabFold on a dimer with fewer than 1200 AA total. I have an RTX 2070 with cudatoolkit 11.7, and the NVIDIA drivers should be up to date as well.

ColabFold Output (for bugs)

2022-06-21 21:56:13,838 Running model_3
2022-06-21 22:13:52,690 model_3 took 1058.9s (3 recycles) with pLDDT 60.7, ptmscore 0.405 and iptm 0.617
2022-06-21 22:18:56,407 Relaxation took 256.4s
2022-06-21 22:18:56,408 Running model_4
2022-06-21 22:19:18.538992: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2141] Execution of replica 0 failed: INTERNAL: Failed to launch CUDA kernel: fusion_99 with block dimensions: 256x1x1 and grid dimensions: 18907x1x1: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
  File "/home/williamubuntu/colabfold_batch/colabfold-conda/bin/colabfold_batch", line 8, in <module>
    sys.exit(main())
  File "/home/williamubuntu/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/colabfold/batch.py", line 1752, in main
    stop_at_score_below=args.stop_at_score_below,
  File "/home/williamubuntu/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/colabfold/batch.py", line 1385, in run
    random_seed=random_seed,
  File "/home/williamubuntu/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/colabfold/batch.py", line 359, in predict_structure
    prediction_result, recycles = model_runner.predict(input_features)
  File "/home/williamubuntu/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/alphafold/model/model.py", line 189, in predict
    result, _ = self.apply(self.params, key, sub_feat)
ValueError: INTERNAL: Failed to launch CUDA kernel: fusion_99 with block dimensions: 256x1x1 and grid dimensions: 18907x1x1: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2022-06-21 22:19:19.339355: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:1047] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered :: *** Begin stack trace ***
    _PyModule_ClearDict
    PyImport_Cleanup
    Py_FinalizeEx
    _Py_UnixMain
    __libc_start_main
*** End stack trace ***

2022-06-21 22:19:19.339500: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_executable.cc:284] Check failed: pair.first->SynchronizeAllActivity()
Aborted (core dumped)
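For anyone triaging a similar log, here is a minimal sketch that pulls out which model was running when the first CUDA error appeared, along with the kernel name and error code. The function name `summarize_failure` and the regular expressions are illustrative assumptions based only on the log lines quoted in this issue; other ColabFold or JAX versions may format their output differently.

```python
import re
from typing import Dict, Optional

# Patterns are assumptions derived from the log quoted in this issue.
RUNNING_RE = re.compile(r"Running (model_\d+)")
CUDA_ERR_RE = re.compile(r"(CUDA_ERROR_[A-Z_]+)")
KERNEL_RE = re.compile(r"Failed to launch CUDA kernel: (\S+)")


def summarize_failure(log_text: str) -> Optional[Dict[str, Optional[str]]]:
    """Describe the first CUDA error in log_text, or return None if there is none."""
    err = CUDA_ERR_RE.search(log_text)
    if err is None:
        return None
    # The failing model is the last "Running model_N" seen before the error.
    started = RUNNING_RE.findall(log_text[: err.start()])
    kernel = KERNEL_RE.search(log_text)
    return {
        "model": started[-1] if started else None,
        "kernel": kernel.group(1) if kernel else None,
        "error": err.group(1),
    }
```

Applied to the output above, this should point at model_4 and the `fusion_99` kernel, which makes it easier to compare failures across runs.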

Your Environment

Linux, Ubuntu 20.04

Let me know if I'm missing any details you might need.

wlawler45 (Jun 22 '22 02:06)
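When reporting an issue like this, the installed package versions are often among the missing details. The sketch below collects them with the standard library only. The function name `collect_environment` and the default package list (`jax`, `jaxlib`, `colabfold`) are assumptions for illustration; adjust them to whatever is installed in the environment being reported.

```python
import platform

try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    # Python 3.7 (as in the traceback in this issue) needs the
    # importlib_metadata backport package instead.
    import importlib_metadata as metadata


def collect_environment(packages=("jax", "jaxlib", "colabfold")):
    """Return a dict with platform details and installed package versions."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info
```

Printing the returned dict from inside the same conda environment that runs `colabfold_batch` gives a compact summary to paste into a report. It does not cover the NVIDIA driver version, which has to be checked separately.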

I have also encountered this issue, but in my case it is not reproducible. The Python packages and drivers you are using do not seem to be the problem, since the calculations work fine up to model_3. In my case, the calculation completed without any issues when I restarted it. Have you tried running the same prediction again?

YoshitakaMo (Jun 22 '22 05:06)
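If the failure is intermittent, as in the comment above, rerunning the same command may be enough. Here is a minimal sketch of a retry wrapper. The name `run_with_retries` and the default of three attempts are assumptions, not part of ColabFold, and a retry will not help when the crash is deterministic.

```python
import subprocess
import sys
from typing import List


def run_with_retries(cmd: List[str], max_attempts: int = 3) -> int:
    """Run cmd until it exits with code 0 or max_attempts is reached.

    Returns the 1-based number of the attempt that succeeded,
    or 0 if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        # A crash such as "Aborted (core dumped)" also yields a nonzero code.
        print(f"attempt {attempt} exited with code {result.returncode}",
              file=sys.stderr)
    return 0
```

A call might look like `run_with_retries(["colabfold_batch", "--num-recycle", "3", "input.fasta", "out_dir"])`, where `input.fasta` and `out_dir` are placeholder paths.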

Can you share the input sequence and command line parameters you used (if any)?

milot-mirdita (Jun 22 '22 09:06)

Here are the command line parameters I used. My error is very reproducible: I've run it many times now and still get the same result.

colabfold_batch --amber --templates --num-recycle 3 --num-models 3 --use-gpu-relax --model-type AlphaFold2-multimer-v2 fasta_files/EAAAK5.fasta ./EAAAK5

I've tried with and without num_models as well.

wlawler45 (Jun 22 '22 14:06)

Does my input sequence drastically affect the ability of the model to generate a result?

wlawler45 (Jun 22 '22 14:06)

It would allow us to try to reproduce the issue.

Some time ago I encountered a sequence that crashed on an RTX 5000 but not on a V100.

milot-mirdita (Jun 23 '22 09:06)

I am getting the same error. Interestingly, everything works fine for monomer predictions, but as soon as I try a multimer prediction the error occurs. I'm running on 8 cores with 16 GB of memory per core and 1x P100. The error is also very reproducible.

Have you been able to fix the issue?

amlentzsch (Jul 19 '22 23:07)