
No CUDA device is available - Alphafold2 Multimer Folding Issue

Open akenginorhun opened this issue 2 years ago • 3 comments

Hello everyone,

I've been encountering this "No CUDA device is available" problem whenever I try to fold a multimer. I had folded a multimer before and it worked fine, but suddenly I'm hitting this issue. With monomers it's different: I've folded a lot of them and only ran into the same error once, so I assumed it was something unimportant or temporary. But now I see the error every time I try to fold a multimer. I've checked the existing issue (#403), but apparently they couldn't fix it either. Can anyone help me with this?

I0220 01:04:02.916177 140340028417856 run_docker.py:255] I0220 09:04:02.915439 140290817177408 amber_minimize.py:408] Minimizing protein, attempt 1 of 100.
I0220 01:04:05.207375 140340028417856 run_docker.py:255] I0220 09:04:05.206627 140290817177408 amber_minimize.py:69] Restraining 7216 / 14411 particles.
I0220 01:04:05.454775 140340028417856 run_docker.py:255] I0220 09:04:05.453937 140290817177408 amber_minimize.py:418] No compatible CUDA device is available
...
(the minimization attempt repeats, 100 times in total)
...
I0220 01:08:34.041026 140340028417856 run_docker.py:255] Traceback (most recent call last):
I0220 01:08:34.041428 140340028417856 run_docker.py:255] File "/app/alphafold/run_alphafold.py", line 432, in <module>
I0220 01:08:34.041648 140340028417856 run_docker.py:255] app.run(main)
I0220 01:08:34.041849 140340028417856 run_docker.py:255] File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
I0220 01:08:34.042044 140340028417856 run_docker.py:255] _run_main(main, args)
I0220 01:08:34.042233 140340028417856 run_docker.py:255] File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
I0220 01:08:34.042421 140340028417856 run_docker.py:255] sys.exit(main(argv))
I0220 01:08:34.042604 140340028417856 run_docker.py:255] File "/app/alphafold/run_alphafold.py", line 408, in main
I0220 01:08:34.042788 140340028417856 run_docker.py:255] predict_structure(
I0220 01:08:34.042969 140340028417856 run_docker.py:255] File "/app/alphafold/run_alphafold.py", line 243, in predict_structure
I0220 01:08:34.043175 140340028417856 run_docker.py:255] relaxed_pdb_str, _, violations = amber_relaxer.process(
I0220 01:08:34.043358 140340028417856 run_docker.py:255] File "/app/alphafold/alphafold/relax/relax.py", line 62, in process
I0220 01:08:34.043538 140340028417856 run_docker.py:255] out = amber_minimize.run_pipeline(
I0220 01:08:34.043717 140340028417856 run_docker.py:255] File "/app/alphafold/alphafold/relax/amber_minimize.py", line 476, in run_pipeline
I0220 01:08:34.043897 140340028417856 run_docker.py:255] ret = _run_one_iteration(
I0220 01:08:34.044077 140340028417856 run_docker.py:255] File "/app/alphafold/alphafold/relax/amber_minimize.py", line 420, in _run_one_iteration
I0220 01:08:34.044255 140340028417856 run_docker.py:255] raise ValueError(f"Minimization failed after {max_attempts} attempts.")
I0220 01:08:34.044433 140340028417856 run_docker.py:255] ValueError: Minimization failed after 100 attempts.
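
In case it helps with debugging: OpenMM, which amber_minimize.py wraps for the relax step, raises "No compatible CUDA device is available" when it cannot create a CUDA context. Below is a minimal check one could run inside the container. This is a sketch under assumptions: the image must expose Python with OpenMM importable as openmm (older AlphaFold images may need simtk.openmm instead).

# Quick OpenMM CUDA check (run inside the AlphaFold container).
# Assumes OpenMM imports as `openmm`; older images may need `simtk.openmm`.
from openmm import Context, Platform, System, VerletIntegrator

# List the platforms OpenMM was built with (Reference, CPU, CUDA, OpenCL, ...).
for i in range(Platform.getNumPlatforms()):
    print(Platform.getPlatform(i).getName())

# The CUDA platform can be registered even when no usable device exists;
# creating a Context is what actually probes the device, and it fails with
# the same "No compatible CUDA device is available" message seen above.
system = System()
system.addParticle(1.0)  # a single dummy particle is enough
try:
    Context(system, VerletIntegrator(0.001), Platform.getPlatformByName('CUDA'))
    print('CUDA context created successfully')
except Exception as e:
    print(f'CUDA unavailable: {e}')

If the platform list shows CUDA but the context creation fails, the container can see the OpenMM CUDA plugin but not the device itself, which points at the container runtime or driver rather than the AlphaFold code.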

akenginorhun avatar Feb 20 '23 21:02 akenginorhun

I got the same error using AlphaFold in Docker. Monomer folding worked, but when I tried multimer folding I got the "No compatible CUDA" error, followed by the amber_minimize failure and the related minimization error.

GiammaFer75 avatar Jun 01 '23 12:06 GiammaFer75

I am also having the same issue. Is there a solution yet?

alanlamsiu avatar Aug 22 '23 04:08 alanlamsiu

I have the same issue when I try to fold two or more multimers in a row. The first multimer folds and relaxes without problems, but the second one fails after 100 attempts with the error "No compatible CUDA device is available". I am using an RTX 4090 with CUDA 11.8 under Ubuntu 22.04 LTS, running AlphaFold with Docker.

The first multimer finished without errors, but with some warnings:

I0416 21:45:30.932354 140119761043456 run_docker.py:258] 2024-04-16 13:45:30.932037: W external/xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:523] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.

I0416 21:45:30.932449 140119761043456 run_docker.py:258] Searched for CUDA in the following directories:
I0416 21:45:30.932476 140119761043456 run_docker.py:258] /usr/local/cuda-11.8
I0416 21:45:30.932499 140119761043456 run_docker.py:258] /usr/local/cuda
I0416 21:45:30.932528 140119761043456 run_docker.py:258] .
I0416 21:45:30.932548 140119761043456 run_docker.py:258] You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
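
The warning itself points at a possible mitigation: setting XLA_FLAGS before JAX initializes. Here is a minimal sketch of what that could look like; the /usr/local/cuda-11.8 path is an assumption and should be whichever directory actually contains nvvm/libdevice inside your container. Note this addresses the libdevice warning from the JAX inference step, not necessarily the relax failure below.

# Hypothetical illustration: point XLA at the CUDA toolkit before JAX
# initializes. The path is an assumption -- use whichever directory
# actually contains nvvm/libdevice inside your container.
import os
os.environ['XLA_FLAGS'] = '--xla_gpu_cuda_data_dir=/usr/local/cuda-11.8'

import jax  # import after setting the flag so XLA picks it up
print(jax.devices())  # should list a GPU device if CUDA is visible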

The second multimer failed in the relax step:

I0417 07:00:26.561723 140119761043456 run_docker.py:258] I0416 23:00:26.561453 138854821132096 amber_minimize.py:408] Minimizing protein, attempt 100 of 100.

I0417 07:00:27.338001 140119761043456 run_docker.py:258] I0416 23:00:27.337576 138854821132096 amber_minimize.py:69] Restraining 11258 / 22712 particles.
I0417 07:00:27.445065 140119761043456 run_docker.py:258] I0416 23:00:27.444624 138854821132096 amber_minimize.py:418] No compatible CUDA device is available
I0417 07:00:27.468284 140119761043456 run_docker.py:258] Traceback (most recent call last):
I0417 07:00:27.468345 140119761043456 run_docker.py:258] File "/app/alphafold/run_alphafold.py", line 570, in <module>
I0417 07:00:27.470041 140119761043456 run_docker.py:258] app.run(main)
I0417 07:00:27.470084 140119761043456 run_docker.py:258] File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 312, in run
I0417 07:00:27.470112 140119761043456 run_docker.py:258] _run_main(main, args)
I0417 07:00:27.470131 140119761043456 run_docker.py:258] File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 258, in _run_main
I0417 07:00:27.470149 140119761043456 run_docker.py:258] sys.exit(main(argv))
I0417 07:00:27.470167 140119761043456 run_docker.py:258] File "/app/alphafold/run_alphafold.py", line 543, in main
I0417 07:00:27.470187 140119761043456 run_docker.py:258] predict_structure(
I0417 07:00:27.470204 140119761043456 run_docker.py:258] File "/app/alphafold/run_alphafold.py", line 361, in predict_structure
I0417 07:00:27.470221 140119761043456 run_docker.py:258] relaxed_pdb_str, _, violations = amber_relaxer.process(
I0417 07:00:27.470237 140119761043456 run_docker.py:258] File "/app/alphafold/alphafold/relax/relax.py", line 62, in process
I0417 07:00:27.470372 140119761043456 run_docker.py:258] out = amber_minimize.run_pipeline(
I0417 07:00:27.470417 140119761043456 run_docker.py:258] File "/app/alphafold/alphafold/relax/amber_minimize.py", line 476, in run_pipeline
I0417 07:00:27.470456 140119761043456 run_docker.py:258] ret = _run_one_iteration(
I0417 07:00:27.470475 140119761043456 run_docker.py:258] File "/app/alphafold/alphafold/relax/amber_minimize.py", line 420, in _run_one_iteration
I0417 07:00:27.470492 140119761043456 run_docker.py:258] raise ValueError(f"Minimization failed after {max_attempts} attempts.")
I0417 07:00:27.470656 140119761043456 run_docker.py:258] ValueError: Minimization failed after 100 attempts.

Besides the errors, the libdevice warning quoted above appeared only in the first run, not in the second. My intuition is that the first run somehow used CUDA successfully, while the second run skipped some initialization and missed the CUDA device.
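
One way to test that intuition would be to check what the driver reports between the two runs. A rough sketch using NVML follows; the nvidia-ml-py (pynvml) package is an assumption on my part, as it is not part of the stock AlphaFold setup.

# Rough sketch: confirm the driver still reports a usable device between
# consecutive AlphaFold runs. Requires the nvidia-ml-py package (pynvml),
# which is an assumption -- it is not part of the stock AlphaFold setup.
import pynvml

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
print(f'{count} CUDA device(s) visible to the driver')
for i in range(count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f'{i}: {name}, {mem.used / 2**20:.0f} MiB used of {mem.total / 2**20:.0f} MiB')
pynvml.nvmlShutdown()

If the device still shows up here after the first run but OpenMM cannot create a CUDA context, the problem is more likely in the container runtime than in the GPU itself.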

I would appreciate it if anyone could provide a solution or a clue for troubleshooting. Thanks!

jumpintwo avatar Apr 17 '24 03:04 jumpintwo