
On docker after some time: No compatible CUDA device is available

Open Wolf-Z opened this issue 3 years ago • 11 comments

The Environment:

Running on Docker 20.10.12-3 on RHEL 7.9 with 4x NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] cards. CUDA version on this host is 11.4.

I also get this version reported when running:

  docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

Seems fine to me?

The Problem:

So I am running AlphaFold (version from the beginning of March 2022) like this:

python3 /software/alphafold2/alphafold/docker/run_docker.py \
--fasta_paths=/software/db/alphafold2/example_input/T1050.fasta \
--max_template_date=2020-05-14 --data_dir=/software/db/alphafold2/alldata

This is using python3 from a venv installed like this:

    python3 -m venv $envPath
    source $envPath/bin/activate
    pip3 install --upgrade pip
    pip3 install -r /software/alphafold2/alphafold/docker/requirements.txt

After quite some hours, this comes up exactly 100 times:

I0316 18:50:27.090867 140103983331136 run_docker.py:247] I0316 17:50:27.090230 139823954499392 amber_minimize.py:408] Minimizing protein, attempt 100 of 100.
I0316 18:50:29.051809 140103983331136 run_docker.py:247] I0316 17:50:29.049898 139823954499392 amber_minimize.py:69] Restraining 6213 / 12189 particles.
I0316 18:50:29.170636 140103983331136 run_docker.py:247] I0316 17:50:29.170062 139823954499392 amber_minimize.py:417] No compatible CUDA device is available

... ultimately leading to this:

I0316 18:50:29.207758 140103983331136 run_docker.py:247] Traceback (most recent call last):
I0316 18:50:29.207883 140103983331136 run_docker.py:247] File "/app/alphafold/run_alphafold.py", line 445, in <module>
I0316 18:50:29.207957 140103983331136 run_docker.py:247] app.run(main)
I0316 18:50:29.208023 140103983331136 run_docker.py:247] File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 312, in run
I0316 18:50:29.208087 140103983331136 run_docker.py:247] _run_main(main, args)
I0316 18:50:29.208148 140103983331136 run_docker.py:247] File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
I0316 18:50:29.208209 140103983331136 run_docker.py:247] sys.exit(main(argv))
I0316 18:50:29.208268 140103983331136 run_docker.py:247] File "/app/alphafold/run_alphafold.py", line 429, in main
I0316 18:50:29.208327 140103983331136 run_docker.py:247] is_prokaryote=is_prokaryote)
I0316 18:50:29.208385 140103983331136 run_docker.py:247] File "/app/alphafold/run_alphafold.py", line 250, in predict_structure
I0316 18:50:29.208454 140103983331136 run_docker.py:247] relaxed_pdb_str, _, _ = amber_relaxer.process(prot=unrelaxed_protein)
I0316 18:50:29.208513 140103983331136 run_docker.py:247] File "/app/alphafold/alphafold/relax/relax.py", line 66, in process
I0316 18:50:29.208570 140103983331136 run_docker.py:247] use_gpu=self._use_gpu)
I0316 18:50:29.208626 140103983331136 run_docker.py:247] File "/app/alphafold/alphafold/relax/amber_minimize.py", line 483, in run_pipeline
I0316 18:50:29.208684 140103983331136 run_docker.py:247] use_gpu=use_gpu)
I0316 18:50:29.208741 140103983331136 run_docker.py:247] File "/app/alphafold/alphafold/relax/amber_minimize.py", line 419, in _run_one_iteration
I0316 18:50:29.208799 140103983331136 run_docker.py:247] raise ValueError(f"Minimization failed after {max_attempts} attempts.")
I0316 18:50:29.208857 140103983331136 run_docker.py:247] ValueError: Minimization failed after 100 attempts.

The Questions:

  • What does AlphaFold (Docker) consider a "compatible CUDA device"?
  • As I believe I have one - what can I do to get it recognized by AlphaFold running in Docker?

Wolf-Z avatar Mar 17 '22 11:03 Wolf-Z

Hi Wolf-Z,

I saw the same exception thrown. Is the GPU set to exclusive mode? The algorithm may try to create multiple GPU contexts (someone please confirm), but a GPU in exclusive mode prevents creating multiple contexts. https://github.com/openmm/openmm/issues/3518

Question to AlphaFold2 developers

Is it possible to disable the creation of multiple GPU contexts in amber_minimize.py?

jarunan avatar Mar 23 '22 08:03 jarunan

Following - same error - Nvidia has already been set to exclusive mode

arashnh11 avatar Apr 12 '22 03:04 arashnh11

Enabling MPS helps. (nvidia-cuda-mps-control -d) https://docs.nvidia.com/deploy/mps/index.html

jarunan avatar Apr 12 '22 05:04 jarunan

@jarunan Thanks! That looks like another way of changing CUDA's operation mode back to the default. nvidia-smi -c 0 fixed it for me. @Augustin-Zidek You may want to mention in the requirements that the GPU compute mode must be set to default, or that MPS should be enabled, as @jarunan mentioned. I did not see this behavior in previous versions of AlphaFold.
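
For anyone who wants to double-check the compute mode from Python instead of reading nvidia-smi output, here is a minimal sketch using the pynvml bindings (the nvidia-ml-py package; it is not part of the AlphaFold setup, so it would have to be pip-installed on the host):

    import pynvml

    # Readable names for the NVML compute-mode constants.
    MODE_NAMES = {
        pynvml.NVML_COMPUTEMODE_DEFAULT: "Default",
        pynvml.NVML_COMPUTEMODE_EXCLUSIVE_THREAD: "Exclusive Thread",
        pynvml.NVML_COMPUTEMODE_PROHIBITED: "Prohibited",
        pynvml.NVML_COMPUTEMODE_EXCLUSIVE_PROCESS: "Exclusive Process",
    }

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older pynvml versions return bytes
                name = name.decode()
            mode = pynvml.nvmlDeviceGetComputeMode(handle)
            print(f"GPU {i}: {name} - compute mode {MODE_NAMES.get(mode, mode)}")
    finally:
        pynvml.nvmlShutdown()

Every card should report "Default" (or MPS should be running) before starting a prediction.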

arashnh11 avatar Apr 12 '22 09:04 arashnh11

Update

For some reason, without changing anything (well - I had run "nvidia-smi -c 0" before starting, but it told me that all 4 GPUs were already in default mode) and while only poking around inside the running container in parallel to look for hints, the example calculation ran fine as a user. So this whole issue may just be a huge red herring - and I am sorry if that's the case. The only idea I have is that we have a systemd-tmpfiles job running which, during my first try, might have deleted the /tmp/alphafold folder, which seems to be crucial for the AlphaFold container to work.

Original Comment

By the way, what I've realized in the meantime: I don't get the error when running the container as root - only as a user.

But running this as user is still my main aim.

Please note: when running the test command (see the original issue), the following "sub-processes" from the container run fine:

  • /usr/bin/jackhmmer (2 times)
  • /usr/bin/hhsearch
  • /usr/bin/hhblits
  • maybe more - I started a new calculation a few hours ago and hhblits really takes its time

Anyway - thanks for all the suggestions so far.

  • nvidia-smi -c 0
    • didn't help - every GPU (4 GPUs are installed in the machine) was already in default mode
  • nvidia-cuda-mps-control -d
    • leads to a crash right after the "TPU driver" check:
      • I0412 [...] run_docker.py:247] Fatal Python error: Segmentation fault

The full error in "nvidia-cuda-mps-control -d" mode is (with all time stamps and run_docker.py prefixes removed):

Current thread 0x00007fd41dac3740 (most recent call first):
"/opt/conda/lib/python3.7/site-packages/jaxlib/xla_client.py", line 98 in make_gpu_client
"/opt/conda/lib/python3.7/site-packages/jax/lib/xla_bridge.py", line 192 in backends
"/opt/conda/lib/python3.7/site-packages/jax/lib/xla_bridge.py", line 228 in get_backend
"/opt/conda/lib/python3.7/site-packages/jax/lib/xla_bridge.py", line 249 in get_device_backen
"/opt/conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 138 in _device_put_arr
"/opt/conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 133 in device_put
"/opt/conda/lib/python3.7/site-packages/jax/_src/lax/lax.py", line 1596 in _device_put_raw
"/opt/conda/lib/python3.7/site-packages/jax/_src/numpy/lax_numpy.py", line 2996 in array
"/app/alphafold/alphafold/model/utils.py", line 80 in flat_params_to_haiku
"/app/alphafold/alphafold/model/data.py", line 39 in get_model_haiku_params
"/app/alphafold/run_alphafold.py", line 393 in main
"/opt/conda/lib/python3.7/site-packages/absl/app.py", line 258 in _run_main
"/opt/conda/lib/python3.7/site-packages/absl/app.py", line 312 in run
"/app/alphafold/run_alphafold.py", line 445 in <module>
/app/run_alphafold.sh: line 3:     9 Segmentation fault      python /app/alphafold/run_alphafold.py "$@"

My last thought (for today):

There were suggestions that OpenMM is the source of the problem. To check whether OpenMM is working fine, the advice is: run its included test cases. Unfortunately, building those test cases from inside the container is a lot of work because of missing compilers (obviously), a read-only Conda env, and include paths and lib paths scattered all over /opt - not really something I'd like to do. All I want to check is: can OpenMM see my CUDA devices, where does it look for them, and are there maybe some permission problems?

Are there any suggestions on how to check OpenMM without building its test cases from within the container?
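
What I have in mind is something along these lines, run with the container's Python (a minimal sketch, assuming the simtk.openmm import path that amber_minimize.py itself uses):

    from simtk import openmm

    # Which platforms did this OpenMM build manage to load (Reference, CPU, CUDA, OpenCL)?
    for i in range(openmm.Platform.getNumPlatforms()):
        print("available platform:", openmm.Platform.getPlatform(i).getName())

    # If the CUDA plugin could not be loaded (driver mismatch, missing library,
    # permissions), the reason should show up here instead of in the list above.
    print("plugin load failures:", openmm.Platform.getPluginLoadFailures())

    # This raises an exception when no usable CUDA platform is visible.
    cuda = openmm.Platform.getPlatformByName("CUDA")
    print("CUDA platform loaded, estimated speed factor:", cuda.getSpeed())

That would at least tell me whether the CUDA plugin loads at all under my user.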

Wolf-Z avatar Apr 12 '22 15:04 Wolf-Z

Hi, I ran into the same issue; in case others read this, maybe it helps someone.

the error "No compatible CUDA device is available" (see orgininal post #403 ) happened to me when I on the same workstation started two multimer predictions.

Apparently what's happening is that the program uses only the first of my 3 GPUs. When I start 2 predictions, both seem to compete for the same GPU, and as a result I got the "No compatible CUDA device is available" error.

When I add the parameter --gpu_devices=0 to the first prediction and --gpu_devices=1 to the second, both seem to be running without the above error (see the example invocations below). So apparently one cannot have more than one job per GPU, and the GPU should be specified if one wants to run several predictions at the same time. As others have also reported, I did not manage to have more than one GPU utilized per job, so --gpu_devices=0,1 is the same as --gpu_devices=0 - it looks like the second GPU is simply ignored.
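
For illustration (reusing the command from the original post; the second FASTA path is just a placeholder), the two invocations would look like this:

    # first prediction, pinned to GPU 0
    python3 /software/alphafold2/alphafold/docker/run_docker.py \
        --fasta_paths=/software/db/alphafold2/example_input/T1050.fasta \
        --max_template_date=2020-05-14 --data_dir=/software/db/alphafold2/alldata \
        --gpu_devices=0

    # second prediction, pinned to GPU 1
    python3 /software/alphafold2/alphafold/docker/run_docker.py \
        --fasta_paths=/software/db/alphafold2/example_input/second_target.fasta \
        --max_template_date=2020-05-14 --data_dir=/software/db/alphafold2/alldata \
        --gpu_devices=1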

132nd-Entropy avatar Sep 29 '22 14:09 132nd-Entropy

I've been having the same problem and I don't run two predictions at the same time. Anyone have any idea how to fix it?

akenginorhun avatar Feb 20 '23 21:02 akenginorhun

I think the root cause is in amber_minimize.py in the _openmm_minimize() function.

OpenMM does allow multiple GPUs, but you have to define properties["DeviceIndex"] to explicitly enumerate the devices available. See: http://docs.openmm.org/latest/userguide/library/04_platform_specifics.html#cuda-platform

However, _openmm_minimize() in v2.3.1 does not do so after defining the Platform. This setting is available in the C++ API, but does not seem to be in the Python API.

Wrote a bit more about this in my issue in alphafold_singularity that is auto-linked above.
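
For reference, here is a sketch of how the device could be pinned if the platformProperties argument of openmm.app.Simulation behaves like its C++ counterpart (untested; the variable names are only loosely modeled on amber_minimize.py and this is not a patch):

    from simtk import openmm
    from simtk.openmm import app as openmm_app

    def make_simulation(topology, system, integrator, use_gpu, device_index="0"):
        # Build a Simulation pinned to a single CUDA device via the DeviceIndex property.
        if use_gpu:
            platform = openmm.Platform.getPlatformByName("CUDA")
            # "DeviceIndex" is the CUDA platform property from the OpenMM docs
            # linked above; a comma-separated list would enumerate several GPUs.
            properties = {"DeviceIndex": device_index}
        else:
            platform = openmm.Platform.getPlatformByName("CPU")
            properties = {}
        return openmm_app.Simulation(topology, system, integrator, platform, properties)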

prehensilecode avatar Feb 25 '23 01:02 prehensilecode

Enabling MPS helps. (nvidia-cuda-mps-control -d) https://docs.nvidia.com/deploy/mps/index.html

This works!~

FeliciaJiangBio avatar Oct 25 '23 13:10 FeliciaJiangBio