
Unable to initialize backend 'gpu'

Open caom92 opened this issue 2 years ago • 8 comments

I think I managed to correctly install all the dependencies and requirements as described in the documentation. When I run the command docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi to check whether I can run CUDA within Docker, I get the following output:

Tue Jun  7 21:34:50 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro K600         Off  | 00000000:03:00.0 Off |                  N/A |
| 25%   46C    P8    N/A /  N/A |     57MiB /   981MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro K600         Off  | 00000000:04:00.0 Off |                  N/A |
| 25%   46C    P8    N/A /  N/A |      5MiB /   982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

So it seems Docker can at least see my GPUs. Afterwards, I built the Docker image using docker build -f docker/Dockerfile -t alphafold . and then installed the dependencies using pip3 install -r docker/requirements.txt. But when I try to run the run_docker.py script, I get a log message stating that no GPU could be found. The exact command I ran is:

python3 ../alphafold/docker/run_docker.py \
    --fasta_paths=./7mkr.fasta \
    --model_preset=monomer \
    --max_template_date=2020-05-14 \
    --data_dir=/media/$USER/'My Passport' \
    --output_dir=/home/$USER/Documentos/AlphaFold/7mkr

The script seems to run fine all the way to completion, but I get the following message:

I0607 14:44:25.218496 140038534334272 run_docker.py:255] I0607 21:44:25.217542 139683978950464 tpu_client.py:54] Starting the local TPU driver.
I0607 14:44:25.218731 140038534334272 run_docker.py:255] I0607 21:44:25.218080 139683978950464 xla_bridge.py:212] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
I0607 14:44:25.329375 140038534334272 run_docker.py:255] I0607 21:44:25.328556 139683978950464 xla_bridge.py:212] Unable to initialize backend 'gpu': Internal: no supported devices found for platform CUDA
I0607 14:44:25.329644 140038534334272 run_docker.py:255] I0607 21:44:25.328836 139683978950464 xla_bridge.py:212] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
I0607 14:44:25.329788 140038534334272 run_docker.py:255] W0607 21:44:25.328952 139683978950464 xla_bridge.py:215] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

Also, I can't find the output files generated by AlphaFold; they're not in the folder specified by --output_dir and there's no folder /tmp/alphafold, though I suppose this is a separate issue.
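[Editorial note] As a quick way to hunt for the missing results: AlphaFold writes a per-target subdirectory containing files named ranked_0.pdb through ranked_4.pdb, so one can search the likely locations for them. The paths below are taken from the command above and are assumptions about this particular setup; a minimal stdlib sketch:

```python
# Search the locations mentioned above (assumed paths) for AlphaFold's
# ranked_*.pdb output files. Directories that don't exist are skipped.
from pathlib import Path

candidates = [
    Path("/tmp/alphafold"),                    # default location in some versions
    Path.home() / "Documentos/AlphaFold/7mkr", # the --output_dir used above
]
hits = [p for base in candidates if base.is_dir()
        for p in base.rglob("ranked_*.pdb")]
print(hits or "no AlphaFold outputs found in the checked locations")
```

If nothing turns up, the run most likely never reached the model stage, which would be consistent with the backend-initialization problem rather than a separate issue.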

To check whether the problem is with the GPU drivers or with Docker, I thought of writing a simple CUDA program and running it in a separate Docker image to see if it also fails to find the GPU. But I realized I don't even know which component is emitting this message: of all the components described in the Dockerfile and in the run_docker.py script, I can't tell which one is reporting that it cannot find the GPU. Because of this, I don't know where to start diagnosing the problem. I'd appreciate it if somebody could point me in the right direction.
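[Editorial note] The log lines themselves answer which component is speaking. Each line carries two timestamps because run_docker.py relays the container's stdout through its own logger, wrapping every line in a second prefix; stripping the outer wrapper reveals the inner emitter, xla_bridge.py, which belongs to JAX. So it is JAX's backend discovery, not Docker or the NVIDIA runtime, that prints "Unable to initialize backend". A small parsing sketch, assuming the standard absl log prefix format "I<MMDD> <time> <thread-id> <file>:<line>] ":

```python
# Strip run_docker.py's outer log prefix to recover the inner message
# and identify which module actually emitted it.
import re

line = ("I0607 14:44:25.329375 140038534334272 run_docker.py:255] "
        "I0607 21:44:25.328556 139683978950464 xla_bridge.py:212] "
        "Unable to initialize backend 'gpu': "
        "Internal: no supported devices found for platform CUDA")

# absl log prefix: severity+date, time, thread id, then "<file>:<line>] "
prefix = r"[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d+ \d+ ([\w./]+:\d+)\] "

inner = re.sub(rf"\A{prefix}", "", line)  # drop the run_docker.py wrapper
m = re.match(prefix, inner)               # match the container's own prefix
emitter, message = m.group(1), inner[m.end():]
print(emitter)   # the JAX module doing backend discovery
print(message)   # the bare error text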

caom92 avatar Jun 07 '22 21:06 caom92

I have the same issue.

carrellsj avatar Jun 08 '22 14:06 carrellsj

I've had a similar issue for a long time running on A100 GPUs. AlphaFold sometimes just suddenly breaks; sometimes the NVIDIA Fabric Manager, sometimes Docker, and sometimes both have to be restarted. I have zero idea of the reason why it works for 2-3 weeks and finishes N runs, then suddenly throws a multitude of errors before or during the (N+1)th run.

Edit, since it wasn't clear: one of the most common errors is this one, AlphaFold not finding the GPUs.

masterdesky avatar Jun 30 '22 12:06 masterdesky

I too have this issue.

nmontua avatar Jan 16 '23 15:01 nmontua

I have the same error

nmontua avatar Jan 16 '23 15:01 nmontua

Same on RHEL 8, using a Singularity container:

I0503 09:29:04.720219 23456247981888 xla_bridge.py:353] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker: 
I0503 09:29:04.952224 23456247981888 xla_bridge.py:353] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Interpreter CUDA Host
I0503 09:29:04.952596 23456247981888 xla_bridge.py:353] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
I0503 09:29:04.952670 23456247981888 xla_bridge.py:353] Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.

Is this just a warning?

Same warnings with an A100. I tried monitoring GPU usage but couldn't reach a conclusion.
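[Editorial note] On "Is this just a warning?": yes, JAX logs it and silently falls back to CPU, so the run continues (only far slower). If you'd rather fail fast than discover a CPU-only run hours later, JAX can be told to require the CUDA backend via an environment variable. This is a sketch under assumptions about your jaxlib version: recent jaxlib honors JAX_PLATFORMS, while older releases used JAX_PLATFORM_NAME instead.

```shell
# Force JAX to require the CUDA backend. With this set, JAX raises an
# error at startup if CUDA cannot be initialized, instead of quietly
# falling back to CPU as in the logs above.
export JAX_PLATFORMS=cuda
```

One could then confirm inside the container with something like python3 -c "import jax; print(jax.devices())", which would error out immediately if the GPU is unusable.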

kevin931 avatar May 19 '23 21:05 kevin931

It's probably a driver compatibility issue with the CUDA version.

kbrunnerLXG avatar Jun 05 '23 18:06 kbrunnerLXG

Did you ever find a solution?

crisdarbellay avatar Nov 23 '23 18:11 crisdarbellay