alphafold
Unable to initialize backend 'gpu'
I think I managed to correctly install all the dependencies and requirements as they are described in the documentation.
When I run the command docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
to check if I'm able to run CUDA within Docker, I get the following output:
Tue Jun 7 21:34:50 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06 Driver Version: 470.129.06 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro K600 Off | 00000000:03:00.0 Off | N/A |
| 25% 46C P8 N/A / N/A | 57MiB / 981MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Quadro K600 Off | 00000000:04:00.0 Off | N/A |
| 25% 46C P8 N/A / N/A | 5MiB / 982MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
So, it seems Docker can at least see my GPU.
Afterwards, I built the Docker image using docker build -f docker/Dockerfile -t alphafold . and then installed the Python dependencies using pip3 install -r docker/requirements.txt.
But when I try to run the run_docker.py script, I get a log message stating that no GPU could be found.
The exact command I ran is:
python3 ../alphafold/docker/run_docker.py \
--fasta_paths=./7mkr.fasta \
--model_preset=monomer \
--max_template_date=2020-05-14 \
--data_dir=/media/$USER/'My Passport' \
--output_dir=/home/$USER/Documentos/AlphaFold/7mkr
The script seems to run fine all the way to completion, but I get the following message:
I0607 14:44:25.218496 140038534334272 run_docker.py:255] I0607 21:44:25.217542 139683978950464 tpu_client.py:54] Starting the local TPU driver.
I0607 14:44:25.218731 140038534334272 run_docker.py:255] I0607 21:44:25.218080 139683978950464 xla_bridge.py:212] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
I0607 14:44:25.329375 140038534334272 run_docker.py:255] I0607 21:44:25.328556 139683978950464 xla_bridge.py:212] Unable to initialize backend 'gpu': Internal: no supported devices found for platform CUDA
I0607 14:44:25.329644 140038534334272 run_docker.py:255] I0607 21:44:25.328836 139683978950464 xla_bridge.py:212] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
I0607 14:44:25.329788 140038534334272 run_docker.py:255] W0607 21:44:25.328952 139683978950464 xla_bridge.py:215] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
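These "Unable to initialize backend" lines mean the process inside the container found no usable CUDA device. One low-level thing worth ruling out is whether the CUDA driver library is visible inside the container at all. A stdlib-only sketch (my own diagnostic helper, not part of AlphaFold) that can be run inside the container:

```python
# Sketch: check whether the CUDA driver library can be loaded and
# initialized. If this prints False inside the container, no CUDA
# backend can initialize either.
import ctypes

def cuda_driver_visible():
    """True if libcuda.so.1 loads and cuInit(0) reports CUDA_SUCCESS."""
    try:
        lib = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return False  # driver library not mounted into the container
    # cuInit(0) returns 0 (CUDA_SUCCESS) only when a usable device
    # setup exists; any nonzero code (e.g. no device) means failure.
    return lib.cuInit(0) == 0

print(cuda_driver_visible())
```

If this returns True inside the container but the log above still appears, the driver plumbing is fine and the problem is further up the stack.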
Also, I can't find the output files generated by AlphaFold: they're not in the folder specified by --output_dir, and there's no /tmp/alphafold folder either, though I suppose that's a separate issue.
To check whether the problem is with the GPU drivers or with Docker, my idea was to write a small CUDA program and run it in a separate Docker image, to see if it also fails to find the GPU. But then I realized I don't even know which component is emitting this message: of all the components described in the Dockerfile and in the run_docker.py script, I don't know which one is telling me it cannot find the GPU. Because of this, I don't know where to start diagnosing the problem. I'd appreciate it if somebody could point me in the right direction.
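On the question of which component emits this: the glog-formatted lines name the file that produced them, xla_bridge.py, which belongs to JAX (the library AlphaFold uses for model inference), so it is JAX inside the container that fails to find a CUDA device. A small sketch that pulls the emitting file out of such a line:

```python
# Sketch: extract the emitting source file from a glog-style line like
# the ones above. The "xla_bridge.py:212" field names the file (and
# line number) inside JAX that produced the message.
import re

LOG = ("I0607 21:44:25.328556 139683978950464 xla_bridge.py:212] "
       "Unable to initialize backend 'gpu': Internal: no supported "
       "devices found for platform CUDA")

def emitting_file(line):
    """Return the source file named in a glog-formatted log line."""
    m = re.search(r"([\w./]+\.py):\d+\]", line)
    return m.group(1) if m else None

print(emitting_file(LOG))  # -> xla_bridge.py
```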
I have the same issue.
I've had a similar issue for a long time running on A100 GPUs. AlphaFold sometimes just suddenly breaks: sometimes the NVIDIA Fabric Manager has to be restarted, sometimes Docker, sometimes both. I have zero idea why it works for 2-3 weeks and finishes N runs, then suddenly throws a multitude of errors before or during the (N+1)-th run.
Edit, since it wasn't clear: one of the most common errors is this one, AlphaFold not finding the GPUs.
I too have this issue.
I have the same error
Same on RHEL 8 using a Singularity container:
I0503 09:29:04.720219 23456247981888 xla_bridge.py:353] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
I0503 09:29:04.952224 23456247981888 xla_bridge.py:353] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Interpreter CUDA Host
I0503 09:29:04.952596 23456247981888 xla_bridge.py:353] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
I0503 09:29:04.952670 23456247981888 xla_bridge.py:353] Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.
Is this just a warning?
Same warnings with an A100. I tried monitoring the GPU usage but couldn't reach a conclusion.
It's probably a driver compatibility issue with the CUDA version, maybe.
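If it is a compatibility issue, one quick thing to check is each card's compute capability: recent CUDA/jaxlib builds drop support for older GPU architectures, which can produce exactly "no supported devices found for platform CUDA" even though nvidia-smi lists the card. A sketch, assuming a driver recent enough that nvidia-smi supports the compute_cap query field (older drivers may not, in which case this just returns nothing):

```python
# Sketch: list each GPU's name and compute capability via nvidia-smi.
# Assumes a driver whose nvidia-smi supports the "compute_cap" query
# field; returns [] when nvidia-smi is unavailable or the query fails.
import shutil
import subprocess

def compute_capabilities():
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if out.returncode != 0:
        return []
    # Each line looks like "Quadro K600, 3.0"
    return [tuple(line.split(", ")) for line in out.stdout.splitlines() if line]

print(compute_capabilities())
```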
Did you find a solution?