
How do I fix the "Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:" error?

Open kbrunnerLXG opened this issue 1 year ago • 14 comments

I0608 08:48:12.491065 140245639700480 run_docker.py:116] Mounting /home/aditya/Documents/Spodoptera_frugiperda/proteins -> /mnt/fasta_path_0
I0608 08:48:12.491170 140245639700480 run_docker.py:116] Mounting /home/aditya/Documents/Spodoptera_frugiperda/proteins -> /mnt/fasta_path_1
I0608 08:48:12.491236 140245639700480 run_docker.py:116] Mounting /home/aditya/Documents/Spodoptera_frugiperda/proteins -> /mnt/fasta_path_2
I0608 08:48:12.491298 140245639700480 run_docker.py:116] Mounting /home/aditya/Documents/Spodoptera_frugiperda/proteins -> /mnt/fasta_path_3
I0608 08:48:12.491358 140245639700480 run_docker.py:116] Mounting /home/aditya/Documents/Spodoptera_frugiperda/proteins -> /mnt/fasta_path_4
I0608 08:48:12.491417 140245639700480 run_docker.py:116] Mounting /home/aditya/Documents/Spodoptera_frugiperda/proteins -> /mnt/fasta_path_5
I0608 08:48:12.491476 140245639700480 run_docker.py:116] Mounting /home/aditya/Documents/Spodoptera_frugiperda/proteins -> /mnt/fasta_path_6
I0608 08:48:12.491533 140245639700480 run_docker.py:116] Mounting /home/aditya/Documents/Spodoptera_frugiperda/proteins -> /mnt/fasta_path_7
I0608 08:48:12.491591 140245639700480 run_docker.py:116] Mounting /home/aditya/Documents/Spodoptera_frugiperda/proteins -> /mnt/fasta_path_8
I0608 08:48:12.516574 140245639700480 run_docker.py:116] Mounting /media/aditya/New Volume/protein_database/uniref90 -> /mnt/uniref90_database_path
I0608 08:48:12.516844 140245639700480 run_docker.py:116] Mounting /media/aditya/New Volume/protein_database/mgnify -> /mnt/mgnify_database_path
I0608 08:48:12.516937 140245639700480 run_docker.py:116] Mounting /media/aditya/New Volume/protein_database -> /mnt/data_dir
I0608 08:48:12.525529 140245639700480 run_docker.py:116] Mounting /media/aditya/New Volume/protein_database/pdb_mmcif/mmcif_files -> /mnt/template_mmcif_dir
I0608 08:48:12.536713 140245639700480 run_docker.py:116] Mounting /media/aditya/New Volume/protein_database/pdb_mmcif -> /mnt/obsolete_pdbs_path
I0608 08:48:12.541885 140245639700480 run_docker.py:116] Mounting /media/aditya/New Volume/protein_database/pdb70 -> /mnt/pdb70_database_path
I0608 08:48:12.552317 140245639700480 run_docker.py:116] Mounting /media/aditya/New Volume/protein_database/small_bfd -> /mnt/small_bfd_database_path
I0608 08:48:19.318056 140245639700480 run_docker.py:258] I0608 03:18:19.316803 139978587400000 templates.py:857] Using precomputed obsolete pdbs /mnt/obsolete_pdbs_path/obsolete.dat.
I0608 08:48:21.933797 140245639700480 run_docker.py:258] I0608 03:18:21.933366 139978587400000 xla_bridge.py:353] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
I0608 08:48:22.052647 140245639700480 run_docker.py:258] I0608 03:18:22.052219 139978587400000 xla_bridge.py:353] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Interpreter Host CUDA
I0608 08:48:22.052834 140245639700480 run_docker.py:258] I0608 03:18:22.052553 139978587400000 xla_bridge.py:353] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
I0608 08:48:22.052919 140245639700480 run_docker.py:258] I0608 03:18:22.052629 139978587400000 xla_bridge.py:353] Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.
I0608 08:48:31.968616 140245639700480 run_docker.py:258] I0608 03:18:31.968227 139978587400000 run_alphafold.py:466] Have 5 models: ['model_1_pred_0', 'model_2_pred_0', 'model_3_pred_0', 'model_4_pred_0', 'model_5_pred_0']
I0608 08:48:31.968794 140245639700480 run_docker.py:258] I0608 03:18:31.968350 139978587400000 run_alphafold.py:480] Using random seed 1666331041262993896 for the data pipeline
I0608 08:48:31.968888 140245639700480 run_docker.py:258] I0608 03:18:31.968505 139978587400000 run_alphafold.py:218] Predicting XP_035454815.2
I0608 08:48:31.970123 140245639700480 run_docker.py:258] I0608 03:18:31.969789 139978587400000 jackhmmer.py:133] Launching subprocess "/usr/bin/jackhmmer -o /dev/null -A /tmp/tmpxxi5lneh/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /mnt/fasta_path_0/XP_035454815.2.fasta /mnt/uniref90_database_path/uniref90.fasta"
I0608 08:48:31.991148 140245639700480 run_docker.py:258] I0608 03:18:31.990779 139978587400000 utils.py:36] Started Jackhmmer (uniref90.fasta) query

kbrunnerLXG avatar Jun 08 '23 03:06 kbrunnerLXG

+1 for the same question. It's not clear whether the warnings at the start are indicative of a GPU issue or not.

rocketman8080 avatar Jun 13 '23 03:06 rocketman8080

Same here. "features" step is extremely slow (1800 for 100 AA). Running on a GCloud compute instance. Followed the installation instructions for a docker container correctly as far as I can tell.

Update: at least this page says that these messages can be ignored: https://confluence.desy.de/display/MXW/alphafold+2.1.1+-+docker

However, prediction is still super slow. I've been stuck at the Jackhmmer step for ages.
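
In case it helps: a quick way to check whether JAX inside the container actually picked up the GPU despite those warnings (a minimal sketch; "alphafold" is just a placeholder for whatever container name docker ps shows) is:

docker exec -it alphafold python3 -c "import jax; print(jax.default_backend(), jax.devices())"

If that prints gpu together with a CUDA device, the backend warnings really are harmless and the slowness is coming from somewhere else.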

andrejberg avatar Jun 26 '23 09:06 andrejberg

I have the same issue: the "Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:" error pops up, and then the features step is extremely slow even though the GPU shows a running Python process.

I also followed the installation instructions successfully... Might it be a problem with the JAX and CUDA/cuDNN versioning?
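
To rule that out, a minimal version check might look like this (the container name "alphafold" is a placeholder; run nvidia-smi on the host):

docker exec -it alphafold python3 -c "import jax, jaxlib; print(jax.__version__, jaxlib.__version__)"
nvidia-smi

nvidia-smi reports the driver version and the highest CUDA version that driver supports, which can then be compared against the CUDA build the installed jaxlib wheel expects.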

OCald avatar Jul 18 '23 13:07 OCald

In my opinion this is most probably a driver compatibility issue caused by the update to CUDA 12, since CUDA 11.8 is the highest version currently supported by TensorFlow.

kbrunnerLXG avatar Jul 18 '23 14:07 kbrunnerLXG

To chime in here (because I think I have the same issue): maybe I've misunderstood, but the Dockerfile in alphafold/docker/ requests CUDA 11.1.1, and the CUDA version on the bare metal (i.e. on your local machine) should not influence whether or not a lower CUDA version can run inside a Docker container.

To convince yourself of this, you can run:

docker pull tensorflow/tensorflow:2.4.0-gpu

to get the vanilla TensorFlow image, boot it up:

docker run --gpus all -it tensorflow/tensorflow:2.4.0-gpu bash

and then check the CUDA version inside the container by running:

$ nvcc --version

Which should yield:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

Then prove to yourself that TensorFlow can access the GPU by starting a Python interpreter:

$ python

Followed by

import tensorflow as tf

# List the physical GPUs TensorFlow can see and enable memory growth on each.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before the GPUs have been initialized.
        print(e)
else:
    print("No GPUs found")

Which hopefully prints out the number of physical and logical GPUs you have available.

This doesn't get us any closer to solving the issue, but I don't think your non-Docker CUDA version should prevent a Docker container from running a lower CUDA version.
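
Since AlphaFold's inference actually runs on JAX rather than TensorFlow, the analogous JAX check (a minimal sketch, assuming jax and jaxlib are installed in the container you are testing) would be:

import jax

# "gpu" means the CUDA backend initialized; "cpu" means JAX fell back to the host CPU.
print(jax.default_backend())

# Lists every device JAX can see, e.g. one GPU device or just a CPU device.
print(jax.devices())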

alexholehouse avatar Aug 02 '23 01:08 alexholehouse

Anyone able to solve this?

rocketman8080 avatar Aug 21 '23 02:08 rocketman8080

For now, we're also ignoring these warnings in our lab, and other than the Jackhmmer step taking a very long time, everything works as expected.

Update: at least this page says that these messages can be ignored: https://confluence.desy.de/display/MXW/alphafold+2.1.1+-+docker However, prediction is still super slow. I've been stuck at the Jackhmmer step for ages.

daphn3k avatar Aug 22 '23 12:08 daphn3k

I have the same question. The Jackhmmer step takes a long time. Who can help me?

AndrewLisz avatar Oct 13 '23 06:10 AndrewLisz

I want to know whether this error could influence the outcome of the prediction, or whether it just makes the Jackhmmer step slow.

Lili-irtyd avatar Jan 12 '24 04:01 Lili-irtyd

I want to know whether this error could influence the outcome of the prediction, or whether it just makes the Jackhmmer step slow.

It just makes the Jackhmmer step very slow. You can try ColabFold.
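
If you do go the ColabFold route, the local batch CLI (assuming a localcolabfold-style install; the input and output paths here are placeholders) is typically invoked as:

colabfold_batch query.fasta output_dir/

ColabFold swaps the local jackhmmer/HHblits database searches for an MMseqs2 server query, which is usually far faster than the Jackhmmer step discussed here.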

AndrewLisz avatar Jan 20 '24 06:01 AndrewLisz

Does anyone have a solution/workaround for this? As others have mentioned, the Jackhmmer step is extremely slow and makes the entire run take >25 minutes for a simple sequence of ~amino acids.

jonlevi avatar Feb 09 '24 19:02 jonlevi

I want to know whether this error could influence the outcome of the prediction, or whether it just makes the Jackhmmer step slow.

It just makes the Jackhmmer step very slow. You can try ColabFold.

I don't see how this error has anything to do with Jackhmmer at all. As far as I know, Jackhmmer runs purely on the CPU.
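
To see that in isolation, you can time the same uniref90 search outside of AlphaFold, reusing the exact flags from the log at the top of this issue (a rough sketch; the FASTA and database paths are placeholders for wherever they live on your machine):

time jackhmmer -o /dev/null -A /tmp/output.sto --noali \
  --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 \
  --cpu 8 -N 1 query.fasta uniref90.fasta

The GPU never enters this step; it is bound by the 8 CPU threads and by how fast the disk can stream the multi-gigabyte uniref90.fasta.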

fredricj avatar Feb 19 '24 00:02 fredricj

I want to know whether this error could influence the outcome of the prediction, or whether it just makes the Jackhmmer step slow.

It just makes the Jackhmmer step very slow. You can try ColabFold.

I don't see how this error has anything to do with Jackhmmer at all. As far as I know, Jackhmmer runs purely on the CPU.

Agreed; in retrospect I don't think this is actually a major issue, although I'm not sure. But I agree that Jackhmmer should not depend on the GPU.

alexholehouse avatar Feb 19 '24 11:02 alexholehouse

RTX 4090 here. +1 for the same question; it's not clear whether the warnings at the start are indicative of a GPU issue or not.

sunqiangzai avatar Mar 25 '24 15:03 sunqiangzai