Error running with an NVIDIA Blackwell RTX 5090

Open ocstx opened this issue 7 months ago • 7 comments

Hi all, I installed AF without a problem on the following computer:

ASUS PRIME X870-P WIFI
AMD Ryzen 9 7950X 16-Core Processor
OS: AlmaLinux 8.10 (Cerulean Leopard)
nvidia-smi: NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9

I have CUDA 12.9, 12.8, 12.2 and 12.0 installed. Since this is a Blackwell GPU, only the nvidia-open driver flavor is supported (that is the one I installed).

This is my Python env for AF runs:

Python 3.11.11
absl-py==1.0.0
certifi==2025.4.26
charset-normalizer==2.1.1
docker==5.0.0
idna==3.10
requests==2.28.1
six==1.17.0
urllib3==1.26.20
websocket-client==1.8.0

When I run an AlphaFold job, it starts, and I can see jackhmmer running.

Here is the command:

JAX_TRACEBACK_FILTERING=off python3 /path_to_my_dir/ALPHAFOLD_GITHUB/alphafold/docker/run_docker.py --fasta_paths=/path_to_my_dir/ALPHAFOLD/TEST/Rpff2_toAlphF.fst --output_dir=/path_to_my_dir/ALPHAFOLD/TEST/AFOUT/REDUCED-20250606_162608 --data_dir=/path_to_my_dir/ALPHAFOLD/DOWN_DBS --max_template_date=2024-12-31 --db_preset=reduced_dbs --model_preset=monomer_ptm

But when it starts generating the first model I get the following error:

I0606 15:08:19.539341 139629339072320 run_docker.py:258] Traceback (most recent call last):
I0606 15:08:19.539410 139629339072320 run_docker.py:258] File "/app/alphafold/run_alphafold.py", line 570, in <module>
I0606 15:08:19.539434 139629339072320 run_docker.py:258] app.run(main)
I0606 15:08:19.539455 139629339072320 run_docker.py:258] File "/opt/conda/lib/python3.11/site-packages/absl/app.py", line 312, in run
I0606 15:08:19.539471 139629339072320 run_docker.py:258] _run_main(main, args)
I0606 15:08:19.539486 139629339072320 run_docker.py:258] File "/opt/conda/lib/python3.11/site-packages/absl/app.py", line 258, in _run_main
I0606 15:08:19.539502 139629339072320 run_docker.py:258] sys.exit(main(argv))
I0606 15:08:19.539515 139629339072320 run_docker.py:258] ^^^^^^^^^^
I0606 15:08:19.539530 139629339072320 run_docker.py:258] File "/app/alphafold/run_alphafold.py", line 543, in main
I0606 15:08:19.539554 139629339072320 run_docker.py:258] predict_structure(
I0606 15:08:19.539569 139629339072320 run_docker.py:258] File "/app/alphafold/run_alphafold.py", line 284, in predict_structure
I0606 15:08:19.539583 139629339072320 run_docker.py:258] prediction_result = model_runner.predict(processed_feature_dict,
I0606 15:08:19.539597 139629339072320 run_docker.py:258] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0606 15:08:19.539610 139629339072320 run_docker.py:258] File "/app/alphafold/alphafold/model/model.py", line 167, in predict
I0606 15:08:19.539624 139629339072320 run_docker.py:258] result = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
I0606 15:08:19.539638 139629339072320 run_docker.py:258] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0606 15:08:19.539651 139629339072320 run_docker.py:258] File "/opt/conda/lib/python3.11/site-packages/jax/_src/random.py", line 241, in PRNGKey
I0606 15:08:19.539664 139629339072320 run_docker.py:258] return _return_prng_keys(True, _key('PRNGKey', seed, impl))
I0606 15:08:19.539682 139629339072320 run_docker.py:258] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0606 15:08:19.539695 139629339072320 run_docker.py:258] File "/opt/conda/lib/python3.11/site-packages/jax/_src/random.py", line 203, in _key
I0606 15:08:19.539713 139629339072320 run_docker.py:258] return prng.random_seed(seed, impl=impl)
I0606 15:08:19.539726 139629339072320 run_docker.py:258] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0606 15:08:19.539738 139629339072320 run_docker.py:258] File "/opt/conda/lib/python3.11/site-packages/jax/_src/prng.py", line 639, in random_seed
I0606 15:08:19.539750 139629339072320 run_docker.py:258] return random_seed_p.bind(seeds_arr, impl=impl)
I0606 15:08:19.539766 139629339072320 run_docker.py:258] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0606 15:08:19.539779 139629339072320 run_docker.py:258] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 387, in bind
I0606 15:08:19.539791 139629339072320 run_docker.py:258] return self.bind_with_trace(find_top_trace(args), args, params)
I0606 15:08:19.539805 139629339072320 run_docker.py:258] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0606 15:08:19.539818 139629339072320 run_docker.py:258] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 391, in bind_with_trace
I0606 15:08:19.539832 139629339072320 run_docker.py:258] out = trace.process_primitive(self, map(trace.full_raise, args), params)
I0606 15:08:19.539847 139629339072320 run_docker.py:258] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0606 15:08:19.539859 139629339072320 run_docker.py:258] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 879, in process_primitive
I0606 15:08:19.539913 139629339072320 run_docker.py:258] return primitive.impl(*tracers, **params)
I0606 15:08:19.539929 139629339072320 run_docker.py:258] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0606 15:08:19.539941 139629339072320 run_docker.py:258] File "/opt/conda/lib/python3.11/site-packages/jax/_src/prng.py", line 651, in random_seed_impl
I0606 15:08:19.539990 139629339072320 run_docker.py:258] base_arr = random_seed_impl_base(seeds, impl=impl)
I0606 15:08:19.540006 139629339072320 run_docker.py:258] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0606 15:08:19.540019 139629339072320 run_docker.py:258] File "/opt/conda/lib/python3.11/site-packages/jax/_src/prng.py", line 656, in random_seed_impl_base
I0606 15:08:19.540048 139629339072320 run_docker.py:258] return seed(seeds)
I0606 15:08:19.540071 139629339072320 run_docker.py:258] ^^^^^^^^^^^
I0606 15:08:19.540085 139629339072320 run_docker.py:258] File "/opt/conda/lib/python3.11/site-packages/jax/_src/prng.py", line 885, in threefry_seed
I0606 15:08:19.540120 139629339072320 run_docker.py:258] return _threefry_seed(seed)
I0606 15:08:19.540134 139629339072320 run_docker.py:258] ^^^^^^^^^^^^^^^^^^^^
I0606 15:08:19.540149 139629339072320 run_docker.py:258] jaxlib.xla_extension.XlaRuntimeError: INTERNAL: ptxas exited with non-zero error code 65280, output: ptxas fatal : Program with .target 'sm_90a' cannot be compiled to future architecture
I0606 15:08:19.540164 139629339072320 run_docker.py:258] : If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
I0606 15:08:19.540177 139629339072320 run_docker.py:258] --------------------
I0606 15:08:19.540189 139629339072320 run_docker.py:258] For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
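
The line that matters in this traceback is the ptxas message: the program was generated for `.target 'sm_90a'`, and ptxas refuses to compile it for a newer GPU. A minimal sketch, assuming the message format shown above, extracts that target from a log and compares it with the device's compute capability. The function names are hypothetical, and the capability (12, 0) used for the RTX 5090 is an assumption taken from NVIDIA's published compute capability table, not from this thread.

```python
import re


def ptxas_target(log_text):
    """Return the '.target' named in a ptxas fatal message, or None."""
    match = re.search(r"\.target '(sm_\d+[a-z]?)'", log_text)
    return match.group(1) if match else None


def target_capability(target):
    """Convert e.g. 'sm_90a' to (9, 0); the last digit is the minor version."""
    digits = re.fullmatch(r"sm_(\d+)[a-z]?", target).group(1)
    return int(digits[:-1]), int(digits[-1])


def device_is_newer(target, device_capability):
    """True when the GPU is newer than the architecture the program targets."""
    return device_capability > target_capability(target)


message = (
    "ptxas fatal : Program with .target 'sm_90a' "
    "cannot be compiled to future architecture"
)
target = ptxas_target(message)
# (12, 0) is the assumed compute capability of the RTX 5090.
print(target, target_capability(target), device_is_newer(target, (12, 0)))
```

If the device turns out to be newer than the target, the likely cause is a JAX/CUDA stack inside the container that predates the GPU, which points towards rebuilding the image on a newer base.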

PS: I have installed and run AlphaFold without a problem on several other computers, all with different hardware but all with the same OS.

Any help will be much appreciated

ocstx avatar Jun 06 '25 14:06 ocstx

Having identical issue here, happy to hear any suggestions for fix. Installing on a second computer with RTX5090 vs previous 4090, and while install succeeds, I'm getting the same traceback as above.

strnadja avatar Jun 11 '25 17:06 strnadja

@ocstx Have you found a solution on the 5090 yet?

strnadja avatar Jul 03 '25 21:07 strnadja

No, I have been focused on AlphaFold 3. I found a fix for it in its issue 394. When I have the time I'll try to apply the same idea (updating the Ubuntu image and the Python dependency versions) to AF2, but I lack the knowledge to work out which combination of versions makes sense (I never managed to get ESMFold working). If you could try, I would very much appreciate it.

ocstx avatar Jul 04 '25 05:07 ocstx

I've got it running for the GEFORCE RTX 5090 with driver version 570.153.02 and CUDA 12.8.0. In brief, I believe the most important changes included:

  • using the nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04 image
  • using jax[cuda12]==0.6.0
  • upgrading to dm-haiku==0.0.14
  • installing tensorflow==2.12.0 (alongside tensorflow-cpu)
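
A quick way to confirm that a rebuilt image actually contains these versions is to compare the installed packages against the pins. The following is a minimal sketch, not part of the attached files: the pins are copied from the list above, while `PINS`, `installed_version` and `check_pins` are hypothetical names chosen for illustration. It is meant to be run with the Python interpreter inside the built image.

```python
from importlib import metadata

# Versions copied from the list above; the rest of this sketch is illustrative.
PINS = {"jax": "0.6.0", "dm-haiku": "0.0.14", "tensorflow": "2.12.0"}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every package that differs."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means all three pins match. The `lookup` parameter exists only so the check can be exercised without the packages installed.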

Here are my Dockerfile and requirements.txt:

Dockerfile.txt

requirements.txt

I haven't robustly tested it, but I'm getting pdbs out of a --model_preset=multimer run, so this at least gets you to a good starting point!

strnadja avatar Jul 10 '25 23:07 strnadja

It totally worked for me! I ran my standard test, which uses "--model_preset=monomer_ptm", and it delivered what was expected. The model-generation part (the one that really uses the GPU) was 20% faster than the same run on an RTX 4090.

thank you very much @strnadja

ocstx avatar Jul 11 '25 13:07 ocstx

Using this Dockerfile will prompt a conflict in conda dependencies. Is it necessary to fully install this to build the operation?

This is strange. Last July it worked without errors. Could you include the full stderr/stdout? You'll have to run the Docker build like this:

docker build --no-cache --progress=plain -f docker/Dockerfile -t alphafold . 2>&1 | tee build.log
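
Once `build.log` exists, the failing step can be pulled out of it instead of reading the whole file. This is a minimal sketch that assumes the `#N ` step prefix BuildKit prints with `--progress=plain`; the function name `failed_steps` and the sample log lines are hypothetical.

```python
import re
from collections import defaultdict


def failed_steps(log_text):
    """Map each failing build step number to all log lines of that step."""
    steps = defaultdict(list)
    failed = []
    for line in log_text.splitlines():
        match = re.match(r"#(\d+) (.*)", line)
        if not match:
            continue  # not a BuildKit step line
        step, rest = int(match.group(1)), match.group(2)
        steps[step].append(rest)
        if rest.startswith("ERROR") and step not in failed:
            failed.append(step)
    return {step: steps[step] for step in failed}


# Illustrative sample only, not taken from a real build.
sample = "\n".join([
    "#7 0.512 step seven finished",
    "#8 1.940 something went wrong",
    '#8 ERROR: process "/bin/bash -c false" did not complete successfully',
])
for step, lines in failed_steps(sample).items():
    print(f"step {step} failed:")
    for text in lines:
        print("   ", text)
```

For a real log, read the file first, for example with `open("build.log").read()`, and pass the text to `failed_steps`.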

ocstx avatar Sep 10 '25 09:09 ocstx

I cannot see the messages from @976282479 here, but I received them via email. The problem seems to be a conda error when running docker build.

In any case, I checked my build.log. This line is there:

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)

So I'm guessing it is not important. Regarding this one:

#8 1.940 CondaToSNonInteractiveError: Terms of Service have not been accepted for the following channels. Please accept or remove them before proceeding:
#8 1.940 - https://repo.anaconda.com/pkgs/main
#8 1.940 - https://repo.anaconda.com/pkgs/r
#8 1.940
#8 1.940 To accept these channels' Terms of Service, run the following commands:
#8 1.940 conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
#8 1.940 conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
#8 1.940
#8 1.940 For information on safely removing channels from your conda configuration,
#8 1.940 please see the official documentation:
#8 1.940
#8 1.940 https://www.anaconda.com/docs/tools/working-with-conda/channels
#8 1.940
#8 ERROR: process "/bin/bash -o pipefail -c conda install --quiet --yes conda==24.11.1 pip python=3.11 && conda install --quiet --yes --channel conda-forge libstdcxx-ng>=12.1.0 openmm=8.0.0 pdbfixer && conda clean --all --force-pkgs-dirs --yes" did not complete successfully: exit code: 1

That one is not in my build.log, but it looks like a licensing issue; maybe something has changed. Did you follow the instructions in the error? I'm guessing that it could be accomplished by changing the line:

RUN conda install --quiet --yes conda==24.11.1 pip python=3.11 && conda install --quiet --yes --channel conda-forge libstdcxx-ng>=12.1.0 openmm=8.0.0 pdbfixer && conda clean --all --force-pkgs-dirs --yes

to:

RUN conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main && conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r && conda install --quiet --yes conda==24.11.1 pip python=3.11 && conda install --quiet --yes --channel conda-forge libstdcxx-ng>=12.1.0 openmm=8.0.0 pdbfixer && conda clean --all --force-pkgs-dirs --yes
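
The same edit can be derived mechanically from the error text, which may help if the list of channels changes. This is a rough, untested-against-a-real-build sketch: it assumes the channels appear as `- https://...` lines as in the error above and reuses the `conda tos accept` command exactly as that error prints it. The names `tos_accept_commands` and `prepend_to_run` are hypothetical.

```python
import re


def tos_accept_commands(build_log):
    """Build one 'conda tos accept' command per channel listed in the error."""
    channels = []
    for url in re.findall(r"- (https://\S+)", build_log):
        if url not in channels:  # keep first occurrence, preserve order
            channels.append(url)
    return [
        f"conda tos accept --override-channels --channel {url}"
        for url in channels
    ]


def prepend_to_run(run_line, commands):
    """Chain the given commands in front of an existing Dockerfile RUN line."""
    if not run_line.startswith("RUN "):
        raise ValueError("expected a Dockerfile RUN instruction")
    body = run_line[len("RUN "):]
    return "RUN " + " && ".join(commands + [body])


error_text = (
    "#8 1.940 - https://repo.anaconda.com/pkgs/main\n"
    "#8 1.940 - https://repo.anaconda.com/pkgs/r\n"
)
commands = tos_accept_commands(error_text)
print(prepend_to_run("RUN conda install --quiet --yes pip", commands))
```

The short `RUN` line in the usage example is a placeholder; in practice the full line from the Dockerfile would be passed in.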

But I'm not an expert in this. If someone runs into the same problem and tries this, a report on whether it works would be appreciated.

ocstx avatar Sep 15 '25 06:09 ocstx