Alphafold split_msa_prediction mode incompatible with Hopper GPUs

Open tlitfin-unsw opened this issue 9 months ago • 1 comments

Running --alphafold2_mode split_msa_prediction on our H200 nodes leads to:

  • Falling back to the CUDA driver for PTX compilation; ptxas does not support CC 9.0
  • failed to get PTX kernel "shift_right_logical" from module: CUDA_ERROR_NOT_FOUND: named symbol not found
  • Execution of replica 0 failed: INTERNAL: Could not find the corresponding function

The error occurs in the run_alphafold2_pred module.
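
For anyone triaging many task logs, the ptxas warning quoted above can be picked out mechanically. This is a minimal sketch, assuming the log wording matches the lines in this issue exactly; the function name `unsupported_ccs` is hypothetical and not part of the pipeline.

```python
import re

# Matches the ptxas fallback warning quoted in this issue, e.g.
# "... ptxas does not support CC 9.0"
PTXAS_CC_RE = re.compile(r"ptxas does not support CC (\d+)\.(\d+)")


def unsupported_ccs(log_text: str) -> set[tuple[int, int]]:
    """Return every compute capability that ptxas reported as unsupported."""
    return {(int(major), int(minor)) for major, minor in PTXAS_CC_RE.findall(log_text)}


# The three log lines from this issue, used as sample input.
log = """\
Falling back to the CUDA driver for PTX compilation; ptxas does not support CC 9.0
failed to get PTX kernel "shift_right_logical" from module: CUDA_ERROR_NOT_FOUND: named symbol not found
Execution of replica 0 failed: INTERNAL: Could not find the corresponding function
"""
print(unsupported_ccs(log))  # {(9, 0)}
```

An empty set means the warning was not found, so the failure probably has a different cause.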

The workflow runs without issue on our A100 cards. I suspect the failure is caused by a version incompatibility between the CUDA/JAX install and compute capability 9.0 GPUs.

Running the non-split version works without issue. Likely related to #221.

The split version runs with CUDA 11, while the non-split version runs with CUDA 12.
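
The suspected mismatch can be sanity-checked by comparing a container's CUDA toolkit version against the first release that targets a given compute capability. This is a rough sketch, not a definitive check: the table values are an assumption based on NVIDIA's CUDA release notes (CC 8.0 first targeted by CUDA 11.0, CC 9.0 by CUDA 11.8), the helper names are hypothetical, and the issue only says "cuda 11", so the minor version used below is purely illustrative.

```python
# First CUDA toolkit release assumed to target each compute capability.
# Assumption: taken from NVIDIA's CUDA release notes; verify before relying on it.
MIN_CUDA_FOR_CC = {
    (8, 0): (11, 0),  # Ampere, e.g. A100
    (9, 0): (11, 8),  # Hopper, e.g. H200
}


def parse_version(text: str) -> tuple[int, int]:
    """Parse 'major.minor' (or just 'major') into a comparable tuple."""
    parts = text.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def toolkit_supports(cuda_version: str, compute_capability: str) -> bool:
    """True if the toolkit is at least the first release targeting this CC."""
    required = MIN_CUDA_FOR_CC[parse_version(compute_capability)]
    return parse_version(cuda_version) >= required


# Illustrative versions only; the actual versions in the images are not stated here.
print(toolkit_supports("11.1", "8.0"))  # True
print(toolkit_supports("11.1", "9.0"))  # False
print(toolkit_supports("12.2", "9.0"))  # True
```

Under these assumptions, an early CUDA 11 toolkit would work on A100 but not on H200, which matches the behaviour described above.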

tlitfin-unsw, Apr 14 '25 23:04

FYI, this will also be affected by #293, which can be closed by #289.

tlitfin-unsw, Apr 25 '25 01:04

@JoseEspinosa I was able to run on H200s using the latest dev images in the repository.

  • https://quay.io/repository/nf-core/proteinfold_alphafold2_msa
  • https://quay.io/repository/nf-core/proteinfold_alphafold2_split

Your updated Dockerfiles seem to have fixed it. Thank you!

jscgh, May 08 '25 03:05

Thanks @jscgh for reminding me. I tested them and forgot to open a PR 😲 But it is nice to know it also works on your HPC. Will do now.

JoseEspinosa, May 08 '25 08:05

It would be awesome if you could review it @jscgh 🙏
👉 https://github.com/nf-core/proteinfold/pull/304

JoseEspinosa, May 08 '25 08:05

Closed by #304

tlitfin-unsw, May 08 '25 23:05