AlphaFold split_msa_prediction mode incompatible with Hopper GPUs
Running --alphafold2_mode split_msa_prediction on our H200 nodes leads to:
- Falling back to the CUDA driver for PTX compilation; ptxas does not support CC 9.0
- failed to get PTX kernel "shift_right_logical" from module: CUDA_ERROR_NOT_FOUND: named symbol not found
- Execution of replica 0 failed: INTERNAL: Could not find the corresponding function
The error occurs in the run_alphafold2_pred module.
The workflow runs without issue on our A100 cards. I suspect it is caused by a version incompatibility between the CUDA/JAX install and compute capability 9.0 GPUs.
Running the non-split version works without issue. Likely related to #221.
The split version runs with CUDA 11 while the non-split version runs with CUDA 12.
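To make the suspected mismatch concrete, here is a minimal Python sketch of the check, not anything taken from the pipeline. It assumes, based on NVIDIA's release notes, that CUDA 11.8 is the first toolkit whose ptxas can target compute capability 9.0, and that CUDA 11.0 is the first for 8.0 (A100). The names `MIN_CUDA_FOR_CC`, `parse_version` and `toolkit_supports` are hypothetical, and the exact CUDA 11 minor version in the split image is not stated here, so "11.4" below is only an illustrative value.

```python
# Assumed minimum CUDA toolkit per compute capability (from NVIDIA release notes).
# Only the two GPU generations mentioned in this issue are listed.
MIN_CUDA_FOR_CC = {
    (8, 0): (11, 0),  # A100 (Ampere)
    (9, 0): (11, 8),  # H100 / H200 (Hopper)
}


def parse_version(text):
    """Turn a string such as '11.8' or '9.0' into a (major, minor) tuple."""
    parts = text.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def toolkit_supports(cuda_version, compute_cap):
    """Return True if the given CUDA toolkit can compile for the given GPU."""
    cc = parse_version(compute_cap)
    if cc not in MIN_CUDA_FOR_CC:
        raise KeyError(f"compute capability {compute_cap} is not in the table")
    return parse_version(cuda_version) >= MIN_CUDA_FOR_CC[cc]


if __name__ == "__main__":
    # Illustrative values only: an older CUDA 11 toolkit vs. a CUDA 12 toolkit.
    for cuda in ("11.4", "12.2"):
        for cc in ("8.0", "9.0"):
            print(f"CUDA {cuda}, CC {cc}: supported = {toolkit_supports(cuda, cc)}")
```

Under these assumptions, an older CUDA 11 toolkit passes for the A100 and fails for the H200, which matches the behaviour reported above. On recent drivers the compute capability of a node can be read with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`.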
FYI, this will also be affected by #293 which can be closed by #289
@JoseEspinosa I was able to run on H200s using the latest dev images in the repository.
- https://quay.io/repository/nf-core/proteinfold_alphafold2_msa
- https://quay.io/repository/nf-core/proteinfold_alphafold2_split
Your updated Dockerfiles seem to have fixed it. Thank you!
Thanks @jscgh for reminding me. I tested them and forgot to open a PR 😲 But it is nice to know it also works on your HPC. Will do now.
It would be awesome if you could review it @jscgh 🙏
👉 https://github.com/nf-core/proteinfold/pull/304
Closed by #304