Build openfold with newer pytorch + cuda
Right now openfold asks for an old pytorch + CUDA (11.2), so the latest Linux distributions are not able to build openfold.
I would like to upgrade the supported pytorch + CUDA versions (and other Python packages accordingly), so people can use newer platforms (OS, etc.).
Indeed, preparing the environment on a newer platform is far from easy. I just did it this way on my Debian testing (rolling) setup:
mamba create -n of
mamba activate of
mamba install -c pytorch -c nvidia -c conda-forge -c bioconda pytorch pytorch-cuda=12.1 python=3.10 packaging ninja hhsuite kalign2 openmm pdbfixer biopython pytorch-lightning PyYAML tqdm wandb awscli aria2 hmmer deepspeed dm-tree py3Dmol modelcif
pip install flash-attention ml_collections git+https://github.com/NVIDIA/dllogger.git
git clone [email protected]:aqlaboratory/openfold.git
cd openfold
sed -i -e 's/-std=c++14/-std=c++17/' setup.py
scripts/install_third_party_dependencies.sh
mamba deactivate
mamba activate of
While iterating towards these lines, I encountered a few pitfalls:
- hhsuite requires Python 3.10 (not 3.11)
- pytorch 2.2.0 requires compilation with C++17, not C++14
- you really need a properly installed CUDA (that it worked for other things is not enough; see the quick check below)
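A quick way to verify that last point (my own sanity check, not part of the original recipe) is to compare the CUDA version pytorch was built against with what the system toolkit and driver report:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
nvcc --version
nvidia-smi

If torch.version.cuda and the nvcc release disagree on the major version, the extension builds tend to fail with exactly the mismatch error quoted later in this thread.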
And the preceding CUDA setup has pitfalls as well:
- the non-free-firmware section is new in Debian; it might be missing from sources.list but is vital
- the recent Nvidia Ampere driver does not compile against recent Linux kernels, but Debian ships a fixed driver in bookworm-updates
I find CUDA setup via the Debian repos easier than via Nvidia (in fact, at this moment the critical bug 4336331 in the Nvidia driver is fixed only in Debian). I add these sources:
deb http://deb.debian.org/debian/ bookworm-updates main non-free contrib non-free-firmware
deb-src http://deb.debian.org/debian/ bookworm-updates main non-free contrib non-free-firmware
and install:
apt install nvidia-cuda-dev nvidia-cuda-toolkit
I got these package versions:
pytorch 2.2.0 py3.10_cuda12.1_cudnn8.9.2_0 pytorch
pytorch-cuda 12.1 ha16c6d3_5 pytorch
python 3.10.13 hd12c33a_1_cpython conda-forge
packaging 23.2 pyhd8ed1ab_0 conda-forge
ninja 1.11.1 h924138e_0 conda-forge
hhsuite 3.3.0 py310pl5321h068649b_9 bioconda
kalign2 2.04 h031d066_5 bioconda
openmm 8.1.1 py310h358ce72_1 conda-forge
pdbfixer 1.9 pyh1a96a4e_0 conda-forge
biopython 1.83 py310h2372a71_0 conda-forge
pytorch-lightning 2.1.3 pyhd8ed1ab_0 conda-forge
pyyaml 6.0.1 py310h2372a71_1 conda-forge
tqdm 4.66.2 pyhd8ed1ab_0 conda-forge
wandb 0.16.3 pyhd8ed1ab_0 conda-forge
awscli 2.15.21 py310hff52083_0 conda-forge
aria2 1.37.0 h347180d_1 conda-forge
hmmer 3.4 hdbdd923_0 bioconda
deepspeed 0.13.1 cpu_py310h11dbdba_0 conda-forge
dm-tree 0.1.8 py310h620c231_2 conda-forge
py3dmol 2.0.4 pyhd8ed1ab_0 conda-forge
modelcif 0.9 pyhd8ed1ab_0 conda-forge
flash-attention 1.0.0 pypi_0 pypi
ml-collections 0.1.1 pypi_0 pypi
dllogger 1.0.0 pypi_0 pypi
and outside the mamba environment, I have:
$ gcc --version
gcc (Debian 10.3.0-15) 10.3.0
Copyright (C) 2020 Free Software Foundation, Inc.
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0
$ nvidia-smi
NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0
$ cat /proc/version
Linux version 6.6.15-amd64 ([email protected]) (gcc-13 (Debian 13.2.0-13) 13.2.0, GNU ld (GNU Binutils for Debian) 2.42) #1 SMP PREEMPT_DYNAMIC Debian 6.6.15-2 (2024-02-04)
It is possible that OpenFold will need some little tweaks here and there with this setup. But I hope this helps a little bit...
Meanwhile, PR #407 just landed in the codebase (thanks @jnwei and everybody involved!) and it is supposed to tackle these issues. While it certainly moves the code forward (and likely contains some of the needed "tweaks here and there" I mentioned above), it still does not allow me to install an environment by just following the install instructions in the README. When I do:
mamba env create -n openfold_env -f environment.yml
it tries to install older packages than the PR #407 description suggests:
Looking for: ['python=3.9', 'libgcc=7.2', 'setuptools=59.5.0', 'pip', 'openmm=7.7', 'pdbfixer', 'cudatoolkit=11.3', 'pytorch-lightning==1.5.10', 'biopython==1.79', 'numpy==1.21', 'pandas==2.0', 'pyyaml==5.4.1', 'requests', 'scipy==1.7', 'tqdm==4.62.2', 'typing-extensions==3.10', 'wandb==0.12.21', 'modelcif==0.7', 'awscli', 'ml-collections', 'aria2', 'git', 'bioconda::hmmer==3.3.2', 'bioconda::hhsuite==3.3.0', 'bioconda::kalign2==2.04', 'pytorch::pytorch=1.12']
and then fails with:
RuntimeError:
The detected CUDA version (12.0) mismatches the version that was used to compile
PyTorch (11.3). Please make sure to use the same CUDA versions.
In the current environment.yml I find pytorch-lightning==1.5.10 suspicious, which might lead to an older pytorch (?); in my experiments above I got pytorch-lightning=2.1.3. (Or maybe python=3.9 is a problem? Certainly python=3.10 worked better for me and might influence the available pytorch versions.)
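One way to test this suspicion is to ask the solver what the old pin actually depends on; a hypothetical check, assuming a recent mamba with the repoquery subcommand:

mamba repoquery depends -c conda-forge pytorch-lightning=1.5.10

If that output shows a pytorch constraint below 2.x, the pin would indeed hold the whole environment back.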
Also, the C++14/17 patch (which I did above via sed) would likely be needed for the compilation in scripts/install_third_party_dependencies.sh to succeed.
@abeebyekeen I see that you also devoted considerable effort to setting up environment.yml. It would be nice to hear how your current setup works for you. (I am particularly interested in the effects of the 'cuda' conda package - maybe it allows an even more minimalist CUDA setup in the operating system? Just the kernel driver?) Or how it compares with my setup (2nd post in this thread), if you have any incentive to try.
Hi @vaclavhanzl. Yes, I spent a good part of last weekend trying to set up a tool that requires openfold as a dependency. I was initially unable to build openfold due to a number of problems, including the -std=c++14, gcc/g++, and CUDA version mismatch errors you have mentioned. Please note that I also had other CUDA errors thrown by pytorch. So I created a fork to try and figure out where each error was coming from (especially with building openfold).
Here is what I've got and the selections that eventually worked for me in solving all the problems:
- Python version: 3.9.18
- Linux: Red Hat 4.8.5-44, 3.10.0-1160.88.1.el7.x86_64
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
$ gcc --version
gcc (GCC) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
For the environment I needed, here is how I set it up:
mamba create -n npl python==3.9 pip
mamba activate npl
mamba install cudatoolkit==11.8.*
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --no-cache-dir
python -m pip install torch-scatter==2.1.2 -f https://data.pyg.org/whl/torch-2.2.0+cu118.html --no-cache-dir
python -m pip install "git+https://github.com/facebookresearch/pytorch3d.git" --force-reinstall --no-deps --no-cache-dir
To get openfold to build, I was able to use the environment.yml in the openfold repo without changes. I, however, had to use a gcc version between 5.x and 11.x, and also set the build flags in setup.py to std=c++17, just like @vaclavhanzl did:
git clone https://github.com/aqlaboratory/openfold.git
cd openfold
sed -i 's/std=c++14/std=c++17/g' setup.py
python -m pip install .
python -m pip install -r other_requirements.txt
And that works perfectly.
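For anyone replicating this, a minimal post-install check (my addition; the import path appears later in this thread) exercises the same import chain that breaks in misconfigured setups:

python -c "from openfold.model.model import AlphaFold; print('openfold imports OK')"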
Thanks a lot @abeebyekeen for sharing this with us in a very clear way! I'd very much like to see OpenFold working out of the box for most new users, especially now that there are great new and publicized features. I still do not know what a PR going that way should look like; addressing the different setups people might have is hard. I was thinking about having ranges of versions in environment.yml (see the hypothetical example below), but that is not an easy way either. I guess @jnwei might have some plan here, and I hope we help at least a little bit to move that way.
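For illustration only: conda match specifications do support ranges, so an environment.yml entry (or an ad-hoc install) could in principle pin a window instead of an exact version; a hypothetical example:

mamba install "pytorch>=2.1,<2.3" "pytorch-lightning>=2.1,<3"

Whether the solver can find consistent builds inside such windows on all platforms is exactly the hard part.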
Hi all,
Thanks for all the interest and for sharing notes! The environments I wanted to support at this time were:
- For use with CUDA 11.x: (main branch) pytorch 1.12 + pytorch lightning 1.5.10 (+ flash-attn dependencies)
- For use with CUDA 12.x: (pl_upgrades) pytorch 2.1 + pytorch lightning 2.x
I just checked the pl_upgrades branch on two systems I have access to with pre-installed CUDA 12, and found that it was working for me. Let me know if folks have issues with these environments.
My understanding was that having an environment with CUDA 11.x + pytorch 2.x is complicated, as the default pytorch 2 packages are built on CUDA 12 (leading to the CUDA mismatch error @vaclavhanzl saw). It looks like @abeebyekeen was able to find a workaround with a lot of elbow grease; thanks for sharing your fix!
I plan on cleaning up the documentation for this project and when I do, I'll add a page regarding the supported environments.
Thanks @jnwei! I am happy to report that the pl_upgrades branch works flawlessly with my CUDA 12, including the compilations in install_third_party_dependencies.sh (the end of the 2nd post here describes my exact environment outside conda).
(And please @jnwei excuse my rather misguided comments on PR #407 - I totally overlooked that you merged to pl_upgrades, not to main.)
Quick question about future plans: is pl_upgrades going to be merged into master?
Hi there, just wanted to follow up on this and ask if there are any plans/timelines to merge pl_upgrades into main to get the CUDA 12 support into openfold? Many thanks in advance!
@vaclavhanzl @jnwei
Hi, thanks for the interest. We're actively working on finalizing the merge of pl_upgrades into main. I'd expect ~3ish weeks.
Thank you @jnwei, appreciate the update!
A quick note on the pytorch 2 / CUDA 12 upgrade:
We've run into some technical issues with the pytorch 2 upgrade. Briefly, we observe large instabilities in our training losses in the pytorch 2 version relative to our pytorch 1 version.
For inference, we're also observing a slight difference between model outputs in pytorch 1 and pytorch 2. The difference in final output coordinates is about RMSD ~0.05 Å for the proteins I've looked at. While these differences might seem small, they may point to a larger issue that is also occurring in training; we're currently looking into it.
Until we find the root cause of the discrepancy, or a way around the training instability, we're not ready to update the main branch to pytorch 2.
Meanwhile, we will upgrade the main branch to use pytorch lightning 2, which has a few features that the team has found useful. I'll also push some changes to pl_upgrades that integrate some of the changes from the main branch and clean up the conda environment / docker for a CUDA 12 / pytorch 2 setup.
We are actively working on debugging the instability, and we'll keep you posted as soon as we are ready to upgrade. Thank you all for your interest and your patience.
Hi @jnwei @vaclavhanzl @abeebyekeen, I am trying to use pytorch 1 for the openfold installation. I have the following dependencies:
cudatoolkit 11.6.2 hfc3e2af_13 conda-forge
debugpy 1.8.2 py310h76e45a6_0 conda-forge
decorator 5.1.1 pyhd8ed1ab_0 conda-forge
deepspeed 0.12.4 pypi_0 pypi
dllogger 1.0.0 pypi_0 pypi
dm-tree 0.1.6 pypi_0 pypi
flash-attn 2.5.9.post1 pypi_0 pypi
hhsuite 3.3.0 py310pl5321hc31ed2c_11 bioconda
hjson 3.1.0 pypi_0 pypi
hmmer 3.3.2 hdbdd923_4 bioconda
kalign2 2.04 h031d066_6 bioconda
mkl 2022.1.0 h84fe81f_915 https://aws-ml-conda.s3.us-west-2.amazonaws.com
mkl-devel 2022.1.0 ha770c72_916 conda-forge
mkl-include 2022.1.0 h84fe81f_915 conda-forge
ml-collections 0.1.1 pyhd8ed1ab_0 conda-forge
mmseqs2 15.6f452 pl5321h6a68c12_2 bioconda
modelcif 0.7 pyhd8ed1ab_0 conda-forge
numpy 1.26.4 py310hb13e2d6_0 conda-forge
python 3.10.14 hd12c33a_0_cpython conda-forge
python-dateutil 2.9.0 pyhd8ed1ab_0 conda-forge
python-tzdata 2024.1 pyhd8ed1ab_0 conda-forge
python_abi 3.10 4_cp310 conda-forge
pytorch 1.12.1 py3.10_cuda11.6_cudnn8.3.2_0 pytorch
pytorch-lightning 2.1.4 pyhd8ed1ab_0 conda-forge
While executing the inference script for a single sequence,
from openfold.config import model_config
from openfold.data import templates, feature_pipeline, data_pipeline
from openfold.data.tools import hhsearch, hmmsearch
from openfold.np import protein
from openfold.utils.script_utils import (load_models_from_command_line, parse_fasta, run_model,
                                         prep_output, relax_protein)
from openfold.utils.tensor_utils import tensor_tree_map
from openfold.utils.trace_utils import (
    pad_feature_dict_seq,
    trace_model_,
)
I am having the following issue:
----> 7 from openfold.utils.script_utils import (load_models_from_command_line, parse_fasta, run_model,
      8                                          prep_output, relax_protein)
      9 from openfold.utils.tensor_utils import tensor_tree_map
     10 from openfold.utils.trace_utils import (
     11     pad_feature_dict_seq,
     12     trace_model_,
     13 )

File ~/openfold/openfold/utils/script_utils.py:10
      7 import numpy
      8 import torch
---> 10 from openfold.model.model import AlphaFold
     11 from openfold.np import residue_constants, protein
     12 from openfold.np.relax import relax

File ~/openfold/openfold/model/model.py:29
     22 from openfold.utils.feats import (
     23     pseudo_beta_fn,
     24     build_extra_msa_feat,
     25     dgram_from_positions,
     26     atom14_to_atom37,
     27 )
     28 from openfold.utils.tensor_utils import masked_mean
---> 29 from openfold.model.embedders import (
     30     InputEmbedder,
     31     InputEmbedderMultimer,
     32     RecyclingEmbedder,
     33     TemplateEmbedder,
     34     TemplateEmbedderMultimer,
     35     ExtraMSAEmbedder,
     36     PreembeddingEmbedder,
     37 )
     38 from openfold.model.evoformer import EvoformerStack, ExtraMSAStack
     39 from openfold.model.heads import AuxiliaryHeads

File ~/openfold/openfold/model/embedders.py:29
     22 from openfold.utils import all_atom_multimer
     23 from openfold.utils.feats import (
     24     pseudo_beta_fn,
     25     dgram_from_positions,
     26     build_template_angle_feat,
     27     build_template_pair_feat,
     28 )
---> 29 from openfold.model.primitives import Linear, LayerNorm
     30 from openfold.model.template import (
     31     TemplatePairStack,
     32     TemplatePointwiseAttention,
     33 )
     34 from openfold.utils import geometry

File ~/openfold/openfold/model/primitives.py:30
     28 fa_is_installed = importlib.util.find_spec("flash_attn") is not None
     29 if fa_is_installed:
---> 30     from flash_attn.bert_padding import unpad_input
     31     from flash_attn.flash_attn_interface import flash_attn_unpadded_kvpacked_func
     33 import torch

File /opt/conda/envs/openf/lib/python3.10/site-packages/flash_attn/__init__.py:3
      1 __version__ = "2.5.9.post1"
----> 3 from flash_attn.flash_attn_interface import (
      4     flash_attn_func,
      5     flash_attn_kvpacked_func,
      6     flash_attn_qkvpacked_func,
      7     flash_attn_varlen_func,
      8     flash_attn_varlen_kvpacked_func,
      9     flash_attn_varlen_qkvpacked_func,
     10     flash_attn_with_kvcache,
     11 )

File /opt/conda/envs/openf/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py:10
      6 import torch.nn as nn
      8 # isort: off
      9 # We need to import the CUDA kernels after importing torch
---> 10 import flash_attn_2_cuda as flash_attn_cuda
     12 # isort: on
     15 def _get_block_size_n(device, head_dim, is_dropout, is_causal):
     16     # This should match the block sizes in the CUDA kernel
ImportError: /opt/conda/envs/openf/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl8GPUTrace13gpuTraceStateE
Any help would be appreciated!
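(A possible lead, not a verified fix: the undefined symbol _ZN3c104impl8GPUTrace13gpuTraceStateE demangles to c10::impl::GPUTrace::gpuTraceState, which appears to come from a pytorch newer than the 1.12.1 listed above, so the prebuilt flash-attn 2.5.9.post1 wheel was likely compiled against a newer torch. Comparing the two would confirm it:

python -c "import torch; print(torch.__version__)"
pip show flash-attn

A flash-attn version matching pytorch 1.12, or a source build against the installed torch, might avoid the ABI mismatch.)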
> I just checked the pl_upgrades branch on two systems I have access to with pre-installed CUDA 12, and found that it was working for me. Let me know if folks have issues with these environments.
It doesn't appear to be working now; see issue #477.
I also recommend installing cuda-nvcc=12.4.131 to get closer to a complete environment; this is preferable to installing it via system packages.
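For example, assuming the nvidia channel (where the cuda-nvcc packages are published):

mamba install -c nvidia cuda-nvcc=12.4.131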