The provided environment YAML for SE3nv installs the CPU-only build of PyTorch when the installed CUDA version is not 11.1.
With a different driver version (in my case, Driver Version: 510.108.03, CUDA Version: 11.6), the CUDA build of PyTorch fails to install, which results in a runtime error when trying to run any of the examples. This is probably because there's no cudatoolkit version 11.1 (which is required by the original SE3nv YAML) for this driver.
To solve it, I installed cudatoolkit 11.6 and PyTorch 1.12.1. I've attached the export of my environment in case someone else encounters this problem.
Thanks for providing this yml for CUDA 11.6! Due to the variation in GPU types and drivers that users have access to, we are not able to make one yml that will run on all setups. As such, we are only providing the yml with support for CUDA 11.1 and leaving it to each user to customize it to work on their setups. This customization will involve changing the cudatoolkit and (possibly) the PyTorch version as you have done here.
A statement in the README would be helpful. 😊
Would you accept a user contribution adapting SE3nv.yml for CUDA 11.x, such as SE3nv-cuda11.6.yml?
May I ask how much time a simple unconditional design should take? I want to check whether it is using resources correctly. On a T4 GPU, the command below takes 1 minute:
RFdiffusion-main/scripts/run_inference.py 'contigmap.contigs=[150-150]' inference.output_prefix=test_outputs/test inference.num_designs=1
1 minute seems correct for 150 residues on a Turing GPU. Newer GPUs will run inference much faster.
@truatpasteurdotfr, yes we would accept user contributions for different versions of the .yml file.
Like what GPUs?
My GPU is a Tesla T4. Is this the Turing you mentioned? @nrbennet
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   46C    P0    35W /  70W |  10767MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
The "T" in T4 stands for Turing which means this GPU has the Turing architecture. Since then NVIDIA has come out with the Ampere, Ada Lovelace, and Hopper architectures (in that order). These newer architectures are a lot faster at inference.
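For anyone unsure which architecture their card belongs to, here is a minimal sketch that maps the CUDA compute capability PyTorch reports to an architecture name. The helper name `arch_name` and the lookup table are my own illustration (not part of RFdiffusion), and the table only lists the data-center and consumer capabilities I'm sure of; anything else is reported as "unknown".

```python
# Compute capability (major, minor) -> NVIDIA architecture name.
# Hypothetical helper for illustration; the table is deliberately incomplete.
ARCH_BY_CAPABILITY = {
    (7, 0): "Volta",
    (7, 5): "Turing",        # e.g. Tesla T4
    (8, 0): "Ampere",        # e.g. A100
    (8, 6): "Ampere",        # e.g. RTX 3080
    (8, 9): "Ada Lovelace",
    (9, 0): "Hopper",
}


def arch_name(major, minor):
    """Return the architecture name for a compute capability, or 'unknown'."""
    return ARCH_BY_CAPABILITY.get((major, minor), "unknown")


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        print(torch.cuda.get_device_name(0), "->", arch_name(major, minor))
    else:
        print("No CUDA device visible to PyTorch")
```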
@nrbennet Thank you for your answer. One more question: how can I check whether it is using the GPU correctly?
I tried to run it inside a Docker container in two cases:
- CPU only
- CUDA available
But I don't see much performance difference.
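One way to check this from inside the SE3nv environment (or the container) is to ask PyTorch directly. The sketch below is only an illustration; `describe_torch_device` is a name I made up, and it uses nothing beyond `torch.version.cuda`, `torch.cuda.is_available()` and `torch.cuda.get_device_name()`.

```python
def describe_torch_device():
    """Report whether PyTorch is a CUDA build and whether it can see a GPU."""
    info = {
        "torch_installed": False,
        "cuda_build": None,       # CUDA version PyTorch was built with
        "cuda_available": False,  # True only if a usable GPU is visible
        "device_name": None,
    }
    try:
        import torch
    except ImportError:
        return info

    info["torch_installed"] = True
    info["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = bool(torch.cuda.is_available())
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    print(describe_torch_device())
```

If `cuda_build` is None, the environment has the CPU-only PyTorch and will never use the GPU. If `cuda_build` is set but `cuda_available` is False, the build is fine but no GPU is visible to the process; in Docker that usually means the container was started without GPU access (for example without `--gpus all`). Watching nvidia-smi while inference runs is another quick check.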
Here are some yml files for CUDA 11.1, 11.6, and 11.7 (which were compatible with dgl):
- https://github.com/truatpasteurdotfr/RFdiffusion/blob/main/env/SE3nv-cuda11.1.yml
- https://github.com/truatpasteurdotfr/RFdiffusion/blob/main/env/SE3nv-cuda11.6.yml
- https://github.com/truatpasteurdotfr/RFdiffusion/blob/main/env/SE3nv-cuda11.7.yml
I also have the "frozen" versions listing the explicit packages:
- https://github.com/truatpasteurdotfr/RFdiffusion/blob/main/env/20230406-1646-SE3nv-cuda111-conda-list--explicit.yml
- https://github.com/truatpasteurdotfr/RFdiffusion/blob/main/env/20230406-1646-SE3nv-cuda116-conda-list--explicit.yml
- https://github.com/truatpasteurdotfr/RFdiffusion/blob/main/env/20230406-1646-SE3nv-cuda117-conda-list--explicit.yml
These were produced with mamba, which was much faster at resolving the dependencies than conda. Of course, YMMV.
My server: NVIDIA-SMI 535.54.03, Driver Version: 535.54.03, CUDA Version: 12.2. Which cudatoolkit and PyTorch versions should I install?
It takes a very long time when I run "conda env create -f env/SE3nv.yml"; it stays at "Solving environment". Does anybody know the reason?
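As a rough guide, the "CUDA Version" that nvidia-smi prints is the newest CUDA runtime the driver supports, and NVIDIA drivers are backward compatible with older toolkits, so a 12.2 driver should be able to run an environment built against cudatoolkit 11.x. The sketch below encodes that rule of thumb. The function names are hypothetical, and the check is deliberately conservative: it ignores CUDA's minor-version compatibility within a major release.

```python
def parse_version(text):
    """Turn a version string such as '11.6' into a comparable (11, 6) tuple."""
    major, minor = text.strip().split(".")[:2]
    return int(major), int(minor)


def driver_supports_toolkit(driver_cuda, toolkit_cuda):
    """Conservative check: the toolkit must not be newer than the driver's CUDA version.

    driver_cuda  -- the 'CUDA Version' shown by nvidia-smi
    toolkit_cuda -- the cudatoolkit / pytorch-cuda version in the environment
    """
    return parse_version(toolkit_cuda) <= parse_version(driver_cuda)


if __name__ == "__main__":
    # Driver reporting CUDA 12.2 with an environment built for CUDA 11.7.
    print(driver_supports_toolkit("12.2", "11.7"))
```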
I had it too when I played with this and it could be fixed by either re-ordering packages in SE3nv.yml or using mamba instead of conda.
I just installed Anaconda. Could you tell me in detail how to reorder the packages in SE3nv.yml? I just don't know how to do it.
Unfortunately I can't recall what was the order that worked. Try different orderings or just install mamba.
OK, Thanks, I will try mamba then.
Hey! Just wanted to know if anyone made progress here.
I have Driver Version: 535.86.10 and CUDA Version: 12.2, and the "Solving environment" step takes forever (50+ minutes before I aborted it). Is Mamba the way to go here? I need to install this software ASAP.
I don't know if it will work in your case, but using mamba (or --solver libmamba) does help a bunch on complex installs.
That said, I'd recommend installing RFdiffusion into a fresh conda environment, rather than trying to combine it with a large pre-existing environment. (Doing so will also help with solving speed.)
This happened at the "conda env create -f env/SE3nv.yml" step, so I'm not entirely sure what you mean by creating a fresh environment. My approach for now would be to install the packages in a new environment I just created manually since it's not really much in the .yml anyway. I'll see if that works, then check out mamba if it doesn't, and report back with any findings
So with Mamba, the installation itself worked, but now there are problems with CUDA. A colleague tried to run the scripts and got a ModuleNotFoundError for rfdiffusion, which I addressed by adding the necessary PYTHONPATH to the environment. Now, when I tried to run the "Basic execution - an unconditional monomer" example with ./scripts/run_inference.py 'contigmap.contigs=[150-150]' inference.output_prefix=test_outputs/test inference.num_designs=10, it reported that it couldn't detect a GPU (despite 10 3080s being installed) and ran for some time before producing the following error output:
[2023-10-03 13:50:00,649][rfdiffusion.inference.model_runners][INFO] - Sequence init: ------------------------------------------------------------------------------------------------------------------------------------------------------
Error executing job with overrides: ['contigmap.contigs=[150-150]', 'inference.output_prefix=test_outputs/test', 'inference.num_designs=10']
Traceback (most recent call last):
File "/software/RFdiffusion/./scripts/run_inference.py", line 94, in main
px0, x_t, seq_t, plddt = sampler.sample_step(
File "/software/RFdiffusion/rfdiffusion/inference/model_runners.py", line 664, in sample_step
msa_prev, pair_prev, px0, state_prev, alpha, logits, plddt = self.model(msa_masked,
File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/software/RFdiffusion/rfdiffusion/RoseTTAFoldModel.py", line 102, in forward
msa, pair, R, T, alpha_s, state = self.simulator(seq, msa_latent, msa_full, pair, xyz[:,:,:3],
File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/software/RFdiffusion/rfdiffusion/Track_module.py", line 420, in forward
msa_full, pair, R_in, T_in, state, alpha = self.extra_block[i_m](msa_full,
File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/software/RFdiffusion/rfdiffusion/Track_module.py", line 332, in forward
R, T, state, alpha = self.str2str(msa, pair, R_in, T_in, xyz, state, idx, motif_mask=motif_mask, top_k=0)
File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/cuda/amp/autocast_mode.py", line 141, in decorate_autocast
return func(*args, **kwargs)
File "/software/RFdiffusion/rfdiffusion/Track_module.py", line 266, in forward
shift = self.se3(G, node.reshape(B*L, -1, 1), l1_feats, edge_feats)
File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/software/RFdiffusion/rfdiffusion/SE3_network.py", line 83, in forward
return self.se3(G, node_features, edge_features)
File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/se3_transformer-1.0.0-py3.9.egg/se3_transformer/model/transformer.py", line 140, in forward
basis = basis or get_basis(graph.edata['rel_pos'], max_degree=self.max_degree, compute_gradients=False,
File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/se3_transformer-1.0.0-py3.9.egg/se3_transformer/model/basis.py", line 166, in get_basis
with nvtx_range('spherical harmonics'):
File "/software/anaconda/envs/SE3nv/lib/python3.9/contextlib.py", line 119, in __enter__
return next(self.gen)
File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/cuda/nvtx.py", line 59, in range
range_push(msg.format(*args, **kwargs))
File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/cuda/nvtx.py", line 28, in range_push
return _nvtx.rangePushA(msg)
File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/cuda/nvtx.py", line 9, in _fail
raise RuntimeError("NVTX functions not installed. Are you sure you have a CUDA build?")
RuntimeError: NVTX functions not installed. Are you sure you have a CUDA build?
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Another user in issue #38 mentioned experiencing something similar and solving it with another yml from here, which I'd need assistance with.
@zimb3l have you tried one of the yml files from https://github.com/RosettaCommons/RFdiffusion/issues/14#issuecomment-1505150974 ?
@LiorZ Thank you very much for sharing the setting!
When RFdiffusion first came out I had no end of installation issues that had to do with versioning and getting it to run on CUDA. I had tried to install numerous versions and always ran into another problem, but I guess something somewhere was updated, because this time the installation was fairly painless, with only one additional line fixing the original installer:
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
As with the original poster, this replaced a cpu version on my computer. Maybe that will help someone. It seemed to be enough to get the basic execution (unconditional monomer) running without issue for me.
I do wish the causes were more clear to me. It feels like black boxes running on magic and hope.
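One thing that makes this a bit less of a black box: the build column of `conda list` usually says whether conda resolved a CPU or a CUDA variant of PyTorch. Below is a minimal sketch that reads that column from `conda list` output. The helper names and the sample text are made up for illustration, and the substring check is an assumption about how build strings are typically named, not a guarantee.

```python
def find_pytorch_build(conda_list_text):
    """Return the build string of the 'pytorch' package from `conda list` output."""
    for line in conda_list_text.splitlines():
        fields = line.split()
        # Columns are: name, version, build, channel (channel may be absent).
        if len(fields) >= 3 and fields[0] == "pytorch":
            return fields[2]
    return None


def classify_pytorch_build(build_string):
    """Classify a build string as 'cuda', 'cpu', or 'unknown' by substring."""
    lowered = build_string.lower()
    if "cuda" in lowered:
        return "cuda"
    if "cpu" in lowered:
        return "cpu"
    return "unknown"


if __name__ == "__main__":
    # Illustrative sample, not copied from a real environment.
    sample = """# Name      Version   Build                          Channel
numpy       1.23.5    py39h14f4228_0
pytorch     1.13.1    py3.9_cuda11.7_cudnn8.5.0_0    pytorch
"""
    build = find_pytorch_build(sample)
    print(build, "->", classify_pytorch_build(build))
```

In practice you would paste in the output of `conda list pytorch` from the SE3nv environment; a "cpu" result there matches the NVTX "Are you sure you have a CUDA build?" error reported above.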