RFdiffusion

The provided environment yaml for SE3nv installs the CPU-only build of PyTorch when the installed CUDA version is not 11.1

Open LiorZ opened this issue 2 years ago • 23 comments

With a different driver version (in my case, Driver Version: 510.108.03, CUDA Version: 11.6), the environment fails to install the CUDA build of PyTorch, which results in a runtime error when trying to run any of the examples. This is probably because no cudatoolkit 11.1 (which the original SE3nv yaml requires) is available for this driver.

To solve it, I installed cudatoolkit 11.6 and PyTorch 1.12.1. I have attached an export of my environment in case someone else encounters this problem.
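A quick way to tell whether an environment ended up with the CPU-only build is to ask PyTorch itself. The sketch below is a minimal check, not part of RFdiffusion; the helper names `describe_torch_build` and `is_cpu_only` are hypothetical. It relies on `torch.version.cuda` being `None` for builds compiled without CUDA.

```python
import importlib.util


def describe_torch_build():
    """Return facts about the installed PyTorch build, or None if torch is absent."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return {
        "torch_version": torch.__version__,
        # torch.version.cuda is None when PyTorch was built without CUDA.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }


def is_cpu_only(info):
    """True when PyTorch is installed but was built without CUDA support."""
    return info is not None and info["built_with_cuda"] is None


info = describe_torch_build()
if info is None:
    print("PyTorch is not installed in this environment")
else:
    print(info)
    print("CPU-only build" if is_cpu_only(info) else "CUDA build")
```

Run it inside the activated SE3nv environment. If it reports a CPU-only build, the solver picked the wrong PyTorch package, which matches the symptom described here.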

environment.yaml.gz

LiorZ avatar Apr 02 '23 07:04 LiorZ

Thanks for providing this yml for CUDA 11.6! Due to the variation in GPU types and drivers that users have access to, we are not able to make one yml that will run on all setups. As such, we are only providing the yml with support for CUDA 11.1 and leaving it to each user to customize it to work on their setups. This customization will involve changing the cudatoolkit and (possibly) the PyTorch version as you have done here.
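As a rough illustration of that customization, the sketch below rewrites version pins in a conda dependency list. The `SAMPLE` text is an illustrative fragment and an assumption, not a copy of env/SE3nv.yml, and `repin` is a hypothetical helper; the target versions are the ones reported as working earlier in this thread.

```python
import re

# Illustrative fragment only (assumed layout, not the real env/SE3nv.yml).
SAMPLE = """\
dependencies:
  - python=3.9
  - pytorch=1.9
  - cudatoolkit=11.1
"""


def repin(yaml_text, package, new_version):
    """Replace the version pinned for `package` in a conda dependency list."""
    pattern = re.compile(
        rf"^(\s*-\s*{re.escape(package)})=\S+[ \t]*$", re.MULTILINE
    )
    return pattern.sub(lambda m: f"{m.group(1)}={new_version}", yaml_text)


customized = repin(repin(SAMPLE, "cudatoolkit", "11.6"), "pytorch", "1.12.1")
print(customized)
```

Other packages in the real file may also carry a CUDA version in their name or pin, so check each dependency by hand before creating the environment.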

nrbennet avatar Apr 03 '23 00:04 nrbennet

A statement in the README will be helpful. 😊

longlongman avatar Apr 03 '23 12:04 longlongman

would you accept user contribution adapting SE3nv.yml for cuda11.x? such as SE3nv-cuda11.6.yml

truatpasteurdotfr avatar Apr 06 '23 13:04 truatpasteurdotfr

May I ask how much time a simple unconditional design should take? I want to check whether it is using resources correctly. On a T4 GPU, the command below takes 1 minute:

RFdiffusion-main/scripts/run_inference.py 'contigmap.contigs=[150-150]' inference.output_prefix=test_outputs/test inference.num_designs=1
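To get a repeatable number instead of an estimate, a run like the one above can be wrapped in a small timer. This is a minimal sketch; `time_command` is a hypothetical helper, and the demonstration times a trivial Python process so that it runs anywhere.

```python
import subprocess
import sys
import time


def time_command(cmd):
    """Run `cmd` (a list of arguments) and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(cmd, check=False)
    return completed.returncode, time.perf_counter() - start


# Demonstration with a trivial command. For the design run, pass the
# command from this comment as a list instead, for example:
# ["RFdiffusion-main/scripts/run_inference.py",
#  "contigmap.contigs=[150-150]",
#  "inference.output_prefix=test_outputs/test",
#  "inference.num_designs=1"]
exit_code, elapsed = time_command([sys.executable, "-c", "pass"])
print(f"exit code {exit_code}, {elapsed:.2f} s")
```

The measured time includes model loading, so it will be longer than the sampling time alone.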

smilenaderi avatar Apr 06 '23 19:04 smilenaderi

1 minute seems correct for 150 residues on a Turing GPU. Newer GPUs will run inference much faster.

nrbennet avatar Apr 06 '23 19:04 nrbennet

@truatpasteurdotfr, yes we would accept user contributions for different versions of the .yml file.

nrbennet avatar Apr 06 '23 19:04 nrbennet

Like what gpus?

My GPU is a Tesla T4. Is this the Turing you mentioned? @nrbennet

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   46C    P0    35W /  70W |  10767MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

smilenaderi avatar Apr 06 '23 19:04 smilenaderi

The "T" in T4 stands for Turing which means this GPU has the Turing architecture. Since then NVIDIA has come out with the Ampere, Ada Lovelace, and Hopper architectures (in that order). These newer architectures are a lot faster at inference.
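The architecture can also be read from the GPU's compute capability, which PyTorch exposes through `torch.cuda.get_device_capability`. The sketch below maps a few well-known capability values to architecture names; `architecture_name` is a hypothetical helper and the table is deliberately incomplete.

```python
# Compute capability (major, minor) for a few NVIDIA architectures.
# This table is not exhaustive.
ARCHITECTURES = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada Lovelace",
    (9, 0): "Hopper",
}


def architecture_name(major, minor):
    """Return the architecture name for a compute capability, or 'unknown'."""
    return ARCHITECTURES.get((major, minor), "unknown")


try:
    import torch
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    capability = torch.cuda.get_device_capability(0)
    print(torch.cuda.get_device_name(0), architecture_name(*capability))
else:
    print("No CUDA device visible to PyTorch")
```

A Tesla T4 reports compute capability 7.5, which the table maps to Turing.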

nrbennet avatar Apr 06 '23 20:04 nrbennet

@nrbennet Thank you for your answer. One more question: how can I check whether it uses the GPU correctly?

I tried to run it inside a Docker container in two cases

  • CPU only
  • CUDA available

But I don't see much performance difference
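One way to check is to sample `nvidia-smi` from a second terminal while inference is running and look at utilization and memory use. The sketch below is a minimal version of that idea; `parse_gpu_rows` and `sample_gpus` are hypothetical helpers, and the query fields are standard `nvidia-smi --query-gpu` properties.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def _to_int(field):
    """Convert a CSV field to int, or None if the driver reports no number."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_gpu_rows(csv_text):
    """Parse 'name, utilization, memory' lines into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        name, util, mem = (field.strip() for field in line.rsplit(",", 2))
        rows.append(
            {
                "name": name,
                "util_percent": _to_int(util),
                "memory_used_mib": _to_int(mem),
            }
        )
    return rows


def sample_gpus():
    """Return one utilization sample per GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    return parse_gpu_rows(result.stdout) if result.returncode == 0 else []


print(sample_gpus())
```

If utilization stays at 0% for the whole run and the timings match the CPU-only container, PyTorch is most likely not using the GPU.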

smilenaderi avatar Apr 07 '23 13:04 smilenaderi

Here are some yml files for CUDA 11.1, 11.6, and 11.7 (which were compatible with dgl):

  • https://github.com/truatpasteurdotfr/RFdiffusion/blob/main/env/SE3nv-cuda11.1.yml
  • https://github.com/truatpasteurdotfr/RFdiffusion/blob/main/env/SE3nv-cuda11.6.yml
  • https://github.com/truatpasteurdotfr/RFdiffusion/blob/main/env/SE3nv-cuda11.7.yml

I also have the "frozen" versions listing the explicit packages:

  • https://github.com/truatpasteurdotfr/RFdiffusion/blob/main/env/20230406-1646-SE3nv-cuda111-conda-list--explicit.yml
  • https://github.com/truatpasteurdotfr/RFdiffusion/blob/main/env/20230406-1646-SE3nv-cuda116-conda-list--explicit.yml
  • https://github.com/truatpasteurdotfr/RFdiffusion/blob/main/env/20230406-1646-SE3nv-cuda117-conda-list--explicit.yml

These were produced with mamba, which was much faster at resolving the dependencies than conda; of course, ymmv.

truatpasteurdotfr avatar Apr 12 '23 12:04 truatpasteurdotfr

My server reports NVIDIA-SMI 535.54.03, Driver Version: 535.54.03, CUDA Version: 12.2. What can I do? Which cudatoolkit and PyTorch versions should I install?
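For context, the "CUDA Version" printed by nvidia-smi is the newest CUDA runtime the driver supports, and NVIDIA drivers remain compatible with older toolkits. A driver reporting 12.2 can therefore run an environment built against cudatoolkit 11.x. The sketch below encodes that rule conservatively; `driver_supports` is a hypothetical helper and it ignores NVIDIA's minor-version compatibility, so it may reject combinations that would in fact run.

```python
def parse_version(text):
    """Turn a version string such as '11.6' into a tuple of ints, (11, 6)."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(driver_cuda, toolkit_cuda):
    """Conservative check: the toolkit must not be newer than the driver's CUDA version."""
    return parse_version(toolkit_cuda) <= parse_version(driver_cuda)


# Driver CUDA version from this comment against toolkits mentioned in the thread.
for toolkit in ("11.1", "11.6", "11.7"):
    print(toolkit, driver_supports("12.2", toolkit))
```

Under this rule, any of the CUDA 11.x yml files shared in this thread is a candidate for a driver that reports CUDA 12.2.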

Githubguanxudong avatar Aug 21 '23 05:08 Githubguanxudong

It takes a very long time when I run "conda env create -f env/SE3nv.yml"; it stays at "Solving environment". Does anybody know the reason?

ChengkuiZhao avatar Aug 30 '23 06:08 ChengkuiZhao

I had it too when I played with this and it could be fixed by either re-ordering packages in SE3nv.yml or using mamba instead of conda.

jkosinski avatar Aug 30 '23 07:08 jkosinski

I just installed Anaconda... Could you tell me in detail how to re-order the packages in SE3nv.yml? I just don't know how to do it.

ChengkuiZhao avatar Aug 30 '23 07:08 ChengkuiZhao

Unfortunately I can't recall what was the order that worked. Try different orderings or just install mamba.

jkosinski avatar Aug 30 '23 07:08 jkosinski

OK, Thanks, I will try mamba then.

ChengkuiZhao avatar Aug 30 '23 07:08 ChengkuiZhao

Hey! Just wanted to know if anyone made progress here.

I have Driver Version: 535.86.10 and CUDA Version: 12.2, and the Solving Environment takes forever (50+ minutes before I aborted it). Is Mamba the way to go here? I need to install this software asap

zimb3l avatar Sep 28 '23 12:09 zimb3l

I don't know if it will work in your case, but using mamba (or --solver libmamba) does help a bunch on complex installs.

That said, I'd recommend installing RFdiffusion into a fresh conda environment, rather than trying to combine it with a large pre-existing environment. (Doing so will also help with solving speed.)
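The two options can be sketched as a small helper that prefers mamba when it is on the PATH and otherwise asks conda for the libmamba solver. `env_create_command` is a hypothetical helper. The `--solver libmamba` flag assumes a conda release recent enough to accept it, with the libmamba solver plugin installed; older releases differ.

```python
import shutil


def env_create_command(yml_path, which=shutil.which):
    """Build an environment-creation command, preferring mamba when available."""
    if which("mamba"):
        return ["mamba", "env", "create", "-f", yml_path]
    # Assumes a conda version that accepts --solver and has libmamba installed.
    return ["conda", "env", "create", "-f", yml_path, "--solver", "libmamba"]


print(" ".join(env_create_command("env/SE3nv.yml")))
```

The helper only prints the command, so it can be reviewed before it is run in a shell.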

roccomoretti avatar Sep 28 '23 13:09 roccomoretti

This happened at the "conda env create -f env/SE3nv.yml" step, so I'm not entirely sure what you mean by creating a fresh environment. My approach for now would be to install the packages in a new environment I just created manually since it's not really much in the .yml anyway. I'll see if that works, then check out mamba if it doesn't, and report back with any findings

zimb3l avatar Sep 28 '23 13:09 zimb3l

So with Mamba the installation itself worked, but now there are problems with CUDA. A colleague tried to run the scripts and got a ModuleNotFoundError for rfdiffusion, which I fixed by adding the necessary PYTHONPATH to the environment. Now, when I tried to run the "Basic execution - an unconditional monomer" example with ./scripts/run_inference.py 'contigmap.contigs=[150-150]' inference.output_prefix=test_outputs/test inference.num_designs=10, it reported that it couldn't detect a GPU (despite ten 3080s being installed) and ran for some time before producing the following error output:

[2023-10-03 13:50:00,649][rfdiffusion.inference.model_runners][INFO] - Sequence init: ------------------------------------------------------------------------------------------------------------------------------------------------------
Error executing job with overrides: ['contigmap.contigs=[150-150]', 'inference.output_prefix=test_outputs/test', 'inference.num_designs=10']
Traceback (most recent call last):
  File "/software/RFdiffusion/./scripts/run_inference.py", line 94, in main
    px0, x_t, seq_t, plddt = sampler.sample_step(
  File "/software/RFdiffusion/rfdiffusion/inference/model_runners.py", line 664, in sample_step
    msa_prev, pair_prev, px0, state_prev, alpha, logits, plddt = self.model(msa_masked,
  File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/software/RFdiffusion/rfdiffusion/RoseTTAFoldModel.py", line 102, in forward
    msa, pair, R, T, alpha_s, state = self.simulator(seq, msa_latent, msa_full, pair, xyz[:,:,:3],
  File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/software/RFdiffusion/rfdiffusion/Track_module.py", line 420, in forward
    msa_full, pair, R_in, T_in, state, alpha = self.extra_block[i_m](msa_full,
  File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/software/RFdiffusion/rfdiffusion/Track_module.py", line 332, in forward
    R, T, state, alpha = self.str2str(msa, pair, R_in, T_in, xyz, state, idx, motif_mask=motif_mask, top_k=0)
  File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/cuda/amp/autocast_mode.py", line 141, in decorate_autocast
    return func(*args, **kwargs)
  File "/software/RFdiffusion/rfdiffusion/Track_module.py", line 266, in forward
    shift = self.se3(G, node.reshape(B*L, -1, 1), l1_feats, edge_feats)
  File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/software/RFdiffusion/rfdiffusion/SE3_network.py", line 83, in forward
    return self.se3(G, node_features, edge_features)
  File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/se3_transformer-1.0.0-py3.9.egg/se3_transformer/model/transformer.py", line 140, in forward
    basis = basis or get_basis(graph.edata['rel_pos'], max_degree=self.max_degree, compute_gradients=False,
  File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/se3_transformer-1.0.0-py3.9.egg/se3_transformer/model/basis.py", line 166, in get_basis
    with nvtx_range('spherical harmonics'):
  File "/software/anaconda/envs/SE3nv/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/cuda/nvtx.py", line 59, in range
    range_push(msg.format(*args, **kwargs))
  File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/cuda/nvtx.py", line 28, in range_push
    return _nvtx.rangePushA(msg)
  File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/cuda/nvtx.py", line 9, in _fail
    raise RuntimeError("NVTX functions not installed. Are you sure you have a CUDA build?")
RuntimeError: NVTX functions not installed. Are you sure you have a CUDA build?

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Another user in issue #38 mentioned experiencing a similar error and solving it with another yml from here. I'd need assistance with that.
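For the ModuleNotFoundError part of this report, it can help to confirm where Python resolves the package from before running the script. The sketch below is a minimal check using the standard library; `locate` is a hypothetical helper. The NVTX error in the traceback is a separate problem: the message itself asks whether the installed PyTorch is a CUDA build.

```python
import importlib.util
import os


def locate(module_name):
    """Return the path a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    if spec.origin:
        return spec.origin
    # Namespace packages have no origin, so fall back to their search path.
    return next(iter(spec.submodule_search_locations or []), None)


print("PYTHONPATH =", os.environ.get("PYTHONPATH", "(not set)"))
for name in ("rfdiffusion", "se3_transformer", "torch"):
    print(name, "->", locate(name) or "not importable")
```

Run it with the same interpreter and environment variables as the inference script. A "not importable" result for rfdiffusion means the PYTHONPATH change did not reach that process.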

zimb3l avatar Oct 03 '23 11:10 zimb3l

@zimb3l have you tried one of the yml files from https://github.com/RosettaCommons/RFdiffusion/issues/14#issuecomment-1505150974 ?

truatpasteurdotfr avatar Oct 06 '23 15:10 truatpasteurdotfr

@LiorZ Thank you very much for sharing the setting!

WenyueDai avatar Feb 15 '24 21:02 WenyueDai

When RFdiffusion first came out I had no end of installation issues that had to do with versioning and getting it to run on cuda. I had tried to install numerous versions and always ran into another problem, but I guess something somewhere was updated because this time the installation was fairly painless with only one additional line fixing the original installer.

conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia

As with the original poster, this replaced a cpu version on my computer. Maybe that will help someone. It seemed to be enough to get the basic execution (unconditional monomer) running without issue for me.
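Whether conda installed a CPU or CUDA package can usually be read from the build string that `conda list pytorch` prints next to the version. The sketch below classifies such a string; `build_flavor` is a hypothetical helper, and the example build strings are assumptions about the naming used by common channels, not output taken from this thread.

```python
def build_flavor(build_string):
    """Classify a conda build string as 'cuda', 'cpu', or 'unknown'."""
    lowered = build_string.lower()
    if "cuda" in lowered:
        return "cuda"
    if "cpu" in lowered:
        return "cpu"
    return "unknown"


# Assumed examples of build strings; check the real ones with `conda list pytorch`.
for build in ("py3.9_cuda11.7_cudnn8.5.0_0", "py3.9_cpu_0"):
    print(build, "->", build_flavor(build))
```

A "cpu" result after running the install line above would mean the replacement did not take effect in the active environment.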

I do wish the causes were more clear to me. It feels like black boxes running on magic and hope.

TristynAlxander avatar Aug 24 '24 14:08 TristynAlxander