
Cannot install RFdiffusion on aarch64, Ubuntu 22.04, CUDA 12.8 (GH200 GPUs)

Open navvye opened this issue 4 months ago • 6 comments

I'm having a lot of trouble installing RFdiffusion on aarch64 (GH200 GPUs). Here are my specs and the errors I keep running into while trying to install it.

Specs:

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.5 LTS
Release:        22.04
Codename:       jammy

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:21:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0

|=========================================+========================+======================|
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 480GB             On  |   00000000:DD:00.0 Off |                    0 |
| N/A   32C    P0             81W /  700W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

Errors:

  1. There's an error with torch.load(...) where I have to set weights_only=False in one of the libraries. This is trivially fixable (a sketch of the change is below, after the traceback), so not a huge issue.
  2. DGL issues. DGL doesn't seem to work with this setup for some reason; I keep getting the error below.

Error executing job with overrides: ['contigmap.contigs=[150-150]', 'inference.output_prefix=test_outputs/test', 'inference.num_designs=10']
Traceback (most recent call last):
  File "/home/ubuntu/RFdiffusion/./scripts/run_inference.py", line 94, in main
    px0, x_t, seq_t, plddt = sampler.sample_step(
  File "/home/ubuntu/RFdiffusion/rfdiffusion/inference/model_runners.py", line 686, in sample_step
    msa_prev, pair_prev, px0, state_prev, alpha, logits, plddt = self.model(msa_masked,
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/RFdiffusion/rfdiffusion/RoseTTAFoldModel.py", line 103, in forward
    msa, pair, R, T, alpha_s, state = self.simulator(seq, msa_latent, msa_full, pair, xyz[:,:,:3],
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/RFdiffusion/rfdiffusion/Track_module.py", line 420, in forward
    msa_full, pair, R_in, T_in, state, alpha = self.extra_block[i_m](msa_full,
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/RFdiffusion/rfdiffusion/Track_module.py", line 332, in forward
    R, T, state, alpha = self.str2str(msa, pair, R_in, T_in, xyz, state, idx, motif_mask=motif_mask, cyclic_reses=cyclic_reses, top_k=0)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/ubuntu/RFdiffusion/rfdiffusion/Track_module.py", line 266, in forward
    shift = self.se3(G, node.reshape(BL, -1, 1), l1_feats, edge_feats)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/RFdiffusion/rfdiffusion/SE3_network.py", line 83, in forward
    return self.se3(G, node_features, edge_features)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/se3_transformer-1.0.0-py3.9.egg/se3_transformer/model/transformer.py", line 150, in forward
    node_feats = self.graph_modules(node_feats, edge_feats, graph=graph, basis=basis)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/se3_transformer-1.0.0-py3.9.egg/se3_transformer/model/transformer.py", line 46, in forward
    input = module(input, *args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/se3_transformer-1.0.0-py3.9.egg/se3_transformer/model/layers/attention.py", line 157, in forward
    fused_key_value = self.to_key_value(node_features, edge_features, graph, basis)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/se3_transformer-1.0.0-py3.9.egg/se3_transformer/model/layers/convolution.py", line 281, in forward
    src, dst = graph.edges()
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/dgl/view.py", line 179, in __call__
    return self._graph.all_edges(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/dgl/heterograph.py", line 3355, in all_edges
    src, dst, eid = self._graph.edges(self.get_etype_id(etype), order)
  File "/home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/dgl/heterograph_index.py", line 645, in edges
    edge_array = _CAPI_DGLHeteroEdges(self, int(etype), order)
  File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 227, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 217, in dgl._ffi._cy3.core.FuncCall3
dgl._ffi.base.DGLError: [21:19:25] /opt/dgl/src/array/array.cc:42: Operator Range does not support cuda device.
Stack trace:
  [bt] (0) /home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/dgl/libdgl.so(+0xe3884) [0xe9c90f303884]
  [bt] (1) /home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/dgl/libdgl.so(dgl::aten::Range(long, long, unsigned char, DGLContext)+0xf4) [0xe9c90f304b5c]
  [bt] (2) /home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/dgl/libdgl.so(dgl::UnitGraph::COO::Edges(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const+0x98) [0xe9c90f6b14e0]
  [bt] (3) /home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/dgl/libdgl.so(dgl::UnitGraph::Edges(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const+0xbc) [0xe9c90f6a14f4]
  [bt] (4) /home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/dgl/libdgl.so(dgl::HeteroGraph::Edges(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const+0x40) [0xe9c90f5c29b0]
  [bt] (5) /home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/dgl/libdgl.so(+0x3adb48) [0xe9c90f5cdb48]
  [bt] (6) /home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/dgl/libdgl.so(DGLFuncCall+0x4c) [0xe9c90f55fa4c]
  [bt] (7) /home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/dgl/_ffi/_cy3/core.cpython-39-aarch64-linux-gnu.so(+0x15e74) [0xe9c90f1f5e74]
  [bt] (8) /home/ubuntu/miniconda3/envs/SE3nv/lib/python3.9/site-packages/dgl/_ffi/_cy3/core.cpython-39-aarch64-linux-gnu.so(+0x16640) [0xe9c90f1f6640]

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
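
For reference, this is roughly what I mean by the fix for point 1. The exact file and surrounding arguments differ depending on which library raises the error, and checkpoint_path/device below are just placeholders, so treat it as a sketch rather than the actual patch:

import torch

checkpoint_path = "models/Base_ckpt.pt"   # placeholder path to a trusted RFdiffusion checkpoint
device = "cpu"                            # or "cuda"

# PyTorch 2.6+ changed the torch.load default to weights_only=True, which rejects
# the pickled objects inside the checkpoints; opting out restores the old behaviour
# (only do this for checkpoints you trust).
ckpt = torch.load(checkpoint_path, map_location=device, weights_only=False)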


Steps to Reproduce

git clone https://github.com/RosettaCommons/RFdiffusion.git
cd RFdiffusion

conda env create -f env/SE3nv.yml

conda activate SE3nv
cd env/SE3Transformer
pip install --no-cache-dir -r requirements.txt
python setup.py install
cd ../.. # change into the root directory of the repository
pip install -e . # install the rfdiffusion module from the root of the repository


python scripts/run_inference.py \
    inference.output_prefix=test_outputs/test \
    inference.num_designs=10 \
    contigmap.contigs=[150-150]

Any help would be appreciated!

navvye avatar Sep 07 '25 21:09 navvye

I'm not sure why DGL isn't working with your GPUs; from the 'Steps to Reproduce' it looks like you are following the RFdiffusion installation instructions correctly.

Have you tried using the Docker image for RFdiffusion? It may be a workaround for your problem and should work on your system.

rclune avatar Sep 08 '25 17:09 rclune

I tried using the Docker image, but it leads to the same DGL error.

If I had to guess, I'd say it's probably because DGL doesn't have good support for aarch64?

navvye avatar Sep 08 '25 17:09 navvye

Apologies for missing that detail! Our Docker image also does not fully work on AArch64 architectures, though we are aware of the issue and are working on changing the image.

Do you know which version of DGL is installed (likely dgl-cuda11.1, based on the SE3nv.yml)? It is possible that updating it would get RFdiffusion working on your system, though it may also cause other issues.
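
Something like this quick check inside the SE3nv environment would tell us what is actually installed and whether that DGL build can use the GPU at all (just a diagnostic sketch, not a fix):

import dgl
import torch

print("dgl:", dgl.__version__)          # pip CUDA wheels usually carry a +cuXXX suffix
print("torch:", torch.__version__, "| built for CUDA", torch.version.cuda)
print("CUDA visible to torch:", torch.cuda.is_available())

# a CPU-only libdgl typically fails as soon as a graph is moved to the GPU
g = dgl.graph(([0, 1], [1, 2])).to("cuda")
print("graph device:", g.device)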

rclune avatar Sep 08 '25 17:09 rclune

I've tried dgl-cuda12.8, but that doesn't work well either. I shall try once again today and update you with any changes!

navvye avatar Sep 08 '25 18:09 navvye

Okay, I tried installing DGL with CUDA 12.4 and 12.8 on aarch64, but that doesn't work either.

However, switching from aarch64 to x86_64 on a different VM works.

It would be great if you guys could add support for aarch64, since it's pretty widely used with GH200s/B200s.

navvye avatar Sep 09 '25 07:09 navvye

Glad you were able to get it working, even if it meant using a different VM.

Adding support for AArch64/ARM systems is on our list of things to work on; we will keep the community updated on our progress.

rclune avatar Sep 09 '25 15:09 rclune