Hemichannel Test: Memory Issues (After the Chunking Update)

Open amelie-iska opened this issue 1 year ago • 11 comments

Hi all, just ran into this error on a hemichannel (6 connexin) system (same as before). I can run this prediction with ColabFold, but not with Boltz-1. YAML Input:

version: 1  # Optional, defaults to 1
sequences:
  - protein:
      id: [A,B,C,D,E,F]
      sequence: MGDWSALGRLLDKVQAYSTAGGKVWLSVLFIFRILLLGTAVESAWGDEQSAFVCNTQQPGCENVCYDKSFPISHVRFWVLQIIFVSTPTLLYLAHVFYLMRKEEKLNRKEEELKMVQNEGGNVDMHLKQIEIKKFKYGLEEHGKVKMRGGLLRTYIISILFKSVFEVGFIIIQWYMYGFSLSAIYTCKRDPCPHQVDCFLSRPTEKTIFIWFMLIVSIVSLALNIIELFYVTYKSIKDGIKGKKDPFSATNDAVISGKECGSPKYAYFNGCSSPTAPMSPPGYKLVTGERNPSSCRNYNKQASEQNWANYSAEQNRMGQAGSTISNTHAQPFDFSDEHQNTKKMAPGHEMQPLTILDQRPSSRASSHASSRPRPDDLEI
  - protein:
      id: [G,H,I]
      sequence: FSLESERP
  - ligand:
      id: [J,K,L]
      smiles: CC(C)C[C@H](NC(=O)[C@H](CO)NC(=O)[C@@H](N)Cc1ccccc1)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](CO)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N1CCC[C@H]1C(=O)O

Run Command:

boltz predict examples/connexin-peptide.yaml --recycling_steps 20 --diffusion_samples 5 --use_msa_server

Output:

(boltz-1) lily@il-gpu04:~/amelie/Workspace/boltz$ boltz predict examples/connexin-peptide.yaml --recycling_steps 20 --diffusion_samples 5 --use_msa_server
Downloading the model weights to /home/lily/.boltz/boltz1_conf.ckpt. You may change the cache directory with the --cache flag.
Checking input data.
Running predictions for 1 structure
Processing input data.
  0%|          | 0/1 [00:00<?, ?it/s]
Generating MSA for examples/connexin-peptide.yaml with 2 protein entities.
COMPLETE: 100%|██████████| 300/300 [elapsed: 00:01 remaining: 00:00]
COMPLETE: 100%|██████████| 300/300 [elapsed: 00:00 remaining: 00:00]
Sleeping for 8s. Reason: PENDING | 0/300 [elapsed: 00:00 remaining: ?]
Sleeping for 7s. Reason: RUNNING | 8/300 [elapsed: 00:09 remaining: 05:33]
Sleeping for 9s. Reason: RUNNING | 15/300 [elapsed: 00:16 remaining: 05:15]
Sleeping for 9s. Reason: RUNNING | 24/300 [elapsed: 00:26 remaining: 04:59]
Sleeping for 8s. Reason: RUNNING | 33/300 [elapsed: 00:35 remaining: 04:47]
COMPLETE: 100%|██████████| 300/300 [elapsed: 00:45 remaining: 00:00]
100%|██████████| 1/1 [00:48<00:00, 48.80s/it]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/home/lily/mambaforge/envs/boltz-1/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Predicting DataLoader 0:   0%|          | 0/1 [00:00<?, ?it/s]
| WARNING: ran out of memory, skipping batch
Predicting DataLoader 0: 100%|██████████| 1/1 [2:34:20<00:00,  0.00it/s]
Number of failed examples: 1
Predicting DataLoader 0: 100%|██████████| 1/1 [2:34:20<00:00,  0.00it/s]
(boltz-1) lily@il-gpu04:~/amelie/Workspace/boltz$ 
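
For anyone sizing a similar input, a rough way to gauge how large the system is: count residues over all chain copies. This is only a sketch under the assumption that each protein residue counts as one token and that ligand atoms come on top of that; the function name is hypothetical and not part of Boltz.

```python
def count_protein_residues(entities):
    """Sum residues over all chain copies.

    entities: iterable of (chain_ids, sequence) pairs, mirroring the
    `id` and `sequence` fields of each `protein` entry in the YAML.
    ASSUMPTION: one token per residue; ligand atoms are not counted here.
    """
    return sum(len(chain_ids) * len(sequence) for chain_ids, sequence in entities)


# Peptide entry from the YAML above: three copies of an 8-residue peptide.
peptide = (["G", "H", "I"], "FSLESERP")
print(count_protein_residues([peptide]))  # 24
```

Passing the six-copy connexin entry as well gives the total protein size of the run. Pairwise features in this family of models grow roughly with the square of that number, which is why removing residues from every copy of a hexamer frees a lot of memory.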

amelie-iska avatar Nov 29 '24 23:11 amelie-iska

Same here

xinyu-dev avatar Dec 03 '24 01:12 xinyu-dev

same issues with OOM.

YogBN avatar Dec 03 '24 16:12 YogBN

Same issue here: OOM with a 2-chain protein complex (<500 aa in total) on an A100 (40 GB).

YaoYinYing avatar Dec 04 '24 09:12 YaoYinYing

We just released v0.3.2, which should address some of these issues. You can update with pip install boltz -U. When testing, please remove any existing output folder for your input and run again. Please let us know how it goes!
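
A small sketch of the "remove any existing output folder" step, for those who script their runs. It assumes the results are written to a folder named boltz_results_<input file stem> inside the output directory; verify that against what your installed version actually creates before deleting anything. The helper name is hypothetical.

```python
import shutil
from pathlib import Path


def clear_previous_results(input_path, out_dir="."):
    """Delete a stale results folder for one input; return True if one was removed."""
    # ASSUMPTION: results live in <out_dir>/boltz_results_<input stem>.
    results = Path(out_dir) / f"boltz_results_{Path(input_path).stem}"
    if results.is_dir():
        shutil.rmtree(results)
        return True
    return False
```

For example, clear_previous_results("examples/connexin-peptide.yaml") would look for ./boltz_results_connexin-peptide and remove it if it exists.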

jwohlwend avatar Dec 04 '24 21:12 jwohlwend

v0.3.2 works for my case!!!

YaoYinYing avatar Dec 05 '24 01:12 YaoYinYing

IT WORKED!!! 🔥 🔥 🔥

amelie-iska avatar Dec 05 '24 20:12 amelie-iska

I still had to truncate the last ~140 residues from the C-terminus of the connexins, though, so I ran with this YAML:

version: 1  # Optional, defaults to 1
sequences:
  - protein:
      id: [A,B,C,D,E,F]
      sequence: MGDWSALGRLLDKVQAYSTAGGKVWLSVLFIFRILLLGTAVESAWGDEQSAFVCNTQQPGCENVCYDKSFPISHVRFWVLQIIFVSTPTLLYLAHVFYLMRKEEKLNRKEEELKMVQNEGGNVDMHLKQIEIKKFKYGLEEHGKVKMRGGLLRTYIISILFKSVFEVGFIIIQWYMYGFSLSAIYTCKRDPCPHQVDCFLSRPTEKTIFIWFMLIVSIVSLALNIIELFYVTYKSIKDG

# Long disordered C-terminal tail of connexin
# IKGKKDPFSATNDAVISGKECGSPKYAYFNGCSSPTAPMSPPGYKLVTGERNPSSCRNYNKQASEQNWANYSAEQNRMGQAGSTISNTHAQPFDFSDEHQNTKKMAPGHEMQPLTILDQRPSSRASSHASSRPRPDDLEI

# Run command: 
# boltz predict examples/connexin-peptide.yaml --recycling_steps 20  --diffusion_samples 10 --use_msa_server
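
The truncation itself is just a slice. A minimal sketch (the helper name is hypothetical) that returns both the kept core and the removed tail, so the tail can be pasted back as a comment like the one above:

```python
def truncate_c_terminus(sequence, n_residues):
    """Return (core, tail): the sequence without its last n_residues, and the removed tail."""
    if not 0 <= n_residues <= len(sequence):
        raise ValueError("n_residues must be between 0 and len(sequence)")
    cut = len(sequence) - n_residues
    return sequence[:cut], sequence[cut:]


# Demonstrated on the short peptide from the YAML rather than the connexin.
core, tail = truncate_c_terminus("FSLESERP", 3)
print(core, tail)  # FSLES ERP
```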

Also, I am trying to alleviate memory issues by adding the code below to src/boltz/main.py. Will this help?

import torch
torch.set_float32_matmul_precision('medium')

I'm rerunning with the full 379 residue connexins now and will report back with an update once it either finishes or fails.
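
On the "will this help?" question: as far as I understand the PyTorch docs, torch.set_float32_matmul_precision selects how float32 matrix multiplications are computed internally (for example, allowing TF32 on GPUs that support it), while tensors keep their float32 storage. So it reads as a speed/precision setting, and I would not expect it to lower peak memory much. One way to check rather than guess is to measure peak allocation with and without it. A minimal sketch, with a hypothetical helper name, that degrades gracefully when PyTorch or CUDA is missing:

```python
try:
    import torch
except ImportError:  # keep the sketch importable on machines without PyTorch
    torch = None


def peak_cuda_memory_mib(fn):
    """Run fn() and return peak CUDA memory allocated in MiB, or None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()  # make sure queued GPU work has finished
    return torch.cuda.max_memory_allocated() / 2**20
```

Calling it once around the same workload with the default precision and once after set_float32_matmul_precision('medium') shows whether the peak moves at all on your hardware.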

amelie-iska avatar Dec 05 '24 20:12 amelie-iska

😔

amelie-iska avatar Dec 05 '24 22:12 amelie-iska

Hi! I'm curious about your reason for using --recycling_steps 20 --diffusion_samples 10. Are the results better than with the default parameters?

zongmingchua avatar Dec 18 '24 19:12 zongmingchua

Hi @zongmingchua. In general, you can expect that raising the number of recycles will improve prediction quality. Increasing the number of seeds/samples also raises your chances of getting a good prediction, so for larger, more complex systems I generally do not use the default settings. Another thing you might try is increasing the number of timesteps used in the diffusion process, which should also improve quality. All of these increase run time, though, so keep that in mind.
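
To compare settings side by side, it can help to generate the command lines from one place. The sketch below is not part of Boltz. The helper name is hypothetical, the default values are placeholders, and it assumes the diffusion timestep option is called --sampling_steps; check boltz predict --help for your installed version.

```python
import shlex


def build_predict_command(input_path, recycling_steps=3, diffusion_samples=1,
                          sampling_steps=200, use_msa_server=True):
    """Build an argv list for `boltz predict` with the three quality-related knobs."""
    cmd = [
        "boltz", "predict", str(input_path),
        "--recycling_steps", str(recycling_steps),
        "--diffusion_samples", str(diffusion_samples),
        # ASSUMPTION: flag name for the number of diffusion timesteps.
        "--sampling_steps", str(sampling_steps),
    ]
    if use_msa_server:
        cmd.append("--use_msa_server")
    return cmd


# Print the heavier configuration used earlier in this thread.
print(shlex.join(build_predict_command(
    "examples/connexin-peptide.yaml", recycling_steps=20, diffusion_samples=10)))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems with unusual file names.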

amelie-iska avatar Dec 18 '24 22:12 amelie-iska

pip install boltz -U

I am on boltz 0.4.1 (pip reports: Found existing installation: boltz 0.4.1).

But I still see:

Checking input data.
Running predictions for 1 structure
Processing input data.
Generating MSA for ../folder_predict_structure/permanent_peptide_008_designed_output_28_TER_added_replaced_to_LIG_element_replaced_to_LIG_packed_1_1.fasta with 3 protein entities.

COMPLETE: 100%|██████████| 450/450 [elapsed: 00:01 remaining: 00:00]
COMPLETE: 100%|██████████| 450/450 [elapsed: 00:00 remaining:
Sleeping for 8s. Reason: PENDING
PENDING:   0%|          | 0/450 [elapsed: 00:00 remaining: ?]
Sleeping for 8s. Reason: PENDING
PENDING:   0%|          | 0/450 [elapsed: 00:08 remaining: ?]
Sleeping for 7s. Reason: PENDING
PENDING:   0%|          | 0/450 [elapsed: 00:16 remaining: ?]
Sleeping for 7s. Reason: PENDING
PENDING:   0%|          | 0/450 [elapsed: 00:24 remaining: ?]
Sleeping for 6s. Reason: PENDING
PENDING:   0%|          | 0/450 [elapsed: 00:31 remaining: ?]
Sleeping for 5s. Reason: PENDING
PENDING:   0%|          | 0/450 [elapsed: 00:37 remaining: ?]
Sleeping for 6s. Reason: PENDING

Some sequences run without error. For other protein sequences, however, I get this PENDING error.

kimdn avatar Feb 23 '25 00:02 kimdn