GPU Memory Limit issue

Open SamFreemanFox opened this issue 6 months ago • 0 comments

I am currently running into GPU memory issues whilst trying to predict a protein structure. The command and its console output are as follows:

boltz predict TEST_PROTEIN.fasta --use_msa_server
Checking input data.
All inputs are already processed.
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/home/antigenteam/miniconda3/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:76: Starting from v1.9.0, tensorboardX has been removed as a dependency of the pytorch_lightning package, due to potential conflicts with other packages in the ML ecosystem. For this reason, logger=True will use CSVLogger as the default logger, unless the tensorboard or tensorboardX packages are found. Please pip install lightning[extra] or one of them to enable TensorBoard support by default
Running structure prediction for 1 input.
/home/antigenteam/miniconda3/lib/python3.12/site-packages/pytorch_lightning/utilities/migration/utils.py:56: The loaded checkpoint was produced with Lightning v2.5.0.post0, which is newer than your current Lightning version: v2.5.0
You are using a CUDA device ('NVIDIA GeForce RTX 4080 Laptop GPU') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Predicting DataLoader 0: 0%| | 0/1 [00:00<?, ?it/s]| WARNING: ran out of memory, skipping batch
Predicting DataLoader 0: 100%|██████████| 1/1 [00:01<00:00, 0.56it/s]Number of failed examples: 1
Predicting DataLoader 0: 100%|██████████| 1/1 [00:01<00:00, 0.56it/s]

While monitoring the GPU, I see memory usage spike to 100% during the first prediction step, just before the "ran out of memory, skipping batch" warning. Is there a way to limit the memory required to avoid this?
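For anyone wanting to reproduce the memory observation, here is a minimal sketch of one way to sample GPU memory from a second terminal while the prediction runs. It assumes `nvidia-smi` is on the PATH and shells out to its CSV query mode; the helper names (`parse_memory_csv`, `percent_used`, `sample_gpu_memory`) are hypothetical and not part of boltz or PyTorch, and this is not necessarily how the figures above were obtained.

```python
import subprocess

# nvidia-smi CSV query: one "used, total" row per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse rows like '1024, 8192' into a list of (used_mib, total_mib)."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def percent_used(used, total):
    """Return memory usage as a percentage of the total."""
    return 100.0 * used / total


def sample_gpu_memory():
    """Run nvidia-smi once and return (used_mib, total_mib) for each GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)


if __name__ == "__main__":
    for index, (used, total) in enumerate(sample_gpu_memory()):
        print(f"GPU {index}: {used} / {total} MiB ({percent_used(used, total):.1f}%)")
```

Running this in a loop (for example once per second) during `boltz predict` should show whether usage climbs gradually or jumps straight to the limit. The parser assumes numeric fields and would need extra handling if the driver reports a value as unavailable.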

Jun 10 '25 15:06 SamFreemanFox