CUDA out of memory errors
Hi developers,
I'm not sure whether this is an issue with our cluster or something Boltz-specific. I used to be able to run multiple sequences in batch, but it started giving out-of-memory errors. Now it gives these errors even when running a single sequence.
This is the command: boltz predict WT_restraints.yaml --recycling_steps 10 --diffusion_samples 25 --cache /data/groups/Boltz/ --use_msa_server
This is the input .yaml file:
sequences:
  - protein:
      id: [A1]
      sequence: HPTLKTPESVTGTWKGDVKIQCIYDPLRGYRQVLVKWLVRHGSDSVTIFLRDSTGDHIQQAKYRGRLKVSHKVPGDVSLQINTLQMDDRNHYTCEVTWQTPDGNQVIRDKIIELRVRK
  - protein:
      id: [B1]
      sequence: QVQLVESGGGLVQAGGSLRLSCAASGRTFSSYGMGWFRQAPGKEREFVAAIRWNGGSTYYADSVKGRFTISRDNAKNTVYLQMNSLKPEDTAVYYCAAGRWDKYGSSFQDEYDYWGQGTQVTVSS
constraints:
  - pocket:
      binder: B1
      contacts: [[A1, 111]]
Here's the slurm output:
Checking input data.
Running predictions for 1 structure
Processing input data.
0%| | 0/1 [00:00<?, ?it/s]Generating MSA for WT_mVsig4_restraints.yaml with 2 protein entities.
COMPLETE: 100%|██████████| 300/300 [elapsed: 00:01 remaining: 00:00]
COMPLETE: 100%|██████████| 300/300 [elapsed: 00:02 remaining: 00:00]
100%|██████████| 1/1 [00:03<00:00, 3.97s/it]
/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The srun command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with srun like so: srun python /data/groups/Boltz/Test/ve ...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, tensorboardX has been removed as a dependency of the pytorch_lightning package, due to potential conflicts with other packages in the ML ecosystem. For this reason, logger=True will use CSVLogger as the default logger, unless the tensorboard or tensorboardX packages are found. Please pip install lightning[extra] or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA H100 80GB HBM3') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Traceback (most recent call last):
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 47, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 897, in _predict_impl
results = self._run(model, ckpt_path=ckpt_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 957, in _run
self.strategy.setup(self)
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 154, in setup
self.model_to_device()
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/strategies/single_device.py", line 79, in model_to_device
self.model.to(self.root_device)
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 55, in to
return super().to(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1355, in to
return self._apply(convert)
^^^^^^^^^^^^^^^^^^^^
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 915, in _apply
module._apply(fn)
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 915, in _apply
module._apply(fn)
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 915, in _apply
module._apply(fn)
[Previous line repeated 4 more times]
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 942, in _apply
param_applied = fn(param)
^^^^^^^^^
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1341, in convert
return t.to(
^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 79.22 GiB of which 8.31 MiB is free. Process 1427700 has 77.45 GiB memory in use. Including non-PyTorch memory, this process has 1.74 GiB memory in use. Of the allocated memory 1.14 GiB is allocated by PyTorch, and 100.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/groups/Boltz/Test/venv/bin/boltz", line 8, in TORCH_USE_CUDA_DSA to enable device-side assertions.
Checking input data.
Running predictions for 1 structure
Processing input data.
0%| | 0/1 [00:00<?, ?it/s]Generating MSA for WT_hVsig4_restraints.yaml with 2 protein entities.
COMPLETE: 100%|██████████| 300/300 [elapsed: 00:01 remaining: 00:00]
COMPLETE: 100%|██████████| 300/300 [elapsed: 00:02 remaining: 00:00]
100%|██████████| 1/1 [00:03<00:00, 3.81s/it]
/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The srun command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with srun like so: srun python /data/groups/Boltz/Test/ve ...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, tensorboardX has been removed as a dependency of the pytorch_lightning package, due to potential conflicts with other packages in the ML ecosystem. For this reason, logger=True will use CSVLogger as the default logger, unless the tensorboard or tensorboardX packages are found. Please pip install lightning[extra] or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA H100 80GB HBM3') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Traceback (most recent call last):
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 47, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 897, in _predict_impl
results = self._run(model, ckpt_path=ckpt_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 957, in _run
self.strategy.setup(self)
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 154, in setup
self.model_to_device()
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/strategies/single_device.py", line 79, in model_to_device
self.model.to(self.root_device)
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 55, in to
return super().to(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1355, in to
return self._apply(convert)
^^^^^^^^^^^^^^^^^^^^
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 915, in _apply
module._apply(fn)
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 915, in _apply
module._apply(fn)
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 915, in _apply
module._apply(fn)
[Previous line repeated 4 more times]
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 942, in _apply
param_applied = fn(param)
^^^^^^^^^
File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1341, in convert
return t.to(
^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 79.22 GiB of which 8.31 MiB is free. Process 1427700 has 77.45 GiB memory in use. Including non-PyTorch memory, this process has 1.74 GiB memory in use. Of the allocated memory 1.14 GiB is allocated by PyTorch, and 100.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/groups/Boltz/Test/venv/bin/boltz", line 8, in TORCH_USE_CUDA_DSA to enable device-side assertions.
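The error message itself gives two hints: process 1427700 is already holding 77.45 GiB of the 80 GB card, so the GPU was essentially full before Boltz allocated anything, and PyTorch suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in case of fragmentation. Something like the following can tell which of the two it is before launching (just a sketch, nothing Boltz-specific; assumes nvidia-smi is available on the node):

# Who is holding the GPU? If another job already has most of the 80 GB,
# no Boltz-side setting will help.
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# Allocator hint from the error message; only relevant if memory is
# fragmented inside this process rather than held by another one.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
boltz predict WT_restraints.yaml --recycling_steps 10 --diffusion_samples 25 --cache /data/groups/Boltz/ --use_msa_server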
Hello, I also encountered this error while using Boltz. Have you solved it? Thank you.
One run went fine with a new installation (Boltz-2 now, I guess). After that I ran into the CUDA out of memory error again.
No clue what's going on.
Running it under SLURM, I took the hint from the output and used the following command: srun boltz predict ./Fastas/ --recycling_steps 10 --diffusion_samples 10 --cache /data/groups/Boltz2/ --use_msa_server
instead of: boltz predict ./Fastas/ --recycling_steps 10 --diffusion_samples 10 --cache /data/groups/Boltz2/ --use_msa_server
It seems to run now, but I have no clue whether that was actually what fixed it. Running on a cluster of GPUs, it could have been anything and might come back to haunt me again.
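If the srun change really is what fixed it, my guess is that without srun the process wasn't confined to the GPU SLURM had reserved and ended up on a card another job was already filling, which would match the 77.45 GiB held by process 1427700 in the traceback. A minimal batch script along these lines is one way to submit it; the job name, partition, CPU count, and time limit below are placeholders, adjust for your cluster:

#!/bin/bash
#SBATCH --job-name=boltz_predict     # placeholder job name
#SBATCH --partition=gpu              # placeholder partition
#SBATCH --gres=gpu:1                 # ask SLURM for a dedicated GPU
#SBATCH --cpus-per-task=8            # placeholder
#SBATCH --time=24:00:00              # placeholder

# srun launches the command as a proper SLURM job step inside the
# allocation, which is what the Lightning warning in the log asks for.
srun boltz predict ./Fastas/ --recycling_steps 10 --diffusion_samples 10 --cache /data/groups/Boltz2/ --use_msa_server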