
CUDA out of memory errors

Open RoKant opened this issue 6 months ago • 3 comments

Hi developers,

I'm not sure whether this is an issue with our cluster or something Boltz-specific. I used to be able to run multiple sequences in a batch, but it started giving out-of-memory errors. Now it already gives these errors when running a single sequence.

This is the command: boltz predict WT_restraints.yaml --recycling_steps 10 --diffusion_samples 25 --cache /data/groups/Boltz/ --use_msa_server

This is the input .yaml file:

sequences:
  - protein:
      id: [A1]
      sequence: HPTLKTPESVTGTWKGDVKIQCIYDPLRGYRQVLVKWLVRHGSDSVTIFLRDSTGDHIQQAKYRGRLKVSHKVPGDVSLQINTLQMDDRNHYTCEVTWQTPDGNQVIRDKIIELRVRK
  - protein:
      sequence: QVQLVESGGGLVQAGGSLRLSCAASGRTFSSYGMGWFRQAPGKEREFVAAIRWNGGSTYYADSVKGRFTISRDNAKNTVYLQMNSLKPEDTAVYYCAAGRWDKYGSSFQDEYDYWGQGTQVTVSS
      id: [B1]
constraints:
  - pocket:
      binder: B1
      contacts: [ [ A1, 111 ] ]

Here's the SLURM output:

Checking input data.
Running predictions for 1 structure
Processing input data.

Generating MSA for WT_mVsig4_restraints.yaml with 2 protein entities.

COMPLETE: 100%|██████████| 300/300 [elapsed: 00:01 remaining: 00:00]
COMPLETE: 100%|██████████| 300/300 [elapsed: 00:02 remaining: 00:00]

100%|██████████| 1/1 [00:03<00:00, 3.97s/it]
/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The srun command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with srun like so: srun python /data/groups/Boltz/Test/ve ...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, tensorboardX has been removed as a dependency of the pytorch_lightning package, due to potential conflicts with other packages in the ML ecosystem. For this reason, logger=True will use CSVLogger as the default logger, unless the tensorboard or tensorboardX packages are found. Please pip install lightning[extra] or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA H100 80GB HBM3') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Traceback (most recent call last):
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 47, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 897, in _predict_impl
    results = self._run(model, ckpt_path=ckpt_path)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 957, in _run
    self.strategy.setup(self)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 154, in setup
    self.model_to_device()
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/strategies/single_device.py", line 79, in model_to_device
    self.model.to(self.root_device)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 55, in to
    return super().to(*args, **kwargs)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1355, in to
    return self._apply(convert)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 942, in _apply
    param_applied = fn(param)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1341, in convert
    return t.to(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 79.22 GiB of which 8.31 MiB is free. Process 1427700 has 77.45 GiB memory in use. Including non-PyTorch memory, this process has 1.74 GiB memory in use. Of the allocated memory 1.14 GiB is allocated by PyTorch, and 100.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/groups/Boltz/Test/venv/bin/boltz", line 8, in <module>
    sys.exit(cli())
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/boltz/main.py", line 765, in predict
    trainer.predict(
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 858, in predict
    return call._call_and_handle_interrupt(
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 68, in _call_and_handle_interrupt
    trainer._teardown()
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1004, in _teardown
    self.strategy.teardown()
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 535, in teardown
    self.lightning_module.cpu()
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 82, in cpu
    return super().cpu()
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1133, in cpu
    return self._apply(lambda t: t.cpu())
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torchmetrics/metric.py", line 907, in _apply
    _dummy_tensor = fn(torch.zeros(1, device=self.device))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Checking input data.
Running predictions for 1 structure
Processing input data.

Generating MSA for WT_hVsig4_restraints.yaml with 2 protein entities.

COMPLETE: 100%|██████████| 300/300 [elapsed: 00:01 remaining: 00:00]
COMPLETE: 100%|██████████| 300/300 [elapsed: 00:02 remaining: 00:00]

100%|██████████| 1/1 [00:03<00:00, 3.81s/it]
/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The srun command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with srun like so: srun python /data/groups/Boltz/Test/ve ...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, tensorboardX has been removed as a dependency of the pytorch_lightning package, due to potential conflicts with other packages in the ML ecosystem. For this reason, logger=True will use CSVLogger as the default logger, unless the tensorboard or tensorboardX packages are found. Please pip install lightning[extra] or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA H100 80GB HBM3') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Traceback (most recent call last):
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 47, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 897, in _predict_impl
    results = self._run(model, ckpt_path=ckpt_path)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 957, in _run
    self.strategy.setup(self)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 154, in setup
    self.model_to_device()
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/strategies/single_device.py", line 79, in model_to_device
    self.model.to(self.root_device)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 55, in to
    return super().to(*args, **kwargs)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1355, in to
    return self._apply(convert)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 942, in _apply
    param_applied = fn(param)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1341, in convert
    return t.to(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 79.22 GiB of which 8.31 MiB is free. Process 1427700 has 77.45 GiB memory in use. Including non-PyTorch memory, this process has 1.74 GiB memory in use. Of the allocated memory 1.14 GiB is allocated by PyTorch, and 100.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/groups/Boltz/Test/venv/bin/boltz", line 8, in <module>
    sys.exit(cli())
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/boltz/main.py", line 765, in predict
    trainer.predict(
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 858, in predict
    return call._call_and_handle_interrupt(
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 68, in _call_and_handle_interrupt
    trainer._teardown()
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1004, in _teardown
    self.strategy.teardown()
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 535, in teardown
    self.lightning_module.cpu()
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 82, in cpu
    return super().cpu()
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1133, in cpu
    return self._apply(lambda t: t.cpu())
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 915, in _apply
    module._apply(fn)
  File "/data/groups/Boltz/Test/venv/lib/python3.11/site-packages/torchmetrics/metric.py", line 907, in _apply
    _dummy_tensor = fn(torch.zeros(1, device=self.device))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
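For what it's worth, both tracebacks fail while merely moving the model onto the GPU (a 20 MiB allocation), and the memory report shows that another process (PID 1427700) is already holding 77.45 of the 79.22 GiB on GPU 0, while the boltz process itself only uses about 1.74 GiB. A minimal diagnostic sketch, assuming a standard CUDA/SLURM node (nothing here is boltz-specific), would be to check what is occupying the card before launching and, optionally, set the allocator option the error message itself suggests:

# Show which processes currently hold memory on the visible GPU
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# Optional: reduce fragmentation inside this process, as suggested by the PyTorch message
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Re-run the same prediction
boltz predict WT_restraints.yaml --recycling_steps 10 --diffusion_samples 25 --cache /data/groups/Boltz/ --use_msa_server

Note that expandable_segments only affects allocations made by this process; it cannot free memory held by the other PID, so if the card is genuinely occupied the job needs a free (or exclusively allocated) GPU instead.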

RoKant avatar Jun 06 '25 09:06 RoKant

Hello, I ran into the same error while using it. Have you solved it? Thank you.

[image attachment]

CH00101 avatar Jun 10 '25 02:06 CH00101

One run went fine with a new installation (now Boltz-2, I guess). After that, I'm running into the CUDA out-of-memory error again.

No clue what's going on.

RoKant avatar Jun 10 '25 12:06 RoKant

I'm running it through SLURM; I took the tip from the output and am now using the following command: srun boltz predict ./Fastas/ --recycling_steps 10 --diffusion_samples 10 --cache /data/groups/Boltz2/ --use_msa_server

instead of: boltz predict ./Fastas/ --recycling_steps 10 --diffusion_samples 10 --cache /data/groups/Boltz2/ --use_msa_server

It seems to run now, but I have no clue whether that was actually what fixed it. Since this runs on a cluster of GPUs, it could have been anything and might come back to haunt me again.
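In case it helps, this is roughly what the submission looks like as an sbatch script. It is only a sketch: the resource values are placeholders, and the venv activation assumes the path from the earlier traceback, which may have changed with the new installation. Prepending srun mainly ensures the step runs inside the resources SLURM actually allocated (and silences the Lightning hint); whether that, or simply landing on a free GPU, is what made the run succeed is unclear.

#!/bin/bash
#SBATCH --job-name=boltz_predict      # placeholder job name
#SBATCH --gres=gpu:1                  # request one dedicated GPU
#SBATCH --cpus-per-task=8             # placeholder CPU count
#SBATCH --mem=64G                     # placeholder host memory
#SBATCH --time=04:00:00               # placeholder walltime

# Activate the environment boltz was installed into (path taken from the traceback above)
source /data/groups/Boltz/Test/venv/bin/activate

# Prepend srun so the prediction step runs inside the allocated resources
srun boltz predict ./Fastas/ --recycling_steps 10 --diffusion_samples 10 --cache /data/groups/Boltz2/ --use_msa_server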

RoKant avatar Jun 11 '25 08:06 RoKant