
out of memory

Open dhiraj82 opened this issue 3 weeks ago • 2 comments

I am trying to design binders (80-160 res.) against my protein (~650 AA) but often get this error. I have 46 GB of memory per GPU, and sometimes 2 or 3 of the 4 instances per node get killed, leaving only one or two running. It happens on several different kinds of GPU, including RTX 8000, A10, and L40S. I understand that 600 residues is a large target, but I cannot make it smaller: it is a heterotetramer and I am trying to design a binder that interacts with all four subunits. Is this behaviour normal, or did I make a mistake during installation? I installed it through an Apptainer container on Rocky Linux 9.6, with CUDA version 12.2.

Thank you Dhiraj

OutOfMemoryError: CUDA out of memory. Tried to allocate 5.55 GiB. GPU 0 has a total capacity of 44.35 GiB of which 5.03 GiB is free. Including non-PyTorch memory, this process has 39.31 GiB memory in use. Of the allocated memory 29.72 GiB is allocated by PyTorch, and 9.28 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management
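
As an aside, the allocator hint at the end of that traceback refers to a standard PyTorch environment variable. It only helps when the problem is fragmentation of reserved-but-unallocated memory, not when the model genuinely needs more than the card has, and it is set in the environment before the run starts:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True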

dhiraj82 avatar Dec 09 '25 04:12 dhiraj82

Have you tried running with low_memory_mode=True?

Ubiquinone-dot avatar Dec 13 '25 08:12 Ubiquinone-dot

Hi Jasper, I tried that. Do you think a 600-residue target is too big for 48 GB of VRAM?

Thanks Dhiraj

dhiraj82 avatar Dec 13 '25 18:12 dhiraj82

Have you tried lowering the number of samples per batch to 2 (diffusion_batch_size)? I think the default is 8, which means, from my understanding, that rfd3 is producing 8 structures at a time. I had the same problems with running out of memory and was able to get it running by doing this:

diffusion_batch_size=2 skip_existing="True" n_batches=25
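
If those options mean what their names suggest, the total number of designs per run is roughly diffusion_batch_size × n_batches, so the line above gives about 50 designs. Lowering the batch size further trades peak memory for wall-clock time, and you would raise n_batches to keep the same total, for example:

diffusion_batch_size=1 skip_existing="True" n_batches=50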

Hope that helps. You might even have to go down to 1.

Best

Toby

Cookiemaster33 avatar Dec 14 '25 16:12 Cookiemaster33

I encountered the same issue. I noticed that over time, memory usage keeps increasing. Is there a way to clean up memory while the program is running?

kangsgo avatar Dec 17 '25 09:12 kangsgo

There certainly shouldn't be an increase in memory usage over time - do you have more details? How many batches does this happen across? If you're returning atom arrays then it might accumulate those, but otherwise they get written to disk.
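
For anyone driving the sampling from their own Python loop rather than the command line, a generic pattern (not foundry-specific; the random tensor below is just a stand-in for one batch of outputs) for keeping resident GPU memory flat is to write each batch to disk, drop the references, and let PyTorch release its cached blocks:

import gc
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

for i in range(4):
    outputs = torch.randn(1000, 1000, device=device)  # stand-in for one batch of generated structures
    torch.save(outputs.cpu(), f"batch_{i}.pt")        # persist to disk instead of appending to a list
    del outputs                                       # drop the reference so the tensors can be freed
    gc.collect()                                      # collect reference cycles that can pin GPU tensors
    if device == "cuda":
        torch.cuda.empty_cache()                      # return cached blocks to the driver; helps fragmentation, not true leaks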

Ubiquinone-dot avatar Dec 17 '25 10:12 Ubiquinone-dot

@dhiraj82, we're aiming to release guidance on compiling the RMSNorms soon; with that you should be able to run 600 residues on 46 GB (it roughly halves peak memory consumption and makes it about twice as fast), but it'll still be a bit tight. Also +1 to @Cookiemaster33's suggestion.
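
For context, the saving from compilation comes from fusing the normalisation's chain of elementwise ops into a single kernel so the intermediate tensors never materialise. Below is a minimal sketch of the general idea, not the official foundry guidance, with an illustrative RMSNorm definition rather than the model's own:

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # minimal RMSNorm for illustration only
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # scale by the reciprocal root-mean-square over the last dimension
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

norm = torch.compile(RMSNorm(256))  # fuses the elementwise chain into one kernel, cutting peak intermediate memory
y = norm(torch.randn(8, 512, 256))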

Ubiquinone-dot avatar Dec 17 '25 10:12 Ubiquinone-dot

Hi Jasper and @Cookiemaster33, I did try reducing the batch size to 1, without much help. I am able to run it on an A100 with 80 GB, but those are limited in number on our HPC and hard to get onto. I got the job done, but it would be helpful to be able to run these computations on other GPUs with less memory.

dhiraj82 avatar Dec 18 '25 02:12 dhiraj82

I want to design against a total of 1300 AA and ran out of memory on an A100 80 GB. Do you have any suggestions?

ttnnguyen1 avatar Dec 18 '25 04:12 ttnnguyen1

@ttnnguyen1 1300 AA is too big a target; even if you can run it without memory issues, it will take a long time to generate enough binders for filtering and validation. I would suggest truncating it to the domain you are targeting. My target is 600 AA because I am targeting a heterotetrameric complex for stabilization purposes and I want the binders to make contact with all four subunits. I am still using the minimal number of residues (one small domain from each subunit) required for heterotetramerization and targeting. I am not sure whether RFdiffusion will like cutting within a domain in a way that generates several gaps.

dhiraj82 avatar Dec 18 '25 04:12 dhiraj82