Out of memory
I am trying to design binders (80-160 residues) against my protein (~650 AA) but I often get this error. I have 46 GB of memory per GPU, and sometimes 2 or 3 of the 4 instances per node get killed while only one or two keep running. It happens on several different kinds of GPU (RTX 8000, A10, L40S). I understand that 600 residues is large, but I cannot make it smaller: it is a heterotetramer and I am trying to design a binder that interacts with all four subunits. Is this behaviour normal, or did I make a mistake during installation? I installed it through an Apptainer container on Rocky Linux 9.6 with CUDA 12.2.
Thank you,
Dhiraj
OutOfMemoryError: CUDA out of memory. Tried to allocate 5.55 GiB. GPU 0 has a total capacity of 44.35 GiB of which 5.03 GiB is free. Including non-PyTorch memory, this process has 39.31 GiB memory in use. Of the allocated memory 29.72 GiB is allocated by PyTorch, and 9.28 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management
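The fragmentation hint in that traceback refers to the standard PyTorch allocator option. A minimal sketch of applying it, assuming the run is launched from Python: the option has to be in the environment before the first CUDA allocation, and the tiny tensor below is only there to show where that happens.

```python
import os

# Must be set before PyTorch initialises its CUDA caching allocator,
# i.e. before the first tensor is placed on the GPU.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

x = torch.zeros(1, device="cuda")  # first allocation picks up the setting
print(torch.cuda.memory_summary())
```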
Have you tried running with low_memory_mode=True?
Hi Jasper, I tried that. Do you think a 600-residue target is too big for 48 GB of VRAM?
Thanks,
Dhiraj
Have you tried lowering the number of sample batches to 2 ("diffusion_batch_size")? I think the standard is set to 8, which means, from my understanding, that rfd3 produces 8 structures at a time. I had the same problem with overloaded memory and was able to get it running by doing this:
diffusion_batch_size=2 skip_existing="True" n_batches=25
Hope that helps. You might even have to go down to "1".
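To make the trade-off concrete, here is a tiny sketch (plain Python, nothing rfd3-specific; the helper function is made up for illustration) of how the two settings relate:

```python
# Total output is diffusion_batch_size * n_batches, while peak GPU memory
# tracks how many structures are being diffused at once (diffusion_batch_size).

def total_designs(diffusion_batch_size: int, n_batches: int) -> int:
    return diffusion_batch_size * n_batches

print(total_designs(8, 7))   # 56 designs, but 8 structures held on the GPU at once
print(total_designs(2, 25))  # 50 designs with only 2 structures on the GPU at a time
```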
Best
Toby
I encountered the same issue. I noticed that memory usage keeps increasing over time. Is there a way to clean up memory while the program is running?
There certainly shouldn't be an increase in memory usage over time - do you have more details? How many batches does this happen across? If you're returning atom arrays then it might accumulate those, but otherwise they get written to disk.
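If you want to check whether allocations really grow from batch to batch, a minimal sketch using only the generic torch.cuda API (nothing rfd3-specific; where and how you hook it in is up to you):

```python
import gc
import torch

def log_gpu_memory(batch_idx: int) -> None:
    # Release unreferenced tensors and cached blocks first, so the numbers
    # below reflect memory that is genuinely still held between batches.
    gc.collect()
    torch.cuda.empty_cache()
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"batch {batch_idx}: allocated={allocated:.2f} GiB "
          f"reserved={reserved:.2f} GiB peak={peak:.2f} GiB")
    torch.cuda.reset_peak_memory_stats()
```

Called after each batch finishes, a steadily climbing "allocated" value would point to something being held onto (e.g. accumulating returned atom arrays) rather than written to disk and released.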
@dhiraj82, we're aiming to release guidance on compiling the RMSNorms soon; with that you should be able to run 600 residues on 46 GB (it roughly halves peak memory consumption and makes it twice as fast), but it will still be a bit tight. Also, +1 to @Cookiemaster33's suggestion.
Hi Jasper and @Cookiemaster33, I did try reducing the batch size to 1, without much help. I am able to run it on an A100 with 80 GB, but those are limited in number on our HPC, so they are difficult to get. I got the job done, but it would be helpful to be able to run these computations on other GPUs with less memory.
I want to design against a target of 1300 AA in total and got out of memory on an A100 80 GB. Do you have any suggestions?
@ttnnguyen1 1300 AA is too big a target, and even if you are able to run it without memory issues, it will take a long time to generate enough binders for filtering and validation. I would suggest truncating it to the domain you are targeting. My target is 600 AA because I am targeting a heterotetrameric complex for stabilization purposes and I want binders to make contact with all four subunits. I am still using the minimal number of residues (one small domain from each) required for heterotetramerization and targeting. I am not sure whether RFdiffusion will tolerate cutting within a domain in a way that generates several gaps.