OOM after reducing number of evoformer blocks
Just hoping someone has an idea of what might be going on, because I'm completely puzzled by this behavior.
After reducing "no_blocks" from 48 to 12, I get an OOM error. It's the only thing I change between runs. The network is in FP32.
OOM occurs in self.optimizer.backward(loss)
RuntimeError: CUDA out of memory. Tried to allocate 4.50 GiB (GPU 0; 31.75 GiB total capacity; 23.94 GiB already allocated; 1.81 GiB free; 28.35 GiB reserved in total by PyTorch)
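For context, this is essentially the whole diff between the two runs (a sketch; the config paths are from my setup and assume openfold's ml_collections-style config, so they may not match exactly):

```python
# Sketch of the only change between runs; key paths assume openfold's
# ml_collections config layout and may differ in other versions.
from openfold.config import model_config

config = model_config("model_1")               # base preset, FP32
config.model.evoformer_stack.no_blocks = 12    # was 48 -> now OOMs
# crop size, max_extra_msa, max_msa_clusters etc. left untouched
```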
Are you saying that it didn't OOM with the full 48 blocks?
Yes
Is this the 256 or 384 setting?
384, but I kept "max_extra_msa": 1024 and "max_msa_clusters": 128 because anything larger is too expensive. With the 256 setting it works fine.
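Rough numbers for why the 384 crop is so much heavier (back-of-the-envelope only: FP32, no chunking, 4 triangle-attention heads as in the AlphaFold paper; actual peaks will differ):

```python
# Back-of-the-envelope FP32 activation sizes, no chunking, to show why the
# 384 crop is so much heavier than 256. Illustrative only.
def gib(n_elements, bytes_per_element=4):
    return n_elements * bytes_per_element / 1024 ** 3

for n_res in (256, 384):
    pair = gib(n_res * n_res * 128)   # pair representation, c_z = 128
    tri = gib(4 * n_res ** 3)         # triangle-attention logits, one block
    print(f"N_res={n_res}: pair rep ~{pair:.2f} GiB, "
          f"triangle attention logits ~{tri:.2f} GiB")
```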
I guess reducing no_blocks won't actually have a big effect, because we're checkpointing anyway, right? Still, the memory increase is weird.
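To spell out what I mean, here's a toy sketch (not openfold's actual code): with activation checkpointing only each block's input is stored during the forward pass, and the block's internal activations are recomputed during backward, so peak memory is dominated by a single block plus the stored inputs.

```python
# Toy sketch, not openfold's code: with checkpointing, only each block's
# input survives the forward pass; a block's internal activations are
# rebuilt during backward. Dropping 48 -> 12 blocks therefore mainly saves
# the (small) stored inputs, not the large per-block attention buffers.
import torch
from torch.utils.checkpoint import checkpoint

class Stack(torch.nn.Module):
    def __init__(self, no_blocks, dim=128):
        super().__init__()
        self.blocks = torch.nn.ModuleList(
            torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU())
            for _ in range(no_blocks)
        )

    def forward(self, x):
        for block in self.blocks:
            x = checkpoint(block, x)   # stores x only; recomputes inside block
        return x
```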
Does torch allocate more total memory when you run it with 48 blocks? Sometimes it seems to reserve less than it ultimately ends up needing, and my hunch is that memory fragmentation, which looks like what's happening here (28.35 GiB reserved vs. 23.94 GiB allocated), is to blame for the underestimation.
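Logging something like this right before the backward call in both runs should settle it (sketch using the standard torch.cuda memory queries):

```python
# Sketch: log allocator state right before the backward call in both runs.
import torch

def log_cuda_memory(tag):
    allocated = torch.cuda.memory_allocated() / 1024 ** 3
    reserved = torch.cuda.memory_reserved() / 1024 ** 3
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")
    # torch.cuda.memory_summary() prints a fuller breakdown, including how
    # fragmented the reserved segments are.
```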
I will check. There might also be a regression between pytorch 1.9 and 1.10: https://github.com/pytorch/pytorch/issues/67680
Yep, the 48-block network allocates roughly 1 GB more. Downgrading to PyTorch 1.9.1 didn't help.
All I can say on this for now is that we're working on more memory-efficient attention. In principle, there's no reason why we shouldn't be able to get it as efficient as AlphaFold's.
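The general idea is to chunk the attention computation so the full logit matrix is never materialized at once. A generic sketch (illustrative only, not openfold's implementation):

```python
# Generic chunked attention, illustrative only (not openfold's code):
# queries are processed in slices so the full [n_q, n_k] logit matrix is
# never materialized at once, trading a little compute for peak memory.
import torch

def chunked_attention(q, k, v, chunk_size=256):
    # q: [n_q, d], k: [n_k, d], v: [n_k, d]
    scale = q.shape[-1] ** -0.5
    out = []
    for i in range(0, q.shape[0], chunk_size):
        q_chunk = q[i : i + chunk_size]                    # [chunk, d]
        logits = (q_chunk @ k.transpose(-1, -2)) * scale   # [chunk, n_k]
        out.append(logits.softmax(dim=-1) @ v)             # [chunk, d]
    return torch.cat(out, dim=0)
```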