OOM after reducing number of evoformer blocks
Just hoping someone has an idea of what might be going on, because I'm completely puzzled by this behavior.
After reducing "no_blocks" from 48 to 12, I get an OOM error. It's the only thing I change between runs. The network is in FP32.
OOM occurs in self.optimizer.backward(loss)
RuntimeError: CUDA out of memory. Tried to allocate 4.50 GiB (GPU 0; 31.75 GiB total capacity; 23.94 GiB already allocated; 1.81 GiB free; 28.35 GiB reserved in total by PyTorch)
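For context, this is essentially the whole diff between the two runs (a sketch; the config paths are from my setup and assume openfold's ml_collections-style config, so they may not match exactly):

```python
# Sketch of the only change between runs; key paths assume openfold's
# ml_collections config layout and may differ in other versions.
from openfold.config import model_config

config = model_config("model_1")               # base preset, FP32
config.model.evoformer_stack.no_blocks = 12    # was 48 -> now OOMs
# crop size, max_extra_msa, max_msa_clusters etc. left untouched
```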
Are you saying that it didn't OOM with the full 48 blocks?
Yes
Is this the 256 or 384 setting?
384, but I kept "max_extra_msa": 1024 and "max_msa_clusters": 128 because anything larger is too expensive. With the 256 setting it works fine.
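Rough numbers for why the 384 crop is so much heavier (back-of-the-envelope only: FP32, no chunking, 4 triangle-attention heads as in the AlphaFold paper; actual peaks will differ):

```python
# Back-of-the-envelope FP32 activation sizes, no chunking, to show why the
# 384 crop is so much heavier than 256. Illustrative only.
def gib(n_elements, bytes_per_element=4):
    return n_elements * bytes_per_element / 1024 ** 3

for n_res in (256, 384):
    pair = gib(n_res * n_res * 128)   # pair representation, c_z = 128
    tri = gib(4 * n_res ** 3)         # triangle-attention logits, one block
    print(f"N_res={n_res}: pair rep ~{pair:.2f} GiB, "
          f"triangle attention logits ~{tri:.2f} GiB")
```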
I guess reducing no_blocks won't actually have a big effect, because we're checkpointing anyway, right? Still, the memory increase is weird.
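To spell out what I mean, here's a toy sketch (not openfold's actual code): with activation checkpointing only each block's input is stored during the forward pass, and the block's internal activations are recomputed during backward, so peak memory is dominated by a single block plus the stored inputs.

```python
# Toy sketch, not openfold's code: with checkpointing, only each block's
# input survives the forward pass; a block's internal activations are
# rebuilt during backward. Dropping 48 -> 12 blocks therefore mainly saves
# the (small) stored inputs, not the large per-block attention buffers.
import torch
from torch.utils.checkpoint import checkpoint

class Stack(torch.nn.Module):
    def __init__(self, no_blocks, dim=128):
        super().__init__()
        self.blocks = torch.nn.ModuleList(
            torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU())
            for _ in range(no_blocks)
        )

    def forward(self, x):
        for block in self.blocks:
            x = checkpoint(block, x)   # stores x only; recomputes inside block
        return x
```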
Does torch allocate more total memory when you run it with 48 blocks? Sometimes it seems to reserve less than it ultimately ends up needing, and my hunch is that memory fragmentation, which looks like what's happening here (28.35 GiB reserved vs. 23.94 GiB allocated), is to blame for the underestimation.
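Logging something like this right before the backward call in both runs should settle it (sketch using the standard torch.cuda memory queries):

```python
# Sketch: log allocator state right before the backward call in both runs.
import torch

def log_cuda_memory(tag):
    allocated = torch.cuda.memory_allocated() / 1024 ** 3
    reserved = torch.cuda.memory_reserved() / 1024 ** 3
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")
    # torch.cuda.memory_summary() prints a fuller breakdown, including how
    # fragmented the reserved segments are.
```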
I will check. There might also be a regression between pytorch 1.9 and 1.10: https://github.com/pytorch/pytorch/issues/67680
Yep, the 48-block network allocates roughly 1 GB more. Downgrading to PyTorch 1.9.1 didn't help.
All I can say on this for now is that we're working on more memory-efficient attention. In principle, there's no reason why we shouldn't be able to get it as efficient as AlphaFold's.
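The general idea is to chunk the attention computation so the full logit matrix is never materialized at once. A generic sketch (illustrative only, not openfold's implementation):

```python
# Generic chunked attention, illustrative only (not openfold's code):
# queries are processed in slices so the full [n_q, n_k] logit matrix is
# never materialized at once, trading a little compute for peak memory.
import torch

def chunked_attention(q, k, v, chunk_size=256):
    # q: [n_q, d], k: [n_k, d], v: [n_k, d]
    scale = q.shape[-1] ** -0.5
    out = []
    for i in range(0, q.shape[0], chunk_size):
        q_chunk = q[i : i + chunk_size]                    # [chunk, d]
        logits = (q_chunk @ k.transpose(-1, -2)) * scale   # [chunk, n_k]
        out.append(logits.softmax(dim=-1) @ v)             # [chunk, d]
    return torch.cat(out, dim=0)
```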