
Ways to reduce memory use

Open peastman opened this issue 2 years ago • 15 comments

I'm trying to train equivariant transformer models on a GPU with 12 GB of memory. I can train small to medium sized models, but if I make it too large (for example, 6 layers with embedding dimension 96), CUDA runs out of device memory. Is there anything I can do to reduce the memory requirements? I already tried reducing the batch size but it didn't help.

peastman avatar May 31 '22 19:05 peastman

Cutoff 5 and 5 interaction layers seem to be optimal, so see if that fits. Maybe Raimondas's optimizations could help?

And of course batch size: what is your current batch size for training?


giadefa avatar May 31 '22 19:05 giadefa

16 bit floats?
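Mixed precision roughly halves the memory taken by activations. As a minimal sketch, here is generic PyTorch automatic mixed precision, not a torchmd-net option; the tiny model and random batch are placeholders for illustration only. Since training here goes through PyTorch Lightning, passing precision=16 to the Lightning Trainer would be the equivalent switch, if the training script exposes it.

```python
# Generic PyTorch automatic mixed precision (AMP) sketch, not a torchmd-net API.
# The tiny model and random batch below are placeholders for illustration only.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(96, 96), torch.nn.SiLU(), torch.nn.Linear(96, 1)
).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # rescales the loss so fp16 gradients do not underflow

x = torch.randn(32, 96, device="cuda")  # dummy input batch
y = torch.randn(32, 1, device="cuda")   # dummy targets

optimizer.zero_grad()
with torch.cuda.amp.autocast():          # forward pass runs in float16 where it is safe
    loss = torch.nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()            # backward pass on the scaled loss
scaler.step(optimizer)                   # unscales gradients, then steps
scaler.update()
```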

PhilippThoelke avatar May 31 '22 19:05 PhilippThoelke

I'm using cutoff 10. I've found that 5 is far too short. It can't reproduce energies for molecules larger than about 40 atoms, and it has no chance at all on intermolecular interactions.

Batch size seems to have very little effect on memory use. With 5 layers I can use batch size 100. Add a sixth layer and it runs out of memory even if I reduce it to 1.

peastman avatar May 31 '22 20:05 peastman

With a cutoff of 10 Å, what are you using for max_num_neighbors?


giadefa avatar May 31 '22 20:05 giadefa

Batch size seems to have very little effect on memory use. With 5 layers I can use batch size 100. Add a sixth layer and it runs out of memory even if I reduce it to 1.

That seems odd. Are you sure you are changing batch_size and inference_batch_size?

PhilippThoelke avatar May 31 '22 20:05 PhilippThoelke

With a cutoff of 10 Å, what are you using for max_num_neighbors?

80

Are you sure you are changing batch_size and inference_batch_size?

I was changing only batch_size, not inference_batch_size. If I reduce both of them to 32 then I can get it to run. Thanks!
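For reference, a minimal sketch of how that could look in the training configuration, assuming the usual YAML file passed to the training script; only the two relevant keys are shown:

```yaml
# Hypothetical fragment of a torchmd-net training config.
# Both keys need to be lowered: validation/inference batches use GPU memory too.
batch_size: 32
inference_batch_size: 32
```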

peastman avatar May 31 '22 20:05 peastman

Are you using the latest code that errors out if the neighbor count goes above 80? With a 10 Å cutoff that seems possible.


giadefa avatar May 31 '22 20:05 giadefa

There's no problem with values higher than 80 (except of course running out of memory). 100 also works.

peastman avatar May 31 '22 20:05 peastman

With higher values, of course, but there can be more than 80 atoms within 10 Å.


giadefa avatar May 31 '22 21:05 giadefa

It depends on the particular samples. Does the value of max_num_neighbors apply only to training? Or does it set a limit on any molecule you can ever evaluate with the trained model?

peastman avatar May 31 '22 21:05 peastman

It always applies, not only during training. The argument determines the maximum number of neighbors that are collected in the neighbor list algorithm. You can overwrite it when you load a model checkpoint, though, for example to set it to a higher number for inference.

PhilippThoelke avatar May 31 '22 21:05 PhilippThoelke

That's good to know. If I want to override it, would I just add the argument max_num_neighbors=100 in the call to load_from_checkpoint()?

peastman avatar May 31 '22 22:05 peastman

I think that should work, but make sure it actually overwrites the value. I'd recommend using the torchmdnet.models.model.load_model function to load the model for inference, which strips away the PyTorch Lightning overhead. There you can just pass it as a keyword argument to overwrite it. You can also, for example, enable or disable force predictions at inference time using derivative=True/False.
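A minimal sketch of what that could look like; the checkpoint filename is hypothetical, and the keyword arguments are the ones discussed above:

```python
# Load a trained model for inference; keyword arguments override values
# stored with the checkpoint. The checkpoint path here is hypothetical.
from torchmdnet.models.model import load_model

model = load_model(
    "epoch=99-val_loss=0.0123.ckpt",  # hypothetical checkpoint file
    max_num_neighbors=100,            # raise the neighbor-list cap for inference
    derivative=True,                  # also return forces (negative energy gradient)
)
```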

PhilippThoelke avatar May 31 '22 23:05 PhilippThoelke

Cutoff 5 and 5 interaction layers seem to be optimal, so see if that fits.

Maybe Raimondas's optimizations could help?


giadefa avatar Oct 11 '22 09:10 giadefa

Cutoff 5 doesn't work for anything except very small molecules. Above about 40 atoms, it's essential to have a longer cutoff or you get very large errors. I'm hoping that once we add explicit terms for Coulomb and dispersion, that will allow using a shorter cutoff for the neural network.

peastman avatar Oct 11 '22 15:10 peastman