torchmd-net
Ways to reduce memory use
I'm trying to train equivariant transformer models on a GPU with 12 GB of memory. I can train small to medium sized models, but if I make it too large (for example, 6 layers with embedding dimension 96), CUDA runs out of device memory. Is there anything I can do to reduce the memory requirements? I already tried reducing the batch size but it didn't help.
And of course batch size: what is your current batch size for training?
16-bit floats?
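As a sketch of what that suggestion could look like in the training YAML, assuming the script forwards PyTorch Lightning's precision option (check your config schema before relying on this key):

```yaml
# Train in 16-bit mixed precision instead of the default 32-bit.
# Roughly halves activation memory; worth validating accuracy on a small run.
precision: 16
```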
I'm using a cutoff of 10. I've found that 5 is far too short: it can't reproduce energies for molecules larger than about 40 atoms, and it has no chance at all on intermolecular interactions.
Batch size seems to have very little effect on memory use. With 5 layers I can use batch size 100. Add a sixth layer and it runs out of memory even if I reduce it to 1.
With a cutoff of 10 Å, what are you using for max_num_neighbors?
Batch size seems to have very little effect on memory use. With 5 layers I can use batch size 100. Add a sixth layer and it runs out of memory even if I reduce it to 1.
That seems odd. Are you sure you are changing batch_size and inference_batch_size?
With a cutoff of 10 Å, what are you using for max_num_neighbors?
80
Are you sure you are changing batch_size and inference_batch_size?
I was changing only batch_size, not inference_batch_size. If I reduce both of them to 32 then I can get it to run. Thanks!
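As a sketch, the working change amounts to lowering both keys in the training configuration (key names as used in this thread; the rest of the YAML is omitted):

```yaml
# Reduce both training and inference batch sizes. The validation/test
# passes use inference_batch_size, so lowering batch_size alone may
# leave the peak memory use unchanged.
batch_size: 32
inference_batch_size: 32
```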
Are you using the latest code, which errors out if the neighbor count gets above 80? With a 10 Å cutoff that seems possible.
There's no problem with values higher than 80 (except, of course, running out of memory). 100 also works.
Higher values work, of course, but there can be more than 80 atoms within 10 Å.
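A back-of-the-envelope estimate illustrates why: assuming a condensed-phase density of roughly 0.1 atoms per cubic ångström (about that of liquid water, hydrogens included; an assumption, not a value from this thread), a 10 Å sphere holds far more than 80 atoms:

```python
import math

# Assumed condensed-phase density: ~0.1 atoms per cubic angstrom
# (roughly liquid water, counting hydrogens).
ATOM_DENSITY = 0.1  # atoms / A^3

def atoms_within(radius_angstrom, density=ATOM_DENSITY):
    """Approximate number of atoms inside a sphere of the given cutoff radius."""
    return (4.0 / 3.0) * math.pi * radius_angstrom ** 3 * density

print(round(atoms_within(5)))   # ~52: an 80-neighbor cap is comfortable
print(round(atoms_within(10)))  # ~419: far more than 80 neighbors
```

For dilute gas-phase molecules the count is much lower, which is why a cap of 80 can still be adequate for some datasets.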
It depends on the particular samples. Does the value of max_num_neighbors apply only to training, or does it set a limit on any molecule you can ever evaluate with the trained model?
It applies always, not only during training. The argument sets the maximum number of neighbors collected by the neighbor-list algorithm. You can overwrite it when you load a model checkpoint, though, to set it to a higher number for inference, for example.
That's good to know. If I want to override it, would I just add the argument max_num_neighbors=100 in the call to load_from_checkpoint()?
I think that should work, but better make sure it actually overwrites it. I'd recommend using the torchmdnet.models.model.load_model function to load the model for inference, which strips away the PyTorch Lightning overhead. There you can just pass it as a keyword argument to overwrite it. You can also, for example, enable or disable force predictions at inference time using derivative=True/False.
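A sketch of that inference-time override. The checkpoint path is hypothetical, and the helper function is purely illustrative; max_num_neighbors and derivative are the keyword names discussed above:

```python
# Illustrative helper collecting the inference-time overrides described
# in this thread (max_num_neighbors, derivative). The helper itself is
# hypothetical; only the keyword names come from the discussion above.
def inference_overrides(max_num_neighbors=100, derivative=True):
    """Keyword arguments to pass through to load_model at inference time."""
    return {"max_num_neighbors": max_num_neighbors, "derivative": derivative}

# Usage (requires torchmd-net to be installed; checkpoint path is a placeholder):
# from torchmdnet.models.model import load_model
# model = load_model("my_model.ckpt", **inference_overrides())
```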
A cutoff of 5 with 5 interaction layers seems to be optimal, so see if that fits. Maybe Raimondas's optimizations could help?
A cutoff of 5 doesn't work for anything except very small molecules. Above about 40 atoms, a longer cutoff is essential or you get very large errors. I'm hoping that once we add explicit terms for Coulomb and dispersion, the neural network will be able to get by with a shorter cutoff.