torchmd-net Optimization of the graph network

Open raimis opened this issue 2 years ago • 14 comments

Optimization of the graph network (TorchMD_GN) with NNPOps (https://github.com/openmm/NNPOps).

In a special case, TorchMD_GN is equivalent to SchNet (https://github.com/compsciencelab/torchmd-net/issues/45#issuecomment-968962455), which is already supported by NNPOps:

TorchMD_GN(rbf_type="gauss", trainable_rbf=False, activation="ssp", neighbor_embedding=False)

[x] Implement PyTorch wrapper for CFConvNeighbors and CFConv -- https://github.com/openmm/NNPOps/pull/40
[x] Accelerate the limited TorchMD_GN with NNPOps -- https://github.com/torchmd/torchmd-net/pull/50
[x] Update the installation instructions -- https://github.com/torchmd/torchmd-net/pull/55
- [x] NNPOps package -- https://github.com/openmm/NNPOps/issues/26
- [x] PyTorch Geometric package -- https://github.com/torchmd/torchmd-net/pull/53

In general, TorchMD_GN needs these:

TorchMD_GN(rbf_type="expnorm", trainable_rbf=True, activation="silu", neighbor_embedding=True)

[ ] Implement the exponentially-modified Gaussian in CFConv (rbf_type="expnorm")
[ ] Allow passing arbitrary RBF positions to CFConv (trainable_rbf=True)
[ ] Implement the SiLU activation in CFConv (activation="silu")
[ ] Reuse CFConv to accelerate the neighbor embedding (neighbor_embedding=True)
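For reference, the difference between the two configurations can be sketched in plain Python. This is only an illustration, not the NNPOps or TorchMD-Net code: the Gaussian basis follows the usual SchNet form, the exponential-normal basis follows the general exp(-beta * (exp(-alpha * d) - mu)^2) form without any cutoff envelope, and the parameter names (centers, width, beta, alpha) are assumptions made for this example.

```python
import math


def gaussian_rbf(d, centers, width):
    """Gaussian basis (rbf_type="gauss"): one value per center."""
    return [math.exp(-((d - mu) ** 2) / (2.0 * width ** 2)) for mu in centers]


def expnorm_rbf(d, centers, beta, alpha=1.0):
    """Exponential-normal basis (rbf_type="expnorm"), cutoff envelope omitted.

    alpha and beta are illustrative parameters, not values from the library.
    """
    return [math.exp(-beta * (math.exp(-alpha * d) - mu) ** 2) for mu in centers]


def ssp(x):
    """Shifted softplus (activation="ssp"): softplus(x) - log(2)."""
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return softplus - math.log(2.0)


def silu(x):
    """SiLU (activation="silu"): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))
```

With trainable_rbf=True, the centers (and widths) would be parameters updated during training instead of fixed constants, which is why CFConv would need to accept them as inputs.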

Nov 23 '21 16:11 raimis

Regarding the interface, it should look and work like this:

import torch
from torchmdnet.models.torchmd_gn import TorchMD_GN

# Create or load a model in any way
model = TorchMD_GN()

# Optional: train or do whatever you want with the model

# Optimize the model
from torchmdnet.optimize import optimize
optimized_model = optimize(model, some_optimization_options)

# Do the inference with the model
results = optimized_model.forward(z, pos, batch)

# Optional: convert the model into TorchScript and save for external use (e.g. OpenMM-Torch)
torch.jit.script(optimized_model).save('model.pt')

It is similar to what is being implemented for the TorchANI optimization (https://github.com/raimis/NNPOps/blob/opt_ani/README.md#example).

@PhilippThoelke @stefdoerr @giadefa any comments?

Nov 24 '21 13:11 raimis

At the moment, it seems all the PyTorch Geometric packages are broken (https://github.com/pyg-team/pytorch_geometric/issues/3660).

Dec 09 '21 12:12 raimis

@peastman I have just finished integrating NNPOps (https://github.com/torchmd/torchmd-net/pull/50). The performance (https://github.com/torchmd/torchmd-net/blob/main/benchmarks/graph_network.ipynb) is only 2-3 times better for the small molecules (10-100 atoms), with no significant improvement for the larger ones.

I'll try to profile it to get better insight. At some point, we should discuss whether we can make any further improvements.

cc: @giadefa

Feb 22 '22 12:02 raimis

It would be useful to separate out all the different optimizations in NNPOps. Can you identify the effect of each one separately?

Back when we first started designing it, we discussed requirements and decided it would be optimized for molecules of about 100 atoms. The code is all designed around that assumption. If we want good performance on much larger molecules, it would need to be written differently. For example, it uses an O(n^2) algorithm to build the neighbor list, which is very fast for small molecules and very slow for large ones.
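To make the O(n^2) point concrete, here is a minimal pure-Python sketch of a brute-force neighbor list. It is not the NNPOps kernel, only an illustration of the scaling: every pair of atoms is tested once, so the work grows quadratically with the atom count.

```python
import math


def brute_force_neighbors(positions, cutoff):
    """Return all pairs (i, j) with i < j that are closer than `cutoff`."""
    pairs = []
    n = len(positions)
    for i in range(n):
        # Every atom is compared with every later atom: n * (n - 1) / 2 checks.
        for j in range(i + 1, n):
            if math.dist(positions[i], positions[j]) < cutoff:
                pairs.append((i, j))
    return pairs
```

For about 100 atoms this is roughly 5,000 distance checks, which is cheap; for a system 100 times larger it is 10,000 times more work.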

Feb 22 '22 16:02 peastman

100 particles is a good case. However, we have two ways of running multiple simulations. In the first, we batch them: copies of the same molecule are evaluated in a single NN batch to get the forces, but your kernel does not support batching. In the second, we simply make multiple copies (e.g. 64), place them far enough apart, and run them as a single system, but in this case the system size is more like 6400 particles.

I think that we need batching in the CUDA kernel for identical molecules, and cell lists, which are very fast.
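For readers unfamiliar with cell lists, a minimal sketch in pure Python follows. It assumes a non-periodic system and a cell edge equal to the cutoff; it is an illustration of the idea, not a proposal for the actual CUDA implementation.

```python
import math
from collections import defaultdict
from itertools import product


def cell_list_neighbors(positions, cutoff):
    """Return all pairs (i, j), i < j, closer than `cutoff`, using a cell list."""
    # Bin atoms into cubic cells whose edge length equals the cutoff.
    cells = defaultdict(list)
    for idx, p in enumerate(positions):
        cells[tuple(math.floor(c / cutoff) for c in p)].append(idx)

    pairs = []
    for cell, members in cells.items():
        # Only the cell itself and its 26 adjacent cells can hold neighbors.
        for offset in product((-1, 0, 1), repeat=3):
            other = tuple(c + o for c, o in zip(cell, offset))
            for i in members:
                for j in cells.get(other, ()):
                    if i < j and math.dist(positions[i], positions[j]) < cutoff:
                        pairs.append((i, j))
    return sorted(pairs)
```

At roughly constant density, each atom is compared only with atoms in a fixed number of nearby cells, so the cost grows about linearly with the number of atoms instead of quadratically.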

Feb 22 '22 16:02 giadefa

That would definitely need code changes to be efficient. You want it to know it only needs to check each atom against the other atoms in its own copy, not all the other copies. Spreading the copies out through space is also inaccurate. The further an atom is from the origin, the less precisely its position can be specified.
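The loss of precision away from the origin can be checked directly. The sketch below assumes the coordinates are stored in IEEE 754 single precision (the thread does not state the data type) and measures the gap between adjacent representable values at a given magnitude.

```python
import struct


def float32_spacing(x):
    """Gap between x (rounded to float32) and the next larger float32.

    Valid for positive finite x.
    """
    # Reinterpret the float32 bit pattern as an unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    current = struct.unpack("<f", struct.pack("<I", bits))[0]
    following = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return following - current


# Resolution near the origin versus far away, in the same length unit.
near = float32_spacing(1.0)
far = float32_spacing(1000.0)
```

The spacing doubles every time the coordinate magnitude crosses a power of two, so copies placed far from the origin have their atom positions stored on a coarser grid than the copy at the origin.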

Feb 22 '22 17:02 peastman

Yes, the accuracy is a problem, but it is quite efficient when done this way. Some tests on the forces showed reasonable results, though I agree that it is a problem. The other way is batching: can you add batching of multiple copies of the same molecule to your CUDA kernel?
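What batching would mean for the neighbor search can be sketched as follows. This is a pure-Python illustration under the assumption that each atom carries a copy index (named `batch` here, after the `batch` argument of the model's forward call); it is not the NNPOps interface.

```python
import math
from collections import defaultdict


def batched_neighbors(positions, batch, cutoff):
    """Return pairs (i, j), i < j, within `cutoff` and in the same copy."""
    # Group atom indices by the copy they belong to.
    groups = defaultdict(list)
    for idx, b in enumerate(batch):
        groups[b].append(idx)

    pairs = []
    for members in groups.values():
        # Atoms are only compared with atoms of their own copy.
        for a, i in enumerate(members):
            for j in members[a + 1:]:
                if math.dist(positions[i], positions[j]) < cutoff:
                    pairs.append((i, j))
    return pairs
```

Because atoms in different copies are never compared, the copies can all sit at the origin, which avoids both the wasted cross-copy checks and the precision loss from spreading them out in space.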

Feb 22 '22 17:02 giadefa

It's possible. Can you open an issue on the NNPOps repository describing exactly how you would want it to work?

Feb 22 '22 17:02 peastman

@raimis can you open an issue there, as you probably know the details of what you need in NNPOps?

Feb 23 '22 08:02 giadefa

Just before going into NNPOps, I checked how much CUDA Graphs (https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/) can help.

CUDA Graphs don't work with TorchMD_GN due to https://github.com/rusty1s/pytorch_cluster/issues/123. To circumvent this, I have implemented a fake neighbor search (#60), which assumes that all the atoms are neighbors, a.k.a. brute force.
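The idea of the fake neighbor search can be sketched in pure Python. This is not the code from #60; it assumes self-pairs are excluded and that pairs beyond the cutoff are masked out afterwards. The relevant property is that the pair list depends only on the number of atoms, not on their positions, so every step works on arrays of the same size, which suits the fixed sequence of operations that CUDA Graphs replay.

```python
import math


def all_pairs(n_atoms):
    """Every ordered pair (i, j) with i != j; independent of the positions."""
    return [(i, j) for i in range(n_atoms) for j in range(n_atoms) if i != j]


def pair_mask(positions, cutoff):
    """True for pairs inside the cutoff; same length as all_pairs(n)."""
    return [
        math.dist(positions[i], positions[j]) < cutoff
        for i, j in all_pairs(len(positions))
    ]
```

The price is that all n * (n - 1) pairs are processed at every step, whether or not they are within the cutoff.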

The results (https://github.com/raimis/torchmd-net/blob/poc_cuda_graph/benchmarks/graph_network.ipynb) are promising:

For alanine dipeptide (ALA2, 22 atoms) and testosterone (TST, 49 atoms), the brute force approach with CUDA Graphs beats everything else.
For chignolin (CLN, 166 atoms), the brute force approach is no longer the best and, for larger systems, it runs out of memory.
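The memory problem follows from the quadratic growth of the pair list. A rough estimate, assuming ordered pairs without self-interactions and a hypothetical 128 single-precision features stored per pair (the feature count is an assumption for illustration, not a value from the benchmark):

```python
def pair_count(n_atoms):
    """Number of ordered pairs (i, j) with i != j."""
    return n_atoms * (n_atoms - 1)


def edge_feature_bytes(n_atoms, n_features=128, bytes_per_value=4):
    """Memory for one per-pair feature array; n_features is hypothetical."""
    return pair_count(n_atoms) * n_features * bytes_per_value


# Atom counts taken from the benchmark systems named above.
for name, n in [("ALA2", 22), ("TST", 49), ("CLN", 166)]:
    print(name, pair_count(n), edge_feature_bytes(n))
```

Going from 22 to 166 atoms multiplies the atom count by about 7.5 but the pair count by about 59, and each intermediate per-pair array in the network scales the same way.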

Ping: @giadefa @peastman @claudi

Feb 23 '22 15:02 raimis

Nice. Could we batch that?

Feb 23 '22 15:02 giadefa

The current implementation doesn't support batching, but it could be implemented.

Feb 23 '22 15:02 raimis

That's interesting. It tells us that for the smaller molecules, the computation time is just dominated by kernel launch overhead.
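A toy cost model makes this reasoning explicit. The model and all numbers in it are illustrative assumptions, not measurements: each kernel launch costs a fixed overhead, and replaying a captured CUDA Graph is treated as a single launch.

```python
def step_time_us(n_kernels, launch_overhead_us, compute_us, use_graph=False):
    """Toy estimate of one forward pass: launch overhead plus compute time."""
    # Assumption: a captured graph is submitted with a single launch.
    launches = 1 if use_graph else n_kernels
    return launches * launch_overhead_us + compute_us


# Hypothetical small molecule: many tiny kernels, little actual work.
small_plain = step_time_us(100, 5.0, 50.0)
small_graph = step_time_us(100, 5.0, 50.0, use_graph=True)

# Hypothetical large molecule: same kernels, far more work per kernel.
large_plain = step_time_us(100, 5.0, 50000.0)
large_graph = step_time_us(100, 5.0, 50000.0, use_graph=True)
```

In this model the small system gets about ten times faster with a graph, while the large one barely changes, which matches the qualitative pattern in the benchmark.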

Feb 23 '22 17:02 peastman

Optimization: round 2

I have written optimized kernels for the neighbor search (#61) and message passing (#69). The kernels are drop-in replacements for the generic kernels from PyTorch Geometric and include the following optimizations:

Take into account the symmetry (i.e. if A is bonded to B, then B is bonded to A), assume 3D space, etc. (ideas borrowed from @peastman's code https://github.com/openmm/NNPOps/tree/master/src/schnet)
Compatible with CUDA Graphs
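The symmetry optimization can be illustrated with a small pure-Python sketch (not the kernels from #61 or #69): each distance is computed once for the unique pair and then reused for both directions.

```python
import math


def half_neighbor_list(positions, cutoff):
    """Unique pairs (i, j, distance) with j < i inside the cutoff."""
    half = []
    for i in range(len(positions)):
        for j in range(i):
            d = math.dist(positions[i], positions[j])
            if d < cutoff:
                half.append((i, j, d))
    return half


def expand_symmetric(half):
    """Mirror each unique pair so both directions share one distance."""
    full = []
    for i, j, d in half:
        full.append((i, j, d))
        full.append((j, i, d))
    return full
```

This halves the number of distance evaluations compared with a generic search that treats (i, j) and (j, i) as unrelated pairs.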

Speed:

kernels: uses just the new kernels
kernels+graphs: uses the new kernels and CUDA Graphs
The other benchmarks are the same as in the previous plot (https://github.com/torchmd/torchmd-net/issues/48#issuecomment-1048899019)
There is a significant speed-up for the small molecules, as it removes even more overhead.
For large molecules, the speed is comparable to @peastman's kernels, as the time is dominated by the computation itself.
The new kernels fail with STMV, but not due to a lack of memory. I still need to debug the cause.

Full details in the notebook: https://github.com/raimis/torchmd-net/blob/poc_cuda_graph_2/benchmarks/graph_network.ipynb

Apr 25 '22 14:04 raimis
