torchmd-net
Speed-up neighbors calculation
See #61
It is left as a draft PR, as I haven't had the chance to run the GPU code.
The performance results show that both inference and training take less time to execute. Most notably, around 90% of execution time is now spent on the actual inference calculations of the model, up from only about 55% (70% for small molecules). The effect is more pronounced for big molecules/loads, where the percentage of time spent performing the actual calculations goes up to 98%.
This means execution time is now dedicated almost purely to model evaluation (which hasn't changed), rather than auxiliary computations.
It makes sense that the effect is less pronounced for small molecules (although still satisfactory): in those cases the CPU implementation is already fast enough, and the GPU implementation loses non-negligible time to communication. As mentioned, it is still faster.
This is as measured by profiling TorchMD_GN.forward_call on metro16.
The results are equivalent to the original implementation up to a tolerance (10e-5) in the distances, which was the desired behaviour.
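For reference, this kind of equivalence check can be sketched with two brute-force neighbor searches compared up to a distance tolerance (a hypothetical minimal NumPy version, not the actual CUDA kernel or the `torch_cluster` code):

```python
import numpy as np

def neighbors_triu(pos, cutoff):
    """All pairs (i, j) with i < j closer than `cutoff`, plus their distances."""
    i, j = np.triu_indices(len(pos), k=1)
    d = np.linalg.norm(pos[i] - pos[j], axis=1)
    mask = d < cutoff
    return np.stack([i[mask], j[mask]]), d[mask]

def neighbors_matrix(pos, cutoff):
    """Same result computed via the full N x N distance matrix."""
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    i, j = np.nonzero(np.triu(d < cutoff, k=1))
    return np.stack([i, j]), d[i, j]

rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 10.0, size=(50, 3))

pairs_a, dist_a = neighbors_triu(pos, cutoff=5.0)
pairs_b, dist_b = neighbors_matrix(pos, cutoff=5.0)

# Both implementations must find the same pairs, with distances
# matching within the stated tolerance.
assert np.array_equal(pairs_a, pairs_b)
assert np.allclose(dist_a, dist_b, atol=1e-5)
```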
Could you post a table with the raw numbers of your benchmarks?
Sure, all of this is from metro16.
For the profiles, the absolute time measurements are skewed by the profiling itself; what matters are the percentages.
`aten::linear` and `aten::addmm` correspond to the model inference. Their time % after optimizing the neighbor search should be as high as possible.
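The tables below come from the PyTorch profiler; a minimal way to produce a table of the same shape (with a single linear layer as a stand-in for the actual model, so the numbers are illustrative only) is:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in for the real model: a single linear layer still dispatches to
# aten::linear / aten::addmm, the ops whose time % is tracked below.
model = torch.nn.Linear(128, 128)
x = torch.randn(1024, 128)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model(x)

# Per-op table: CPU time %, total time, and number of calls.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)
```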
CLN
Original
Name | Time % | Total time | # of calls |
---|---|---|---|
main | 99.98 % | 345.437 ms | 1 |
aten::linear | 80.01 % | 276.414 ms | 34 |
aten::addmm | 79.69 % | 275.307 ms | 28 |
cudaFree | 78.60 % | 271.572 ms | 2 |
radius_graph | 1.23 % | 4.235 ms | 1 |
cudaLaunchKernel | 0.93 % | 3.210 ms | 286 |
torch_cluster::radius | 0.55 % | 1.888 ms | 1 |
aten::nonzero | 0.53 % | 1.826 ms | 6 |
aten::index_select | 0.49 % | 1.707 ms | 9 |
aten::mul | 0.47 % | 1.628 ms | 42 |
aten::index | 0.44 % | 1.507 ms | 6 |
aten::embedding | 0.36 % | 1.237 ms | 2 |
aten::masked_select | 0.30 % | 1.036 ms | 2 |
New
Name | Time % | Total time | # of calls |
---|---|---|---|
main | 99.98 % | 289.450 ms | 1 |
aten::linear | 94.87 % | 274.645 ms | 34 |
aten::addmm | 94.46 % | 273.459 ms | 28 |
cudaFree | 93.16 % | 269.695 ms | 2 |
cudaLaunchKernel | 0.93 % | 2.699 ms | 231 |
neighbors::get_neighbor_list | 0.78 % | 2.263 ms | 1 |
aten::index_select | 0.72 % | 2.081 ms | 14 |
aten::mul | 0.56 % | 1.635 ms | 42 |
aten::embedding | 0.43 % | 1.251 ms | 2 |
CLN batch size 64
Original
Name | Time % | Total time | # of calls |
---|---|---|---|
main | 99.99 % | 518.933 ms | 1 |
aten::linear | 55.20 % | 286.492 ms | 34 |
aten::addmm | 54.99 % | 285.403 ms | 28 |
cudaFree | 54.28 % | 281.679 ms | 2 |
cudaMemcpyAsync | 30.14 % | 156.412 ms | 13 |
aten::item | 19.58 % | 101.631 ms | 6 |
aten::_local_scalar_dense | 19.58 % | 101.596 ms | 6 |
radius_graph | 12.85 % | 66.681 ms | 1 |
aten::nonzero | 10.91 % | 56.635 ms | 6 |
torch_cluster::radius | 10.73 % | 55.708 ms | 1 |
aten::masked_select | 9.07 % | 47.086 ms | 2 |
aten::index | 1.98 % | 10.257 ms | 6 |
cudaMalloc | 1.68 % | 8.713 ms | 7 |
aten::empty | 1.58 % | 8.200 ms | 52 |
aten::full | 1.50 % | 7.770 ms | 2 |
aten::is_nonzero | 0.97 % | 5.010 ms | 2 |
cudaLaunchKernel | 0.47 % | 2.424 ms | 293 |
aten::index_select | 0.32 % | 1.685 ms | 9 |
aten::mul | 0.29 % | 1.520 ms | 42 |
aten::embedding | 0.24 % | 1.240 ms | 2 |
New
Name | Time % | Total time | # of calls |
---|---|---|---|
main | 99.99 % | 299.770 ms | 1 |
aten::linear | 91.54 % | 274.439 ms | 34 |
aten::addmm | 91.16 % | 273.321 ms | 28 |
cudaFree | 89.94 % | 269.639 ms | 2 |
aten::item | 3.35 % | 10.054 ms | 9 |
aten::_local_scalar_dense | 3.33 % | 9.998 ms | 9 |
cudaMemcpyAsync | 3.28 % | 9.820 ms | 9 |
neighbors::get_neighbor_list | 1.28 % | 3.837 ms | 1 |
FC9
Original
Name | Time % | Total time | # of calls |
---|---|---|---|
main | 99.99 % | 472.592 ms | 1 |
aten::linear | 59.15 % | 279.578 ms | 34 |
aten::addmm | 58.92 % | 278.482 ms | 28 |
cudaFree | 58.10 % | 274.618 ms | 2 |
cudaMemcpyAsync | 24.85 % | 117.462 ms | 13 |
aten::item | 14.45 % | 68.286 ms | 6 |
aten::_local_scalar_dense | 14.44 % | 68.250 ms | 6 |
radius_graph | 13.02 % | 61.555 ms | 1 |
torch_cluster::radius | 11.15 % | 52.713 ms | 1 |
aten::nonzero | 10.82 % | 51.121 ms | 6 |
aten::masked_select | 9.64 % | 45.580 ms | 2 |
cudaMalloc | 1.60 % | 7.559 ms | 9 |
aten::empty | 1.45 % | 6.847 ms | 52 |
aten::full | 1.36 % | 6.447 ms | 2 |
aten::index | 1.33 % | 6.303 ms | 6 |
cudaLaunchKernel | 0.71 % | 3.347 ms | 293 |
aten::is_nonzero | 0.65 % | 3.063 ms | 2 |
aten::mul | 0.39 % | 1.865 ms | 42 |
aten::index_select | 0.37 % | 1.740 ms | 9 |
aten::embedding | 0.27 % | 1.262 ms | 2 |
New
Name | Time % | Total time | # of calls |
---|---|---|---|
main | 99.98 % | 296.037 ms | 1 |
aten::linear | 92.72 % | 274.534 ms | 34 |
aten::addmm | 92.32 % | 273.345 ms | 28 |
cudaFree | 91.01 % | 269.472 ms | 2 |
aten::item | 2.29 % | 6.785 ms | 7 |
aten::_local_scalar_dense | 2.28 % | 6.738 ms | 7 |
cudaMemcpyAsync | 2.23 % | 6.594 ms | 7 |
cudaLaunchKernel | 0.91 % | 2.686 ms | 231 |
neighbors::get_neighbor_list | 0.82 % | 2.433 ms | 1 |
aten::index_select | 0.75 % | 2.210 ms | 14 |
aten::mul | 0.63 % | 1.868 ms | 42 |
aten::embedding | 0.42 % | 1.255 ms | 2 |
This selection of examples hopefully covers a wide enough range of scenarios: CLN is one of the smallest molecules, CLN batched 64 times is one of the biggest systems that could be evaluated on the hardware, and FC9 is the biggest single molecule that can be executed.
Times (ms)
The following are elapsed times, calculated with the code from your benchmarks notebook. They also show that the new implementation is much more memory efficient, allowing us to run much bigger batch sizes.
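I don't have the notebook code inline here, but a timing loop of the same shape (helper name hypothetical) needs to synchronize the GPU around each call, since CUDA kernel launches are asynchronous and would otherwise not be counted:

```python
import time
import torch

def median_time_ms(fn, warmup=10, iters=50):
    """Median wall-clock time of fn() in ms, synchronizing the GPU so that
    asynchronous kernel launches are actually included in the measurement."""
    for _ in range(warmup):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    return samples[len(samples) // 2]

# Example: time a small matmul as a stand-in for a model forward pass.
a = torch.randn(256, 256)
elapsed = median_time_ms(lambda: a @ a)
print(f"{elapsed:.3f} ms")
```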
Original
Batch size\Protein | ALA2 | CLN | DHFR | FC9 |
---|---|---|---|---|
1 | 5.59 | 6.52 | 50.94 | 124.32 |
2 | 5.70 | 7.47 | 95.92 | |
4 | 5.94 | 12.41 | 185.88 | |
8 | 6.45 | 22.58 | | |
16 | 7.38 | 43.11 | | |
32 | 9.33 | 84.62 | | |
64 | 16.32 | 167.19 | | |
128 | 30.56 | | | |
256 | 59.32 | | | |
512 | 117.01 | | | |
New
Batch size\Protein | ALA2 | CLN | DHFR | FC9 |
---|---|---|---|---|
1 | 4.95 | 5.09 | 16.53 | 17.14 |
2 | 4.97 | 5.17 | 17.27 | 18.84 |
4 | 5.00 | 8.20 | 18.78 | 22.53 |
8 | 5.10 | 14.64 | 22.06 | 30.08 |
16 | 5.15 | 16.31 | 28.89 | 46.46 |
32 | 5.36 | 17.50 | 44.78 | 85.30 |
64 | 8.78 | 20.10 | 86.33 | 199.83 |
128 | 15.83 | 26.49 | 224.14 | 595.96 |
256 | 18.34 | 44.89 | 835.28 | 2234.87 |
512 | 21.69 | 102.03 | 3724.39 | 9306.15 |
1024 | 31.17 | 306.84 | 16505.30 | |
@raimis have you had the chance to look at this?
Yes, the speed up of DHFR and FC9 looks very good.