Speed-up neighbors calculation

Open claudi opened this issue 2 years ago • 6 comments

See #61

claudi avatar Mar 29 '22 14:03 claudi

It is left as a draft PR, as I haven't had the chance to run the GPU code.

claudi avatar Mar 29 '22 14:03 claudi

The performance results show that both inference and training executions take less time. Most notably, around 90% of execution time is now spent on the actual inference calculations of the model, up from only about 55% (70% for small molecules). The effect is more pronounced for big molecules/loads, where the percentage of time spent performing the actual calculations goes up to 98%.

This means execution time is now dedicated almost purely to model evaluation (which hasn't changed), rather than auxiliary computations.

It makes sense that the effect is less pronounced for small molecules (although still satisfactory): in those cases the CPU implementation is already fast enough, and the GPU implementation loses non-negligible time to communication. As mentioned, it is still faster.

These figures were measured by profiling TorchMD_GN.forward_call on metro16.

The results are equivalent to those of the original implementation, up to a tolerance (10e-5) in the distances, which was the desired behaviour.
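
Not the PR's actual test code, but a self-contained sketch of this style of tolerance check. It validates the distances along the edges returned by `torch_cluster.radius_graph` (the original code path, per the profiles below) against a brute-force O(N²) reference; checking a new neighbor kernel against the original path would follow the same pattern:

```python
import torch
from torch_cluster import radius_graph

def brute_force_distances(pos: torch.Tensor, cutoff: float) -> torch.Tensor:
    # Reference O(N^2) pairwise distances within the cutoff, self-pairs excluded.
    d = torch.cdist(pos, pos)
    mask = (d < cutoff) & ~torch.eye(len(pos), dtype=torch.bool, device=pos.device)
    return d[mask]

pos = torch.randn(100, 3)
cutoff = 5.0

# Distances along the edges reported by radius_graph.
edge_index = radius_graph(pos, r=cutoff, max_num_neighbors=len(pos))
d_graph = (pos[edge_index[0]] - pos[edge_index[1]]).norm(dim=-1)

# Sorting makes the check independent of the order in which pairs are enumerated;
# atol=1e-4 corresponds to the 10e-5 tolerance quoted above.
assert torch.allclose(brute_force_distances(pos, cutoff).sort().values,
                      d_graph.sort().values, atol=1e-4)
```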

claudi avatar Jun 24 '22 10:06 claudi

Could you post a table with the raw numbers of your benchmarks?

raimis avatar Jun 27 '22 10:06 raimis

Sure, all of this is from metro16.

For the profiles, the absolute time measurements are inflated by the profiling overhead itself; what matters are the percentages.

aten::linear and aten::addmm correspond to the model inference itself, so after optimizing the neighbor search their share of the total time should be as high as possible. A sketch of how such tables can be produced is shown below.
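
For illustration, a minimal profiling snippet of the kind that produces such per-operator tables, using torch.autograd.profiler (the actual profiling setup used here may differ, and the model below is a stand-in rather than TorchMD_GN):

```python
import torch
from torch.autograd import profiler

# Stand-in model; in the benchmarks below this would be a TorchMD_GN forward call.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 128), torch.nn.SiLU(), torch.nn.Linear(128, 128)
).cuda()
x = torch.randn(1024, 128, device="cuda")

with profiler.profile(use_cuda=True) as prof:
    model(x)
    torch.cuda.synchronize()  # capture the asynchronously queued GPU work

# Per-operator totals and percentages (aten::linear, aten::addmm, ...),
# analogous to the tables below.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```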

CLN

Original

| Name | Time % | Total time | # of calls |
| --- | --- | --- | --- |
| main | 99.98 % | 345.437 ms | 1 |
| aten::linear | 80.01 % | 276.414 ms | 34 |
| aten::addmm | 79.69 % | 275.307 ms | 28 |
| cudaFree | 78.60 % | 271.572 ms | 2 |
| radius_graph | 1.23 % | 4.235 ms | 1 |
| cudaLaunchKernel | 0.93 % | 3.210 ms | 286 |
| torch_cluster::radius | 0.55 % | 1.888 ms | 1 |
| aten::nonzero | 0.53 % | 1.826 ms | 6 |
| aten::index_select | 0.49 % | 1.707 ms | 9 |
| aten::mul | 0.47 % | 1.628 ms | 42 |
| aten::index | 0.44 % | 1.507 ms | 6 |
| aten::embedding | 0.36 % | 1.237 ms | 2 |
| aten::masked_select | 0.30 % | 1.036 ms | 2 |

New

| Name | Time % | Total time | # of calls |
| --- | --- | --- | --- |
| main | 99.98 % | 289.450 ms | 1 |
| aten::linear | 94.87 % | 274.645 ms | 34 |
| aten::addmm | 94.46 % | 273.459 ms | 28 |
| cudaFree | 93.16 % | 269.695 ms | 2 |
| cudaLaunchKernel | 0.93 % | 2.699 ms | 231 |
| neighbors::get_neighbor_list | 0.78 % | 2.263 ms | 1 |
| aten::index_select | 0.72 % | 2.081 ms | 14 |
| aten::mul | 0.56 % | 1.635 ms | 42 |
| aten::embedding | 0.43 % | 1.251 ms | 2 |

CLN batch size 64

Original

| Name | Time % | Total time | # of calls |
| --- | --- | --- | --- |
| main | 99.99 % | 518.933 ms | 1 |
| aten::linear | 55.20 % | 286.492 ms | 34 |
| aten::addmm | 54.99 % | 285.403 ms | 28 |
| cudaFree | 54.28 % | 281.679 ms | 2 |
| cudaMemcpyAsync | 30.14 % | 156.412 ms | 13 |
| aten::item | 19.58 % | 101.631 ms | 6 |
| aten::_local_scalar_dense | 19.58 % | 101.596 ms | 6 |
| radius_graph | 12.85 % | 66.681 ms | 1 |
| aten::nonzero | 10.91 % | 56.635 ms | 6 |
| torch_cluster::radius | 10.73 % | 55.708 ms | 1 |
| aten::masked_select | 9.07 % | 47.086 ms | 2 |
| aten::index | 1.98 % | 10.257 ms | 6 |
| cudaMalloc | 1.68 % | 8.713 ms | 7 |
| aten::empty | 1.58 % | 8.200 ms | 52 |
| aten::full | 1.50 % | 7.770 ms | 2 |
| aten::is_nonzero | 0.97 % | 5.010 ms | 2 |
| cudaLaunchKernel | 0.47 % | 2.424 ms | 293 |
| aten::index_select | 0.32 % | 1.685 ms | 9 |
| aten::mul | 0.29 % | 1.520 ms | 42 |
| aten::embedding | 0.24 % | 1.240 ms | 2 |

New

| Name | Time % | Total time | # of calls |
| --- | --- | --- | --- |
| main | 99.99 % | 299.770 ms | 1 |
| aten::linear | 91.54 % | 274.439 ms | 34 |
| aten::addmm | 91.16 % | 273.321 ms | 28 |
| cudaFree | 89.94 % | 269.639 ms | 2 |
| aten::item | 3.35 % | 10.054 ms | 9 |
| aten::_local_scalar_dense | 3.33 % | 9.998 ms | 9 |
| cudaMemcpyAsync | 3.28 % | 9.820 ms | 9 |
| neighbors::get_neighbor_list | 1.28 % | 3.837 ms | 1 |

FC9

Original

| Name | Time % | Total time | # of calls |
| --- | --- | --- | --- |
| main | 99.99 % | 472.592 ms | 1 |
| aten::linear | 59.15 % | 279.578 ms | 34 |
| aten::addmm | 58.92 % | 278.482 ms | 28 |
| cudaFree | 58.10 % | 274.618 ms | 2 |
| cudaMemcpyAsync | 24.85 % | 117.462 ms | 13 |
| aten::item | 14.45 % | 68.286 ms | 6 |
| aten::_local_scalar_dense | 14.44 % | 68.250 ms | 6 |
| radius_graph | 13.02 % | 61.555 ms | 1 |
| torch_cluster::radius | 11.15 % | 52.713 ms | 1 |
| aten::nonzero | 10.82 % | 51.121 ms | 6 |
| aten::masked_select | 9.64 % | 45.580 ms | 2 |
| cudaMalloc | 1.60 % | 7.559 ms | 9 |
| aten::empty | 1.45 % | 6.847 ms | 52 |
| aten::full | 1.36 % | 6.447 ms | 2 |
| aten::index | 1.33 % | 6.303 ms | 6 |
| cudaLaunchKernel | 0.71 % | 3.347 ms | 293 |
| aten::is_nonzero | 0.65 % | 3.063 ms | 2 |
| aten::mul | 0.39 % | 1.865 ms | 42 |
| aten::index_select | 0.37 % | 1.740 ms | 9 |
| aten::embedding | 0.27 % | 1.262 ms | 2 |

New

| Name | Time % | Total time | # of calls |
| --- | --- | --- | --- |
| main | 99.98 % | 296.037 ms | 1 |
| aten::linear | 92.72 % | 274.534 ms | 34 |
| aten::addmm | 92.32 % | 273.345 ms | 28 |
| cudaFree | 91.01 % | 269.472 ms | 2 |
| aten::item | 2.29 % | 6.785 ms | 7 |
| aten::_local_scalar_dense | 2.28 % | 6.738 ms | 7 |
| cudaMemcpyAsync | 2.23 % | 6.594 ms | 7 |
| cudaLaunchKernel | 0.91 % | 2.686 ms | 231 |
| neighbors::get_neighbor_list | 0.82 % | 2.433 ms | 1 |
| aten::index_select | 0.75 % | 2.210 ms | 14 |
| aten::mul | 0.63 % | 1.868 ms | 42 |
| aten::embedding | 0.42 % | 1.255 ms | 2 |

This selection of examples hopefully covers a wide enough range of scenarios: CLN is one of the smallest molecules, CLN batched 64 times is one of the biggest systems that could be evaluated on the hardware, and FC9 is the biggest molecule that can be executed.

Times (ms)

The following tables show elapsed times, calculated with the code from your benchmarks notebook. They also show that the new implementation is much more memory efficient, allowing much bigger batch sizes.
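
Not the notebook's exact code, but a minimal timing helper in the same spirit; the model(z, pos, batch) call convention is assumed from TorchMD-Net's model API. The torch.cuda.synchronize calls matter because CUDA kernels launch asynchronously, so without them the timer would mostly measure launch overhead:

```python
import time
import torch

def elapsed_ms(model, z, pos, batch, warmup=10, iters=100):
    """Average wall-clock time (ms) per forward call."""
    for _ in range(warmup):          # warm up kernels, caches, and allocator
        model(z, pos, batch)
    torch.cuda.synchronize()         # drain pending GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(z, pos, batch)
    torch.cuda.synchronize()         # wait for all iterations to finish
    return (time.perf_counter() - start) / iters * 1e3
```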

Original

| Batch size \ Protein | ALA2 | CLN | DHFR | FC9 |
| --- | --- | --- | --- | --- |
| 1 | 5.59 | 6.52 | 50.94 | 124.32 |
| 2 | 5.70 | 7.47 | 95.92 | |
| 4 | 5.94 | 12.41 | 185.88 | |
| 8 | 6.45 | 22.58 | | |
| 16 | 7.38 | 43.11 | | |
| 32 | 9.33 | 84.62 | | |
| 64 | 16.32 | 167.19 | | |
| 128 | 30.56 | | | |
| 256 | 59.32 | | | |
| 512 | 117.01 | | | |

New

| Batch size \ Protein | ALA2 | CLN | DHFR | FC9 |
| --- | --- | --- | --- | --- |
| 1 | 4.95 | 5.09 | 16.53 | 17.14 |
| 2 | 4.97 | 5.17 | 17.27 | 18.84 |
| 4 | 5.00 | 8.20 | 18.78 | 22.53 |
| 8 | 5.10 | 14.64 | 22.06 | 30.08 |
| 16 | 5.15 | 16.31 | 28.89 | 46.46 |
| 32 | 5.36 | 17.50 | 44.78 | 85.30 |
| 64 | 8.78 | 20.10 | 86.33 | 199.83 |
| 128 | 15.83 | 26.49 | 224.14 | 595.96 |
| 256 | 18.34 | 44.89 | 835.28 | 2234.87 |
| 512 | 21.69 | 102.03 | 3724.39 | 9306.15 |
| 1024 | 31.17 | 306.84 | 16505.30 | |

claudi avatar Jun 28 '22 16:06 claudi

@raimis have you had the chance to look at this?

claudi avatar Jul 05 '22 17:07 claudi

Yes, the speed-up for DHFR and FC9 looks very good.

raimis avatar Jul 06 '22 14:07 raimis