torchmd-net
Speed-up neighbors calculation
See #61
It is left as a draft PR, as I haven't had the chance to run the GPU code.
The performance results show that both inference and training take less time to execute. Most notably, around 90% of execution time is now spent on the actual inference calculations of the model, up from only about 55% (70% for small molecules). The effect is more pronounced for big molecules/loads, where the percentage of time spent performing the actual calculations goes up to 98%.
This means execution time is now dedicated almost purely to model evaluation (which hasn't changed), rather than auxiliary computations.
It makes sense that the effect is less pronounced for small molecules (although still satisfactory): in those cases the CPU implementation is already fast enough, and the GPU implementation loses non-negligible time to communication. As mentioned, it is still faster.
This is as measured by profiling TorchMD_GN.forward_call on metro16.
The results are equivalent to the original implementation up to a tolerance (10e-5) in the distances, which was the desired behaviour.
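For reference, this kind of equivalence check can be sketched with two brute-force neighbor searches compared up to a distance tolerance (a hypothetical minimal NumPy version, not the actual CUDA kernel or the `torch_cluster` code):

```python
import numpy as np

def neighbors_triu(pos, cutoff):
    """All pairs (i, j) with i < j closer than `cutoff`, plus their distances."""
    i, j = np.triu_indices(len(pos), k=1)
    d = np.linalg.norm(pos[i] - pos[j], axis=1)
    mask = d < cutoff
    return np.stack([i[mask], j[mask]]), d[mask]

def neighbors_matrix(pos, cutoff):
    """Same result computed via the full N x N distance matrix."""
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    i, j = np.nonzero(np.triu(d < cutoff, k=1))
    return np.stack([i, j]), d[i, j]

rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 10.0, size=(50, 3))

pairs_a, dist_a = neighbors_triu(pos, cutoff=5.0)
pairs_b, dist_b = neighbors_matrix(pos, cutoff=5.0)

# Both implementations must find the same pairs, with distances
# matching within the stated tolerance.
assert np.array_equal(pairs_a, pairs_b)
assert np.allclose(dist_a, dist_b, atol=1e-5)
```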
Could you post a table with the raw numbers of your benchmarks?
Sure, all of this is from metro16.
For the profiles, the absolute time measurements are skewed by the profiling itself; what matters are the percentages.
`aten::linear` and `aten::addmm` correspond to the model inference. Their time % after optimizing the neighbor search should be as high as possible.
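The tables below come from the PyTorch profiler; a minimal way to produce a table of the same shape (with a single linear layer as a stand-in for the actual model, so the numbers are illustrative only) is:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in for the real model: a single linear layer still dispatches to
# aten::linear / aten::addmm, the ops whose time % is tracked below.
model = torch.nn.Linear(128, 128)
x = torch.randn(1024, 128)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model(x)

# Per-op table: CPU time %, total time, and number of calls.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)
```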
CLN
Original
Name | Time % | Total time | # of calls |
---|---|---|---|
main | 99.98 % | 345.437 ms | 1 |
aten::linear | 80.01 % | 276.414 ms | 34 |
aten::addmm | 79.69 % | 275.307 ms | 28 |
cudaFree | 78.60 % | 271.572 ms | 2 |
radius_graph | 1.23 % | 4.235 ms | 1 |
cudaLaunchKernel | 0.93 % | 3.210 ms | 286 |
torch_cluster::radius | 0.55 % | 1.888 ms | 1 |
aten::nonzero | 0.53 % | 1.826 ms | 6 |
aten::index_select | 0.49 % | 1.707 ms | 9 |
aten::mul | 0.47 % | 1.628 ms | 42 |
aten::index | 0.44 % | 1.507 ms | 6 |
aten::embedding | 0.36 % | 1.237 ms | 2 |
aten::masked_select | 0.30 % | 1.036 ms | 2 |
New
Name | Time % | Total time | # of calls |
---|---|---|---|
main | 99.98 % | 289.450 ms | 1 |
aten::linear | 94.87 % | 274.645 ms | 34 |
aten::addmm | 94.46 % | 273.459 ms | 28 |
cudaFree | 93.16 % | 269.695 ms | 2 |
cudaLaunchKernel | 0.93 % | 2.699 ms | 231 |
neighbors::get_neighbor_list | 0.78 % | 2.263 ms | 1 |
aten::index_select | 0.72 % | 2.081 ms | 14 |
aten::mul | 0.56 % | 1.635 ms | 42 |
aten::embedding | 0.43 % | 1.251 ms | 2 |
CLN batch size 64
Original
Name | Time % | Total time | # of calls |
---|---|---|---|
main | 99.99 % | 518.933 ms | 1 |
aten::linear | 55.20 % | 286.492 ms | 34 |
aten::addmm | 54.99 % | 285.403 ms | 28 |
cudaFree | 54.28 % | 281.679 ms | 2 |
cudaMemcpyAsync | 30.14 % | 156.412 ms | 13 |
aten::item | 19.58 % | 101.631 ms | 6 |
aten::_local_scalar_dense | 19.58 % | 101.596 ms | 6 |
radius_graph | 12.85 % | 66.681 ms | 1 |
aten::nonzero | 10.91 % | 56.635 ms | 6 |
torch_cluster::radius | 10.73 % | 55.708 ms | 1 |
aten::masked_select | 9.07 % | 47.086 ms | 2 |
aten::index | 1.98 % | 10.257 ms | 6 |
cudaMalloc | 1.68 % | 8.713 ms | 7 |
aten::empty | 1.58 % | 8.200 ms | 52 |
aten::full | 1.50 % | 7.770 ms | 2 |
aten::is_nonzero | 0.97 % | 5.010 ms | 2 |
cudaLaunchKernel | 0.47 % | 2.424 ms | 293 |
aten::index_select | 0.32 % | 1.685 ms | 9 |
aten::mul | 0.29 % | 1.520 ms | 42 |
aten::embedding | 0.24 % | 1.240 ms | 2 |
New
Name | Time % | Total time | # of calls |
---|---|---|---|
main | 99.99 % | 299.770 ms | 1 |
aten::linear | 91.54 % | 274.439 ms | 34 |
aten::addmm | 91.16 % | 273.321 ms | 28 |
cudaFree | 89.94 % | 269.639 ms | 2 |
aten::item | 3.35 % | 10.054 ms | 9 |
aten::_local_scalar_dense | 3.33 % | 9.998 ms | 9 |
cudaMemcpyAsync | 3.28 % | 9.820 ms | 9 |
neighbors::get_neighbor_list | 1.28 % | 3.837 ms | 1 |
FC9
Original
Name | Time % | Total time | # of calls |
---|---|---|---|
main | 99.99 % | 472.592 ms | 1 |
aten::linear | 59.15 % | 279.578 ms | 34 |
aten::addmm | 58.92 % | 278.482 ms | 28 |
cudaFree | 58.10 % | 274.618 ms | 2 |
cudaMemcpyAsync | 24.85 % | 117.462 ms | 13 |
aten::item | 14.45 % | 68.286 ms | 6 |
aten::_local_scalar_dense | 14.44 % | 68.250 ms | 6 |
radius_graph | 13.02 % | 61.555 ms | 1 |
torch_cluster::radius | 11.15 % | 52.713 ms | 1 |
aten::nonzero | 10.82 % | 51.121 ms | 6 |
aten::masked_select | 9.64 % | 45.580 ms | 2 |
cudaMalloc | 1.60 % | 7.559 ms | 9 |
aten::empty | 1.45 % | 6.847 ms | 52 |
aten::full | 1.36 % | 6.447 ms | 2 |
aten::index | 1.33 % | 6.303 ms | 6 |
cudaLaunchKernel | 0.71 % | 3.347 ms | 293 |
aten::is_nonzero | 0.65 % | 3.063 ms | 2 |
aten::mul | 0.39 % | 1.865 ms | 42 |
aten::index_select | 0.37 % | 1.740 ms | 9 |
aten::embedding | 0.27 % | 1.262 ms | 2 |
New
Name | Time % | Total time | # of calls |
---|---|---|---|
main | 99.98 % | 296.037 ms | 1 |
aten::linear | 92.72 % | 274.534 ms | 34 |
aten::addmm | 92.32 % | 273.345 ms | 28 |
cudaFree | 91.01 % | 269.472 ms | 2 |
aten::item | 2.29 % | 6.785 ms | 7 |
aten::_local_scalar_dense | 2.28 % | 6.738 ms | 7 |
cudaMemcpyAsync | 2.23 % | 6.594 ms | 7 |
cudaLaunchKernel | 0.91 % | 2.686 ms | 231 |
neighbors::get_neighbor_list | 0.82 % | 2.433 ms | 1 |
aten::index_select | 0.75 % | 2.210 ms | 14 |
aten::mul | 0.63 % | 1.868 ms | 42 |
aten::embedding | 0.42 % | 1.255 ms | 2 |
This selection of examples hopefully covers a wide enough range of scenarios: CLN is one of the smallest molecules, CLN batched 64 times is one of the biggest systems that could be evaluated on the hardware, and FC9 is the biggest single molecule that can be executed.
Times (ms)
The following are elapsed times, calculated with the code from your benchmarks notebook. They also show that the new implementation is much more memory efficient, allowing us to run much bigger batch sizes.
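I don't have the notebook code inline here, but a timing loop of the same shape (helper name hypothetical) needs to synchronize the GPU around each call, since CUDA kernel launches are asynchronous and would otherwise not be counted:

```python
import time
import torch

def median_time_ms(fn, warmup=10, iters=50):
    """Median wall-clock time of fn() in ms, synchronizing the GPU so that
    asynchronous kernel launches are actually included in the measurement."""
    for _ in range(warmup):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    return samples[len(samples) // 2]

# Example: time a small matmul as a stand-in for a model forward pass.
a = torch.randn(256, 256)
elapsed = median_time_ms(lambda: a @ a)
print(f"{elapsed:.3f} ms")
```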
Original
Batch size\Protein | ALA2 | CLN | DHFR | FC9 |
---|---|---|---|---|
1 | 5.59 | 6.52 | 50.94 | 124.32 |
2 | 5.70 | 7.47 | 95.92 | |
4 | 5.94 | 12.41 | 185.88 | |
8 | 6.45 | 22.58 | | |
16 | 7.38 | 43.11 | | |
32 | 9.33 | 84.62 | | |
64 | 16.32 | 167.19 | | |
128 | 30.56 | | | |
256 | 59.32 | | | |
512 | 117.01 | | | |
New
Batch size\Protein | ALA2 | CLN | DHFR | FC9 |
---|---|---|---|---|
1 | 4.95 | 5.09 | 16.53 | 17.14 |
2 | 4.97 | 5.17 | 17.27 | 18.84 |
4 | 5.00 | 8.20 | 18.78 | 22.53 |
8 | 5.10 | 14.64 | 22.06 | 30.08 |
16 | 5.15 | 16.31 | 28.89 | 46.46 |
32 | 5.36 | 17.50 | 44.78 | 85.30 |
64 | 8.78 | 20.10 | 86.33 | 199.83 |
128 | 15.83 | 26.49 | 224.14 | 595.96 |
256 | 18.34 | 44.89 | 835.28 | 2234.87 |
512 | 21.69 | 102.03 | 3724.39 | 9306.15 |
1024 | 31.17 | 306.84 | 16505.30 | |
@raimis have you had the chance to look at this?
Yes, the speed up of DHFR and FC9 looks very good.