AutoDock-GPU

Speeding up sum reductions in ADADELTA by using Tensor Cores

Open L30nardoSV opened this issue 1 year ago • 33 comments

Hi,

This PR aims to increase the performance of the CUDA version by leveraging the Tensor Cores Units (TCU) present in recent NVIDIA GPUs.

The idea is to re-implement the sum reductions as matrix operations (i.e., by using NVIDIA Warp Matrix Functions), which can be offloaded to TCUs.

Experiments on A100 GPU (`make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=80 test`):

| Docking time | Original | Tensor |
|--------------|----------|--------|
| In seconds   | 0.8      | 0.6    |

Experiments on RTX3050Ti GPU (`make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=86 test`):

| Docking time | Original | Tensor |
|--------------|----------|--------|
| In seconds   | 2.4      | 1.7    |

The baseline implementation for this PR has been taken from this paper: Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores. The contribution of both authors, Gabin Schieffer (@gabin-s) and Ivy Peng (@bopkth), is acknowledged in this PR as well:

Schieffer, Gabin, and Peng, Ivy. "Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores."
In European Conference on Parallel Processing, pp. 608-622. Cham: Springer Nature Switzerland, 2023.

L30nardoSV avatar Jan 11 '24 14:01 L30nardoSV

@L30nardoSV Thank you very much, I am currently testing. Please encapsulate the code a bit and make it a compile option so older Cuda versions and cards still compile and run.

atillack avatar Jan 11 '24 16:01 atillack

> @L30nardoSV Thank you very much, I am currently testing. Please encapsulate the code a bit and make it a compile option so older Cuda versions and cards still compile and run.

OK, let me know if the TENSOR directive in commit 10b07fa6a suffices

L30nardoSV avatar Jan 11 '24 18:01 L30nardoSV

@L30nardoSV I tested on one of our Nvidia Quadro RTX A5000 cards and I do see a nice speedup for the 3ce3 example input:

Docking time of the PR with `make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=86 test` is 0.70 seconds vs 0.90 seconds (this does use the heuristics and autostop by default).

To evaluate a bit further I used Diogo's test set of 42 ligands, here are the results:

Reference:

| Path   | NUMWI | AutoStop & Heuristics | overall evals | energy       | rmsd         | docking  | idle   |
|--------|-------|-----------------------|---------------|--------------|--------------|----------|--------|
| OpenCL | 128   | no, 2.5M evals        | 105382018     | 36 / 42 good | 36 / 42 good | 91.69 s  | 0.17 s |
| Cuda   | 128   | no, 2.5M evals        | 105331182     | 36 / 42 good | 36 / 42 good | 90.75 s  | 0.32 s |
| Cuda   | 64    | no, 2.5M evals        | 105404961     | 36 / 42 good | 35 / 42 good | 106.24 s | 7.93 s |
| OpenCL | 128   | yes                   | 84026192      | 37 / 42 good | 37 / 42 good | 187.82 s | 0.21 s |
| Cuda   | 128   | yes                   | 80037847      | 38 / 42 good | 38 / 42 good | 184.85 s | 0.20 s |
| Cuda   | 64    | yes                   | 84684628      | 36 / 42 good | 38 / 42 good | 233.39 s | 8.13 s |

This PR:

| Path   | NUMWI | AutoStop & Heuristics | overall evals | energy       | rmsd         | docking  | idle   |
|--------|-------|-----------------------|---------------|--------------|--------------|----------|--------|
| OpenCL | 128   | no, 2.5M evals        | 105362595     | 38 / 42 good | 36 / 42 good | 92.33 s  | 0.21 s |
| Cuda   | 128   | no, 2.5M evals        | 105177642     | 35 / 42 good | 36 / 42 good | 100.71 s | 0.20 s |
| Cuda   | 64    | no, 2.5M evals        | 105197433     | 35 / 42 good | 38 / 42 good | 112.48 s | 0.19 s |
| OpenCL | 128   | yes                   | 86495325      | 37 / 42 good | 38 / 42 good | 192.71 s | 0.21 s |
| Cuda   | 128   | yes                   | 71419809      | 33 / 42 good | 37 / 42 good | 182.30 s | 0.21 s |
| Cuda   | 64    | yes                   | 65754981      | 34 / 42 good | 37 / 42 good | 214.60 s | 0.22 s |

Across multiple differently sized ligands with the typical settings, it turns out that for larger systems the speedup can turn into a slowdown.

It looks like the average number of evals w/ AutoStop changed in the PR, which could point to a minute difference in the calculation (I did test multiple times to make sure this wasn't just an unlucky run).

@diogomart Please run your E50 tests for the Cuda version.

atillack avatar Jan 11 '24 19:01 atillack

@L30nardoSV Thank you for the encapsulation :-)

atillack avatar Jan 11 '24 19:01 atillack

Unfortunately, algorithmic performance is worse.

[Plot: 79f13c7-ocl-128wi vs PR252-10b07fa-cuda-tensor-128wi-overlap]

diogomart avatar Jan 12 '24 17:01 diogomart

@atillack

Can you please check commit b2ab3fe, which incorporates a WMMA extension for single-precision matmul on Tensor Cores with error correction (TCEC)?

`make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=80 TENSOR=ON TCEC=ON test`

Ref: https://github.com/wmmae/wmma_extension/blob/main/docs/mma_f32.md

L30nardoSV avatar Feb 16 '24 19:02 L30nardoSV

@L30nardoSV I ran the newest version and here are the results (with OpenCL from before as comparison; note: I compiled w/o OVERLAP so the last column takes a bit longer, but compute times are unaffected):

| Path   | NUMWI | AutoStop & Heuristics | overall evals | energy       | rmsd         | docking  | idle    |
|--------|-------|-----------------------|---------------|--------------|--------------|----------|---------|
| OpenCL | 128   | yes                   | 86495325      | 37 / 42 good | 38 / 42 good | 192.71 s | 0.21 s  |
| Cuda   | 128   | yes                   | 88164353      | 36 / 42 good | 38 / 42 good | 194.13 s | 7.74 s  |
| Cuda   | 64    | yes                   | 77884078      | 37 / 42 good | 37 / 42 good | 214.27 s | 21.54 s |

atillack avatar Feb 20 '24 20:02 atillack

While it looks like the search efficiency (@diogomart please test) might be OK now, overall there does not seem to be an actual speedup if you normalize by the total number of evaluations.

atillack avatar Feb 20 '24 20:02 atillack

> While it looks like the search efficiency (@diogomart please test) might be OK now, overall there does not seem to be an actual speedup if you normalize by the total number of evaluations.

Thanks, I look forward to seeing whether at least the search efficiency is fine.

L30nardoSV avatar Mar 04 '24 13:03 L30nardoSV

I'll get to this soon

diogomart avatar Mar 04 '24 19:03 diogomart