AutoDock-GPU

Speeding up sum reductions in ADADELTA by using Tensor Cores

Open L30nardoSV opened this issue 1 year ago • 33 comments

Hi,

This PR aims to increase the performance of the CUDA version by leveraging the Tensor Cores Units (TCU) present in recent NVIDIA GPUs.

The idea is to re-implement the sum reductions as matrix operations (i.e., by using NVIDIA Warp Matrix Functions), which can be offloaded to TCUs.

Experiments on A100 GPU (`make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=80 test`):

| Docking time | Original | Tensor |
|--------------|----------|--------|
| In seconds   | 0.8      | 0.6    |

Experiments on RTX3050Ti GPU (`make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=86 test`):

| Docking time | Original | Tensor |
|--------------|----------|--------|
| In seconds   | 2.4      | 1.7    |

The baseline implementation for this PR has been taken from this paper: Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores. The contribution of both authors, Gabin Schieffer (@gabin-s) and Ivy Peng (@bopkth), is acknowledged in this PR as well:

Schieffer, Gabin, and Peng, Ivy. "Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores."
In European Conference on Parallel Processing, pp. 608-622. Cham: Springer Nature Switzerland, 2023.

L30nardoSV avatar Jan 11 '24 14:01 L30nardoSV

@L30nardoSV Thank you very much, I am currently testing. Please encapsulate the code a bit and make it a compile option so older Cuda versions and cards still compile and run.

atillack avatar Jan 11 '24 16:01 atillack

> @L30nardoSV Thank you very much, I am currently testing. Please encapsulate the code a bit and make it a compile option so older Cuda versions and cards still compile and run.

OK, let me know if the TENSOR directive in commit 10b07fa6a suffices

L30nardoSV avatar Jan 11 '24 18:01 L30nardoSV

@L30nardoSV I tested on one of our Nvidia Quadro RTX A5000 cards and I do see a nice speedup for the 3ce3 example input:

Docking time of the PR with `make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=86 test` is 0.70 seconds vs 0.90 seconds (this does use the heuristics and autostop by default).

To evaluate a bit further I used Diogo's test set of 42 ligands, here are the results:

Reference:

| Path   | NUMWI | AutoStop & Heuristics | overall evals | energy       | rmsd         | docking  | idle   |
|--------|-------|-----------------------|---------------|--------------|--------------|----------|--------|
| OpenCL | 128   | no, 2.5M evals        | 105382018     | 36 / 42 good | 36 / 42 good | 91.69 s  | 0.17 s |
| Cuda   | 128   | no, 2.5M evals        | 105331182     | 36 / 42 good | 36 / 42 good | 90.75 s  | 0.32 s |
| Cuda   | 64    | no, 2.5M evals        | 105404961     | 36 / 42 good | 35 / 42 good | 106.24 s | 7.93 s |
| OpenCL | 128   | yes                   | 84026192      | 37 / 42 good | 37 / 42 good | 187.82 s | 0.21 s |
| Cuda   | 128   | yes                   | 80037847      | 38 / 42 good | 38 / 42 good | 184.85 s | 0.20 s |
| Cuda   | 64    | yes                   | 84684628      | 36 / 42 good | 38 / 42 good | 233.39 s | 8.13 s |

This PR:

| Path   | NUMWI | AutoStop & Heuristics | overall evals | energy       | rmsd         | docking  | idle   |
|--------|-------|-----------------------|---------------|--------------|--------------|----------|--------|
| OpenCL | 128   | no, 2.5M evals        | 105362595     | 38 / 42 good | 36 / 42 good | 92.33 s  | 0.21 s |
| Cuda   | 128   | no, 2.5M evals        | 105177642     | 35 / 42 good | 36 / 42 good | 100.71 s | 0.20 s |
| Cuda   | 64    | no, 2.5M evals        | 105197433     | 35 / 42 good | 38 / 42 good | 112.48 s | 0.19 s |
| OpenCL | 128   | yes                   | 86495325      | 37 / 42 good | 38 / 42 good | 192.71 s | 0.21 s |
| Cuda   | 128   | yes                   | 71419809      | 33 / 42 good | 37 / 42 good | 182.30 s | 0.21 s |
| Cuda   | 64    | yes                   | 65754981      | 34 / 42 good | 37 / 42 good | 214.60 s | 0.22 s |

Across multiple differently sized ligands with the typical settings, it turns out that for larger systems the speedup can turn into a slowdown.

It looks like the average number of evals w/ AutoStop changed in the PR, which could point to a minute difference in the calculation (I did test multiple times to make sure this wasn't just an unlucky run).

@diogomart Please run your E50 tests for the Cuda version.

atillack avatar Jan 11 '24 19:01 atillack

@L30nardoSV Thank you for the encapsulation :-)

atillack avatar Jan 11 '24 19:01 atillack

Unfortunately, algorithmic performance is worse.

[Plot: 79f13c7-ocl-128wi vs PR252-10b07fa-cuda-tensor-128wi-overlap]

diogomart avatar Jan 12 '24 17:01 diogomart

@atillack

Can you please check commit b2ab3fe, which incorporates a WMMA extension for single-precision matmul on Tensor Cores with error correction (TCEC)?

`make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=80 TENSOR=ON TCEC=ON test`

Ref: https://github.com/wmmae/wmma_extension/blob/main/docs/mma_f32.md

L30nardoSV avatar Feb 16 '24 19:02 L30nardoSV

@L30nardoSV I ran the newest version and here are the results (with OpenCL from before as comparison; note: I compiled w/o OVERLAP so the last column takes a bit longer, but compute times are unaffected):

| Path   | NUMWI | AutoStop & Heuristics | overall evals | energy       | rmsd         | docking  | idle    |
|--------|-------|-----------------------|---------------|--------------|--------------|----------|---------|
| OpenCL | 128   | yes                   | 86495325      | 37 / 42 good | 38 / 42 good | 192.71 s | 0.21 s  |
| Cuda   | 128   | yes                   | 88164353      | 36 / 42 good | 38 / 42 good | 194.13 s | 7.74 s  |
| Cuda   | 64    | yes                   | 77884078      | 37 / 42 good | 37 / 42 good | 214.27 s | 21.54 s |

atillack avatar Feb 20 '24 20:02 atillack

While it looks like the search efficiency (@diogomart please test) might be OK now, overall there does not seem to be an actual speedup if you normalize by the total number of evaluations.

atillack avatar Feb 20 '24 20:02 atillack

> While it looks like the search efficiency (@diogomart please test) might be OK now, overall there does not seem to be an actual speedup if you normalize by the total number of evaluations.

Thanks, I look forward to seeing whether at least the search efficiency is fine.

L30nardoSV avatar Mar 04 '24 13:03 L30nardoSV

I'll get to this soon

diogomart avatar Mar 04 '24 19:03 diogomart