AutoDock-GPU
Speeding up sum reductions in ADADELTA by using Tensor Cores
Hi,
This PR aims to increase the performance of the CUDA version by leveraging the Tensor Core Units (TCUs) present in recent NVIDIA GPUs.
The idea is to re-implement the sum reductions as matrix operations (i.e., using NVIDIA Warp Matrix Functions), which can be offloaded to the TCUs.
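For illustration, here is a minimal sketch of the underlying trick (not the PR's actual kernel; the function name and shared-memory layout are assumptions): with A a 16x16 matrix of ones, a single `mma_sync` computes C = A*V, whose first row contains the 16 column sums of the data tile V, so a warp can sum 256 values with one tensor-core operation plus a short serial finish.

```cuda
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// One warp sums 256 half-precision values with a single tensor-core MMA.
// With A = ones(16x16) and V the data tile, C = A*V replicates each column
// sum of V across all 16 rows, so row 0 of C holds 16 partial sums that a
// short serial loop finishes off. Requires compute capability >= 7.0.
// `data` and `scratch` are assumed to point into shared memory; all 32
// threads of the warp must call this together.
__device__ float warp_sum256_tensor(const half* data, float* scratch)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> ones;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> vals;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(ones, __float2half(1.0f)); // A: all-ones matrix
    wmma::load_matrix_sync(vals, data, 16);        // V: 256 values as a 16x16 tile
    wmma::fill_fragment(acc, 0.0f);
    wmma::mma_sync(acc, ones, vals, acc);          // C = A*V, fp32 accumulate

    wmma::store_matrix_sync(scratch, acc, 16, wmma::mem_row_major);
    __syncwarp();

    float sum = 0.0f;
    for (int j = 0; j < 16; ++j)                   // add the 16 column sums
        sum += scratch[j];
    return sum;
}
```

Note that the accumulator stays in fp32 even though the inputs are half, so only the inputs to the reduction pass through a float-to-half conversion.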
Experiments on an A100 GPU (`make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=80 test`):
Docking time | Original | Tensor |
---|---|---|
In seconds | 0.8 | 0.6 |
Experiments on an RTX 3050 Ti GPU (`make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=86 test`):
Docking time | Original | Tensor |
---|---|---|
In seconds | 2.4 | 1.7 |
The baseline implementation for this PR is taken from the paper *Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores*. The contributions of both authors, Gabin Schieffer (@gabin-s) and Ivy Peng (@bopkth), are acknowledged in this PR as well:
Schieffer, Gabin, and Ivy Peng. "Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores."
In *European Conference on Parallel Processing*, pp. 608-622. Cham: Springer Nature Switzerland, 2023.
@L30nardoSV Thank you very much, I am currently testing. Please encapsulate the code a bit and make it a compile option so older Cuda versions and cards still compile and run.
OK, let me know if the `TENSOR` directive in commit 10b07fa6a suffices.
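For readers following along, the requested encapsulation amounts to a compile-time guard along these lines (a sketch only; `energy`, `partial_energies`, and `scratch` are illustrative names, and the actual commit may differ), assuming the Makefile passes `-DTENSOR` only when building with `TENSOR=ON`:

```cuda
// Hypothetical shape of the guard; `energy`, `partial_energies`, and
// `scratch` are illustrative names, not the actual commit's identifiers.
#ifdef TENSOR
    // Tensor-core path: requires CUDA >= 9 and compute capability >= 7.0
    energy = warp_sum256_tensor(partial_energies, scratch);
#else
    // Portable fallback: classic warp-shuffle tree reduction
    for (int offset = 16; offset > 0; offset >>= 1)
        energy += __shfl_down_sync(0xffffffff, energy, offset);
#endif
```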
@L30nardoSV I tested on one of our Nvidia Quadro RTX A5000 cards and I do see a nice speedup for the 3ce3 example input:
the docking time of the PR with `make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=86 test` is 0.70 seconds vs 0.90 seconds (this uses the heuristics and AutoStop by default).
To evaluate a bit further I used Diogo's test set of 42 ligands, here are the results:
Reference:
Path | NUMWI | AutoStop & Heuristics | overall evals | energy | rmsd | docking | idle |
---|---|---|---|---|---|---|---|
OpenCL | 128 | no, 2.5M evals | 105382018 | 36 / 42 good | 36 / 42 good | 91.69 s | 0.17 s |
Cuda | 128 | no, 2.5M evals | 105331182 | 36 / 42 good | 36 / 42 good | 90.75 s | 0.32 s |
Cuda | 64 | no, 2.5M evals | 105404961 | 36 / 42 good | 35 / 42 good | 106.24 s | 7.93 s |
OpenCL | 128 | yes | 84026192 | 37 / 42 good | 37 / 42 good | 187.82 s | 0.21 s |
Cuda | 128 | yes | 80037847 | 38 / 42 good | 38 / 42 good | 184.85 s | 0.20 s |
Cuda | 64 | yes | 84684628 | 36 / 42 good | 38 / 42 good | 233.39 s | 8.13 s |
This PR:
Path | NUMWI | AutoStop & Heuristics | overall evals | energy | rmsd | docking | idle |
---|---|---|---|---|---|---|---|
OpenCL | 128 | no, 2.5M evals | 105362595 | 38 / 42 good | 36 / 42 good | 92.33 s | 0.21 s |
Cuda | 128 | no, 2.5M evals | 105177642 | 35 / 42 good | 36 / 42 good | 100.71 s | 0.20 s |
Cuda | 64 | no, 2.5M evals | 105197433 | 35 / 42 good | 38 / 42 good | 112.48 s | 0.19 s |
OpenCL | 128 | yes | 86495325 | 37 / 42 good | 38 / 42 good | 192.71 s | 0.21 s |
Cuda | 128 | yes | 71419809 | 33 / 42 good | 37 / 42 good | 182.30 s | 0.21 s |
Cuda | 64 | yes | 65754981 | 34 / 42 good | 37 / 42 good | 214.60 s | 0.22 s |
For multiple differently sized ligands with the typical settings, it turns out that for larger systems the speedup can turn into a slowdown.
It also looks like the average number of evals with AutoStop changed in this PR, which could point to a minute difference in the calculation (I did test multiple times to make sure this wasn't just an unlucky run).
@diogomart Please run your E50 tests for the Cuda version.
@L30nardoSV Thank you for the encapsulation :-)
Unfortunately, algorithmic performance is worse.
@atillack
Can you please check commit b2ab3fe, which incorporates the WMMA extension for single-precision matmul on Tensor Cores plus error correction (TCEC)?
`make DEVICE=GPU TESTLS=ad NUMWI=64 TARGETS=80 TENSOR=ON TCEC=ON test`
Ref: https://github.com/wmmae/wmma_extension/blob/main/docs/mma_f32.md
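For context, the error-correction scheme behind TCEC splits each fp32 value into a half-precision main part plus a half-precision residual and replays the lost cross terms on the tensor cores. The sketch below illustrates only the idea, not wmma_extension's actual API:

```cuda
#include <cuda_fp16.h>

// Illustration of the TCEC idea only, not wmma_extension's API: split a
// float into a half "hi" part plus a half residual "lo" so that hi + lo
// approximates x to near-single precision.
__device__ void split_fp32(float x, half& hi, half& lo)
{
    hi = __float2half(x);                    // coarse half approximation
    lo = __float2half(x - __half2float(hi)); // residual lost to rounding
}

// A product x*y is then evaluated on the tensor cores (fp32 accumulate) as
//   x*y ~= hi_x*hi_y + hi_x*lo_y + lo_x*hi_y,
// recovering most of the mantissa bits a plain half conversion discards.
```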
@L30nardoSV I ran the newest version and here are the results (with the OpenCL numbers from before as comparison; note: I compiled without OVERLAP, so the last column takes a bit longer, but compute times are unaffected):
Path | NUMWI | AutoStop & Heuristics | overall evals | energy | rmsd | docking | idle |
---|---|---|---|---|---|---|---|
OpenCL | 128 | yes | 86495325 | 37 / 42 good | 38 / 42 good | 192.71 s | 0.21 s |
Cuda | 128 | yes | 88164353 | 36 / 42 good | 38 / 42 good | 194.13 s | 7.74 s |
Cuda | 64 | yes | 77884078 | 37 / 42 good | 37 / 42 good | 214.27 s | 21.54 s |
While it looks like the search efficiency (@diogomart please test) might be OK now, overall there does not seem to be an actual speedup once you normalize by the total number of evaluations.
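For example, at NUMWI=64 this PR comes in at 214.27 s / 77,884,078 evals ≈ 2.75 µs per eval, versus 233.39 s / 84,684,628 evals ≈ 2.76 µs per eval for the reference Cuda run, i.e. essentially unchanged.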
Thanks, I look forward to seeing whether the search efficiency is fine at least.
I'll get to this soon.