
updating docs to include matrix-vector multiply example

akashkgarg opened this pull request 3 years ago · 5 comments

As I work through how to speed up some of the functionality in the SciML codebases using multiple GPUs, I thought I'd add my small experiments as examples for other users of this package. Comments and feedback are welcome, especially if the example(s) shown could be done better.
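For readers following along, a minimal sketch of what a multi-GPU matrix-vector multiply can look like in CUDA.jl (the `multi_gpu_matvec` helper and the row-wise partitioning are illustrative choices, not the exact code from this PR; it assumes a machine with one or more CUDA devices):

```julia
using CUDA

# Hypothetical helper: split the matrix row-wise across the available
# devices, run the per-block gemv on each GPU from its own task, and
# reassemble the result on the host.
function multi_gpu_matvec(A::Matrix{Float32}, x::Vector{Float32})
    devs = collect(devices())
    m = size(A, 1)
    bounds = round.(Int, range(0, m; length = length(devs) + 1))
    results = Vector{Vector{Float32}}(undef, length(devs))
    @sync for (i, dev) in enumerate(devs)
        Threads.@spawn begin
            device!(dev)                              # bind this task to its GPU
            dA = CuArray(A[bounds[i]+1:bounds[i+1], :])  # upload this device's block
            dx = CuArray(x)                           # each device gets its own copy of x
            results[i] = Array(dA * dx)               # cuBLAS gemv, then download
        end
    end
    reduce(vcat, results)
end

A = rand(Float32, 4096, 4096); x = rand(Float32, 4096)
@assert multi_gpu_matvec(A, x) ≈ A * x
```

Using one Julia task per device is the pattern CUDA.jl's multi-GPU support is built around: `device!` is task-local, so each task can drive its own GPU concurrently.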

akashkgarg avatar May 20 '21 17:05 akashkgarg

Codecov Report

Merging #918 (03192b6) into master (eb7c326) will decrease coverage by 0.00%. The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #918      +/-   ##
==========================================
- Coverage   77.00%   76.99%   -0.01%     
==========================================
  Files         121      121              
  Lines        7706     7708       +2     
==========================================
+ Hits         5934     5935       +1     
- Misses       1772     1773       +1     
Impacted Files Coverage Δ
lib/cusolver/CUSOLVER.jl 82.00% <0.00%> (-1.34%) :arrow_down:

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update eb7c326...03192b6.

codecov[bot] avatar May 20 '21 18:05 codecov[bot]

Nice example! Any idea why the minimum times show a much more pronounced speed-up? It could be that CUDA.@sync does a synchronize(), which only synchronizes the current task. Maybe it should call device_synchronize() instead, but that's fairly costly.
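To make the distinction concrete, a small sketch of the two synchronization strategies when timing GPU work (assuming a CUDA device is available; the broadcast expression is just a stand-in workload):

```julia
using CUDA, BenchmarkTools

x = CUDA.rand(Float32, 2^20)

# CUDA.@sync runs the expression and then calls synchronize(), which only
# waits on streams belonging to the *current* task — work launched from
# other tasks (e.g. on other devices) may still be in flight:
@benchmark CUDA.@sync $x .* 2f0

# device_synchronize() waits for *all* outstanding work on the current
# device, regardless of which task launched it — a stronger but costlier
# barrier, which is why it can change the measured timings:
@benchmark begin
    $x .* 2f0
    device_synchronize()
end
```

In a multi-GPU benchmark one would loop over `devices()`, calling `device!(dev)` and `device_synchronize()` for each, to get a global barrier before stopping the timer.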

maleadt avatar May 21 '21 07:05 maleadt

@maleadt great question. I added device_synchronize() and it does reduce the variance quite a bit. The updated version is probably a more reasonable implementation/benchmark.

akashkgarg avatar May 24 '21 16:05 akashkgarg

@maleadt I added another example that does a reduction over a large array. Surprisingly, the multi-GPU case is significantly slower (although its maximum time is about 1/3 of the single-GPU case). Perhaps there is a better way to partition the data/computation than I'm doing here?
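One way such a partitioned reduction can be sketched (again illustrative, not the PR's exact code: each device sums its own chunk on-GPU and the handful of partial sums are combined on the host):

```julia
using CUDA

# Hypothetical helper: split the array into contiguous chunks, one per
# device; each GPU reduces its chunk with an on-device sum (mapreduce),
# and the per-device partials are added on the CPU.
function multi_gpu_sum(x::Vector{Float32})
    devs = collect(devices())
    bounds = round.(Int, range(0, length(x); length = length(devs) + 1))
    partials = Vector{Float32}(undef, length(devs))
    @sync for (i, dev) in enumerate(devs)
        Threads.@spawn begin
            device!(dev)
            chunk = CuArray(@view x[bounds[i]+1:bounds[i+1]])  # host → device copy
            partials[i] = sum(chunk)                           # on-device reduction
        end
    end
    sum(partials)  # tiny host-side combine
end

x = rand(Float32, 2^24)
@assert isapprox(multi_gpu_sum(x), sum(x); rtol = 1f-3)  # float order differs
```

A plausible explanation for the slowdown: a sum over `N` elements does only O(N) arithmetic, so if the data starts on the host, the PCIe upload dominates and splitting it across devices mostly just adds per-device launch and transfer overhead; the multi-GPU win would be clearer if the chunks already lived on their devices.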

akashkgarg avatar May 25 '21 19:05 akashkgarg

Nice examples!

amontoison avatar Oct 11 '21 22:10 amontoison