CUDA.jl
updating docs to include matrix-vector multiply example
As I work through how to speed up some of the functionality in the SciML codebases using multiple GPUs, I thought I'd add my small experiments as examples for other users of this package. Comments and feedback are welcome if the example(s) shown could be done better.
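For readers who don't want to open the docs diff, here is a minimal sketch of the row-partitioned matrix-vector multiply the example is built around. The function name `multi_gpu_matvec` and the chunking scheme are illustrative assumptions, not the exact code added in the PR:

```julia
using CUDA

# Split y = A*x across all visible GPUs by giving each device a block of rows.
function multi_gpu_matvec(A::Matrix{Float32}, x::Vector{Float32})
    ngpus = length(devices())
    nrows = size(A, 1)
    chunk = cld(nrows, ngpus)
    y = Vector{Float32}(undef, nrows)
    @sync for i in 1:ngpus                  # Base.@sync: wait for the spawned tasks
        rows = (1 + (i - 1) * chunk):min(i * chunk, nrows)
        Threads.@spawn begin
            device!(i - 1)                  # bind this task to GPU i (0-based id)
            dA = CuArray(A[rows, :])        # upload this device's block of rows
            dx = CuArray(x)                 # each device needs its own copy of x
            y[rows] = Array(dA * dx)        # CUBLAS gemv, copy the partial result back
        end
    end
    return y
end
```

Whether this beats a single GPU depends on the matrix being large enough that the per-device gemv outweighs the host-to-device copies and the task overhead.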
Codecov Report
Merging #918 (03192b6) into master (eb7c326) will decrease coverage by 0.00%. The diff coverage is n/a.
@@ Coverage Diff @@
## master #918 +/- ##
==========================================
- Coverage 77.00% 76.99% -0.01%
==========================================
Files 121 121
Lines 7706 7708 +2
==========================================
+ Hits 5934 5935 +1
- Misses 1772 1773 +1
Impacted Files | Coverage Δ | |
---|---|---|
lib/cusolver/CUSOLVER.jl | 82.00% <0.00%> (-1.34%) | :arrow_down: |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb7c326...03192b6.
Nice example! Any idea why the minimum times show a much more pronounced speed-up? It could be because CUDA.@sync does a synchronize(), so it only synchronizes the current task. Maybe it should call device_synchronize() instead, but that's fairly costly.
@maleadt great question. I added device_synchronize() and it does reduce the variance quite a bit. The updated example is probably a more reasonable implementation/benchmark.
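For context, this is roughly how the two synchronization calls differ when timing GPU work; the array sizes and the use of BenchmarkTools below are illustrative, not the code from the docs example:

```julia
using CUDA, BenchmarkTools

A = CUDA.rand(Float32, 8192, 8192)
x = CUDA.rand(Float32, 8192)

# CUDA.@sync waits only for work launched from the current task's stream:
@benchmark CUDA.@sync $A * $x

# device_synchronize() waits for all outstanding work on the device, not just
# the current task's stream, which is heavier but gives more stable timings:
@benchmark begin
    $A * $x
    CUDA.device_synchronize()
end
```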
@maleadt I added another example that does a reduction over a large array. Surprisingly, the multi-GPU case is significantly slower (although its maximum time is about 1/3 that of the single-GPU case). Perhaps there is a better way to partition the data/computation than I'm doing here?
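For reference, a minimal sketch of the kind of per-device partitioning described above; the function name `multi_gpu_sum` and the chunking are illustrative assumptions, not the code in the example:

```julia
using CUDA

# Reduce one large host array by giving each GPU a chunk, summing the chunk on
# the device, and combining the partial sums on the host.
function multi_gpu_sum(x::Vector{Float32})
    ngpus = length(devices())
    len = length(x)
    chunk = cld(len, ngpus)
    partials = Vector{Float32}(undef, ngpus)
    @sync for i in 1:ngpus                  # Base.@sync: wait for the spawned tasks
        rng = (1 + (i - 1) * chunk):min(i * chunk, len)
        Threads.@spawn begin
            device!(i - 1)                  # bind this task to GPU i (0-based id)
            dx = CuArray(x[rng])            # upload this device's chunk
            partials[i] = sum(dx)           # device-side reduction, scalar back to host
        end
    end
    return sum(partials)                    # combine the per-device results
end
```

If the uploads are inside the timed region, the host-to-device copies can easily dominate a memory-bound reduction, which might account for part of the slowdown.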
Nice examples!