Anton Smirnov comments

Results: 213 comments of Anton Smirnov

Introduce `AsyncNumber` to lazily copy numeric `mapreduce` results to the host

To avoid changing method signatures, we can just change [this](https://github.com/JuliaGPU/CUDA.jl/blob/229d13f88fece1bd1dd6422575d61edf1e0cb753/lib/cusparse/conversions.jl#L35) line to:

```diff
- m=maximum(I), n=maximum(J);
+ m=maximum(I)[], n=maximum(J)[];
```

> IIUC
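A minimal sketch of why appending `[]` is safe for existing callers: in Julia, indexing a plain number with `[]` returns the number itself, so the changed line keeps working when `maximum` returns an ordinary integer. The helper name `infer_dims` and the sample index vectors are hypothetical, and how `[]` behaves on the PR's wrapper type is an assumption not shown here.

```julia
# Hypothetical helper mirroring the `m`, `n` defaults from the linked line.
function infer_dims(I, J)
    # For an ordinary number, `x[]` is a no-op that returns `x`,
    # so the extra `[]` does not change the result on the CPU path.
    m = maximum(I)[]
    n = maximum(J)[]
    return m, n
end

# Toy COO-style row and column indices.
dims = infer_dims([3, 1, 4], [2, 7, 1])
println(dims)
```

If the reduction returned a lazy wrapper instead, the same `[]` would presumably be the point where the value is materialized on the host.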

Introduce `AsyncNumber` to lazily copy numeric `mapreduce` results to the host

> How much additional pressure does this put on the GC?

For the Flux model that I use for testing, the machine consistently hangs (machine with a single AMD...

Introduce `AsyncNumber` to lazily copy numeric `mapreduce` results to the host

> I take it this works because getproperty is forwarded to the inner value?

No, `GPUNumber` only inherits the `Number` interface. For everything else (like with that `reducer` example) the user...

Introduce `AsyncNumber` to lazily copy numeric `mapreduce` results to the host

So I've made it behave as usual when the `eltype` is not a `Number`, and return a `GPUNumber` otherwise. I'll also do some more testing and benchmarking to see the impact.
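The dispatch described here can be sketched on the CPU as follows. This is an illustration under stated assumptions, not the PR's implementation: `MockGPUNumber`, `wrap_result` and `lazy_mapreduce` are hypothetical names, and the mock holds a plain host value where the real type would hold device memory.

```julia
# Hypothetical stand-in for the PR's wrapper type; subtyping `Number`
# is what lets it be passed where a number is expected.
struct MockGPUNumber{T<:Number} <: Number
    value::T
end

# `x[]` unwraps the value. In the real PR this is presumably where
# the copy to the host would happen.
Base.getindex(x::MockGPUNumber) = x.value

# Hypothetical helper: wrap numeric results, return everything else as usual.
wrap_result(x::Number) = MockGPUNumber(x)
wrap_result(x) = x

lazy_mapreduce(f, op, xs) = wrap_result(mapreduce(f, op, xs))

# Numeric result: comes back wrapped and is unwrapped with `[]`.
r = lazy_mapreduce(abs, +, [-1, 2, -3])

# Non-numeric result (a tuple): comes back unchanged.
t = lazy_mapreduce(x -> (x, x), (a, b) -> (a[1] + b[1], a[2] + b[2]), [1, 2, 3])

println(r[], " ", t)
```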

Introduce `AsyncNumber` to lazily copy numeric `mapreduce` results to the host

Here's also a timeline for the Flux.jl model with CUDA.jl, profiled over 20 training steps while explicitly avoiding any host transfers, such as visualizing loss values. Before it took ~29 seconds,...

Introduce `AsyncNumber` to lazily copy numeric `mapreduce` results to the host

The remaining gaps could be GC pauses. Running the profiling with GC logging enabled:

```julia
GC: pause 366.60ms. collected 5.253326MB. incr
GC: pause 112.91ms. collected 34.399342MB. full recollect
GC: pause 355.50ms....
```
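Log lines of this shape can be produced with Julia's built-in GC logging. The sketch below assumes Julia 1.8 or newer, where `GC.enable_logging` is available. The `churn` workload is a hypothetical stand-in for the training steps, and the comment does not say how logging was actually enabled in the original run.

```julia
# Assumes Julia >= 1.8, where `GC.enable_logging` is available.
GC.enable_logging(true)          # each collection now prints a line to stderr

# Hypothetical allocation-heavy workload standing in for the training steps.
function churn(n)
    total = 0.0
    for _ in 1:n
        total += sum(rand(10_000))   # allocates a fresh temporary array each iteration
    end
    return total
end

stats = @timed churn(2_000)      # `@timed` also reports the time spent in GC
GC.gc()                          # force a collection so at least one line is logged
GC.enable_logging(false)

println("elapsed: ", stats.time, " s, of which GC: ", stats.gctime, " s")
```

Comparing `stats.gctime` with `stats.time` gives a rough idea of how much of a run is spent in collection before reaching for a full profiler timeline.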

Introduce `AsyncNumber` to lazily copy numeric `mapreduce` results to the host

> You could use NVTX.jl to visualize GC pauses

||Timeline|
|-|-|
|default gc threads|![image](https://github.com/user-attachments/assets/2c45c0f5-927c-4592-9c2e-b92da0e4392b)|
|`--gcthreads=4`|![image](https://github.com/user-attachments/assets/fda780f5-32af-4864-a15b-ca8a42422973)|

So it does look like these gaps are GC pauses. I've also logged `maybe_collect` and...

Introduce `AsyncNumber` to lazily copy numeric `mapreduce` results to the host

> Is the GC time spent marking/sweeping in Julia, or are the cuMemFreeAsync calls soaking up the time?

The selected green region is where `cuMemFreeAsync` happens, so I guess the bigger...

Introduce `AsyncNumber` to lazily copy numeric `mapreduce` results to the host

It was [HiFi-GAN](https://arxiv.org/abs/2010.05646), and https://github.com/JuliaDiff/ChainRules.jl/pull/801 probably brings a bigger performance gain. And I'm not sure at this point how impactful this PR is in real-world use cases, because in the end...

Introduce `AsyncNumber` to lazily copy numeric `mapreduce` results to the host

Ah... I accidentally removed my fork of `GPUArrays` and forgot it had this PR... We should reopen it.