
Introduce `AsyncNumber` to lazily copy numeric `mapreduce` results to the host

Open pxl-th opened this issue 1 year ago • 18 comments

Introduce GPUNumber to store the result of mapreduce across all dims (i.e. dims = :) instead of immediately transferring it to the host.

GPUNumber copies its value to the host when it is used as a regular number. But if it is used in broadcasting, it passes along its underlying 1-element array without a device-to-host copy.

This seems to be a minimally breaking change, except for the return type of sum and the like. It uses AbstractNumbers.jl, since that package conveniently implements the whole Number interface.

More context: https://github.com/FluxML/Zygote.jl/issues/1513
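For illustration, a minimal sketch of the idea (not the actual PR code, which builds on AbstractNumbers.jl; the names and method choices here are illustrative):

using GPUArrays: AbstractGPUArray

# Wraps the 1-element device array produced by mapreduce(...; dims = :).
struct GPUNumber{T <: AbstractGPUArray} <: Number
    val::T
end

# Copy to the host only when the value is actually needed as a plain number.
number(g::GPUNumber) = first(Array(g.val))

# For broadcasting, hand back the underlying 1-element array so the
# computation stays on the device (no device-to-host copy).
Base.broadcastable(g::GPUNumber) = g.val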

  • Before:
julia> x = AMDGPU.rand(Float32, 1024);

julia> @btime sum($x);
  64.340 μs (146 allocations: 6.11 KiB)

julia> @btime Zygote.gradient($x) do x
           sum(1f0 .- x)
       end;
  131.617 μs (368 allocations: 16.10 KiB)
  • Now:
julia> @btime sum($x);
  15.809 μs (140 allocations: 5.84 KiB)

julia> @btime Zygote.gradient($x) do x
           sum(1f0 .- x)
       end;
  44.123 μs (356 allocations: 15.57 KiB)

Here's also a timeline profile for the Flux.jl model for the same region. We can see that now there are no device-to-host copies.

Timeline profile (images): before and now.

pxl-th avatar Jul 06 '24 21:07 pxl-th

@maleadt, the remaining CUDA.jl issues are because of method signatures, which can easily be updated to handle the change. I can open the respective PR in CUDA.jl if you think this change is fine.

pxl-th avatar Jul 07 '24 21:07 pxl-th

Surprisingly small change! Still not a fan though :-P How much additional pressure does this put on the GC?

which can easily be updated to handle the change

Only without losing some "type safety", right? IIUC <:Integer arguments have to be widened to <:Number, or similarly <:Real or <:Complex types would get lost.

maleadt avatar Jul 08 '24 09:07 maleadt

To avoid changing method signatures we can just change this line to:

- m=maximum(I), n=maximum(J);
+ m=maximum(I)[], n=maximum(J)[];

IIUC <:Integer arguments have to be widened to <:Number, or similarly <:Real or <:Complex types would get lost.

You mean that the compiler wouldn't be able to infer the type? I think it is type-stable, since we maintain the underlying type in GPUNumber{T} and GPUNumber just forwards operations through a thin layer.

julia> x = AMDGPU.rand(Float32, 16);

julia> f(x::Number) = x + 2
f (generic function with 1 method)

julia> @code_warntype sum(x)
MethodInstance for sum(::ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer})
  from sum(a::AbstractArray; dims, kw...) @ Base reducedim.jl:1010
Arguments
  #self#::Core.Const(sum)
  a::ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}
Body::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}}
1 ─      nothing
│   %2 = Base.:(:)::Core.Const(Colon())
│   %3 = Core.NamedTuple()::Core.Const(NamedTuple())
│   %4 = Base.pairs(%3)::Core.Const(Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}())
│   %5 = Base.:(var"#sum#828")(%2, %4, #self#, a)::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}}
└──      return %5

julia> @code_warntype(f(sum(x)))
MethodInstance for f(::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}})
  from f(x::Number) @ Main REPL[3]:1
Arguments
  #self#::Core.Const(f)
  x::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}}
Body::Float32
1 ─ %1 = (x + 2)::Float32
└──      return %1

pxl-th avatar Jul 08 '24 16:07 pxl-th

How much additional pressure does this put on the GC?

For the Flux model that I use for testing, the machine (which has a single AMD GPU) consistently hangs after several hundred training steps, with or without this change... So I'm not sure it should be a blocker for this PR.

But generally, I really do think that something needs to be done about GC and GPU memory, since this is the biggest blocker for Julia + DL right now, unless you have 2+ GPUs or use a headless server...

pxl-th avatar Jul 08 '24 16:07 pxl-th

IIUC <:Integer arguments have to be widened to <:Number, or similarly <:Real or <:Complex types would get lost.

You mean that the compiler wouldn't be able to infer the type?

I'm not thinking of type stability or inference, but of the ability to restrict method definitions (e.g. for dispatch purposes). If we now have foo(::Real) and foo(::Complex), we lose the ability to distinguish between the two when calling foo(::GPUNumber). Or when you have a method bar(::CuArray{T}, ::T) where {T <: Number} signalling that the (el)types need to match; this would become impossible too.
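A small illustration of that concern, with hypothetical methods (CUDA.jl assumed only for the signatures):

using CUDA

# Dispatch on the numeric subtype is lost: GPUNumber is only <:Number,
# so foo(sum(cu_x)) would need both signatures widened to ::Number.
foo(x::Real)    = "real path"
foo(x::Complex) = "complex path"

# A signature tying the scalar type to the array eltype no longer matches
# either, since sum(A) would now return a GPUNumber instead of a Float32:
bar(A::CuArray{T}, x::T) where {T <: Number} = A .+ x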

The fact that we're now indicating that mapreduce returns a Number also seems wrong to me. It's possible to reduce non-numbers, e.g., https://github.com/JuliaGPU/GPUArrays.jl/blob/80b82267d15a9ae10d90228bb3899b2be68d996e/src/host/mapreduce.jl#L111-L134. I take it this works because getproperty is forwarded to the inner value? That doesn't seem scalable though, as you can't forward everything to the inner value (and with arbitrary values being possible here, it's impossible to tell what to forward).

Not that I wouldn't want this to work, I just want to avoid breaking code. The fact that tests seem relatively OK is a good argument in favor though, and worst case we can always revert what's ultimately just an optimization.

maleadt avatar Jul 08 '24 18:07 maleadt

I take it this works because getproperty is forwarded to the inner value?

No, GPUNumber only inherits the Number interface. For everything else (like that reducer example) the user would need to explicitly transfer the value to the host by calling getindex, i.e. sum(x)[] (or AN.number, as in this PR).

The fact that we're now indicating that mapreduce returns a Number also seems wrong to me too.

Maybe we can have GPUNumber for numbers and some other container for non-Number types?

Or when you have a method bar(::CuArray{T}, ::T) where {T <: Number} signalling that the (el)types need to match; this would become impossible too.

Yeah, that's a consequence of making host transfers more explicit I guess... Maybe we can expose something like (as defined in this PR):

maybe_number(g::GPUNumber) = AN.number(g)
maybe_number(g) = g

So that if the user writes code for both CPU and GPU, they can use maybe_number (or some other name, like materialize) when needed to make sure the value is actually a scalar.
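A hypothetical usage sketch of that idea (maybe_number as defined above; the rest of the names are illustrative):

total = sum(x)               # GPUNumber on the GPU path, a plain Float32 on the CPU
y = x .- total               # broadcasting keeps the value on the device
loss = maybe_number(total)   # explicit host scalar where one is actually required
println("loss = ", loss)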

pxl-th avatar Jul 08 '24 20:07 pxl-th

For everything else (like that reducer example) the user would need to explicitly transfer the value to the host by calling getindex, i.e. sum(x)[]

But that's a breaking (and non-generic) change? So yeah, making this happen only for Numbers where we can be "sure" we implement the number interface transparently seems like the safer choice. But even then I'm still wary that this will break user code, so this will need thorough testing...

maleadt avatar Jul 09 '24 09:07 maleadt

So I've made it behave as usual when the eltype is not a Number, and return GPUNumber otherwise. I'll also do some more testing and benchmarking to see the impact.
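Roughly something like this, assuming the GPUNumber sketch from the description above (the helper name is hypothetical):

using GPUArrays: AbstractGPUArray

# Wrap the 1-element reduction result only when its eltype is a Number;
# otherwise fall back to the usual behavior of copying it to the host.
maybe_lazy(res::AbstractGPUArray{T}) where {T <: Number} = GPUNumber(res)
maybe_lazy(res::AbstractGPUArray) = first(Array(res))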

pxl-th avatar Jul 09 '24 12:07 pxl-th

Also, some bikeshedding: AsyncNumber might be clearer than the otherwise uninformative GPUNumber. Or AsyncValue, if we find a way to generalize this, which I'm not sure we will. It also sounds a bit like a Ref, but that term is pretty overloaded already.

maleadt avatar Jul 09 '24 14:07 maleadt

Here's also a timeline for the Flux.jl model with CUDA.jl, profiling 20 training steps and explicitly avoiding any host transfers (such as visualizing loss values). Before it took ~29 seconds, now ~19 seconds. This includes some other PRs that avoid host transfers:

  • https://github.com/JuliaDiff/ChainRules.jl/pull/801
  • https://github.com/JuliaGPU/GPUArrays.jl/issues/542 (although it still does bounds checking in ChainRules.jl so I disabled it manually)

When running outside of the profiler, one epoch takes ~35-40 minutes (down from 50-55 min), while PyTorch takes ~12 minutes per epoch. So there's still room for improvement (the timeline profile also looks much denser for PyTorch; there are no gaps between GPU kernels).

Timeline (images): before, now, and PyTorch.

pxl-th avatar Jul 11 '24 13:07 pxl-th

Remaining gaps could be GC pauses. Running profiling with GC logging enabled:

GC: pause 366.60ms. collected 5.253326MB. incr
GC: pause 112.91ms. collected 34.399342MB. full recollect
GC: pause 355.50ms. collected 0.455795MB. incr
GC: pause 114.99ms. collected 30.454090MB. full recollect
GC: pause 371.72ms. collected 0.405960MB. incr
GC: pause 118.76ms. collected 30.621845MB. full recollect
GC: pause 359.00ms. collected 5.808136MB. incr
GC: pause 124.97ms. collected 10.381500MB. full recollect
GC: pause 348.96ms. collected 5.253326MB. incr
GC: pause 112.76ms. collected 37.985279MB. full recollect
GC: pause 348.96ms. collected 0.455795MB. incr
GC: pause 111.61ms. collected 33.037785MB. full recollect
GC: pause 349.60ms. collected 0.405960MB. incr

Or they could be synchronizations during host -> device transfers. It could be that PyTorch's data loader prefetches data on a separate worker/stream to overlap with compute (I don't think it uses pinned memory by default).

pxl-th avatar Jul 11 '24 14:07 pxl-th

You could use NVTX.jl to visualize GC pauses, https://juliagpu.github.io/NVTX.jl/dev/#Instrumenting-Julia-internals, and maybe add a couple of annotations to key CUDA.jl functions.
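For reference, a minimal sketch of such instrumentation, assuming the NVTX.jl API described in the linked docs:

using NVTX

# Emit NVTX markers for Julia GC pauses so they show up on the timeline.
NVTX.enable_gc_hooks()

# Wrap a key region (e.g. one training step) in a named range.
NVTX.@range "train step" begin
    # ... forward / backward / optimizer update ...
end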

maleadt avatar Jul 12 '24 09:07 maleadt

You could use NVTX.jl to visualize GC pauses

Timeline (images): default GC threads vs. --gcthreads=4.

So it does look like these gaps are GC pauses. I've also logged maybe_collect, and it almost always corresponds to one of these GC pauses; the pressure when it is triggered is always > 0.99.

pxl-th avatar Jul 12 '24 13:07 pxl-th

Is the GC time spent marking/sweeping in Julia, or are the cuMemFreeAsync calls soaking up the time? I would assume the former, but I've also seen CUDA itself spending significant time in the allocator especially when close to OOM (presumably compacting/defragmenting its heap).

maleadt avatar Jul 12 '24 15:07 maleadt

So it does look like these gaps are GC pauses.

Can you set --gcthreads to be 4 or so?

vchuravy avatar Jul 12 '24 15:07 vchuravy

Is the GC time spent marking/sweeping in Julia, or are the cuMemFreeAsync calls soaking up the time?

The selected green region in the profiler screenshot is where cuMemFreeAsync happens, so I guess the bigger portion of the time is spent on marking/sweeping?


Can you set --gcthreads to be 4 or so?

@vchuravy I've updated the above post with the --gcthreads=4 option. The GC pause time decreased from ~500ms to ~300ms, but over the whole training epoch the improvement is maybe a couple of minutes (one epoch still takes ~40 min to run).

pxl-th avatar Jul 12 '24 16:07 pxl-th

@pxl-th what model are you training? I'm curious how much of the performance gain is due to AsyncNumber vs https://github.com/JuliaDiff/ChainRules.jl/pull/801

vpuri3 avatar Sep 07 '24 17:09 vpuri3

It was HiFi-GAN, and https://github.com/JuliaDiff/ChainRules.jl/pull/801 probably brings the bigger performance gain. I'm also not sure at this point how impactful this PR is in real-world use cases, because in the end the real bottleneck was GC calls when we ran out of memory, as described above: https://github.com/JuliaGPU/GPUArrays.jl/pull/550#issuecomment-2225915738.

pxl-th avatar Sep 07 '24 18:09 pxl-th

Why the close? Did this turn out to hurt GC too much?

maleadt avatar Nov 18 '24 08:11 maleadt

Ah... I accidentally removed my fork of GPUArrays and forgot it had this PR... We should reopen it.

pxl-th avatar Nov 18 '24 08:11 pxl-th

I've invited you to the AMDGPU team and given you access to GPUArrays.jl, so you can just host your branch here if you want.

maleadt avatar Nov 18 '24 09:11 maleadt

Thanks!

pxl-th avatar Nov 18 '24 13:11 pxl-th