
Introduce `AsyncNumber` to lazily copy numeric `mapreduce` results to the host

Open pxl-th opened this issue 1 year ago • 18 comments

Introduce GPUNumber to store the result of mapreduce across all dims (i.e. dims = :) instead of immediately transferring it to the host.

GPUNumber copies its value to the host when it is used as a regular number. But if it is used in broadcasting, it passes along its underlying 1-element array without a device-to-host copy.

This seems to be a minimally breaking change, except for the return type of sum and the like. It uses AbstractNumbers.jl, since that package conveniently implements the whole Number interface.

More context: https://github.com/FluxML/Zygote.jl/issues/1513
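For illustration, a minimal sketch of the idea (not the actual PR code, which builds on AbstractNumbers.jl; the names and method choices here are illustrative):

using GPUArrays: AbstractGPUArray

# Wraps the 1-element device array produced by mapreduce(...; dims = :).
struct GPUNumber{T <: AbstractGPUArray} <: Number
    val::T
end

# Copy to the host only when the value is actually needed as a plain number.
number(g::GPUNumber) = first(Array(g.val))

# For broadcasting, hand back the underlying 1-element array so the
# computation stays on the device (no device-to-host copy).
Base.broadcastable(g::GPUNumber) = g.val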

  • Before:
julia> x = AMDGPU.rand(Float32, 1024);

julia> @btime sum($x);
  64.340 μs (146 allocations: 6.11 KiB)

julia> @btime Zygote.gradient($x) do x
           sum(1f0 .- x)
       end;
  131.617 μs (368 allocations: 16.10 KiB)
  • Now:
julia> @btime sum($x);
  15.809 μs (140 allocations: 5.84 KiB)

julia> @btime Zygote.gradient($x) do x
           sum(1f0 .- x)
       end;
  44.123 μs (356 allocations: 15.57 KiB)

Here's also a timeline profile for the Flux.jl model for the same region. We can see that now there are no device-to-host copies.

Timeline profile (images): before and now.

pxl-th avatar Jul 06 '24 21:07 pxl-th

@maleadt, the remaining CUDA.jl issues are because of method signatures, which can easily be updated to handle the change. I can open the respective PR in CUDA.jl if you think this change is fine.

pxl-th avatar Jul 07 '24 21:07 pxl-th

Surprisingly small change! Still not a fan though :-P How much additional pressure does this put on the GC?

which can easily be updated to handle the change

Only without losing some "type safety", right? IIUC <:Integer arguments have to be widened to <:Number, or similarly <:Real or <:Complex types would get lost.

maleadt avatar Jul 08 '24 09:07 maleadt

To avoid changing method signatures we can just change this line to:

- m=maximum(I), n=maximum(J);
+ m=maximum(I)[], n=maximum(J)[];

IIUC <:Integer arguments have to be widened to <:Number, or similarly <:Real or <:Complex types would get lost.

You mean that the compiler wouldn't be able to infer the type? I think it is type-stable, since we maintain the underlying type in GPUNumber{T} and GPUNumber just forwards operations through a thin layer.

julia> x = AMDGPU.rand(Float32, 16);

julia> f(x::Number) = x + 2
f (generic function with 1 method)

julia> @code_warntype sum(x)
MethodInstance for sum(::ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer})
  from sum(a::AbstractArray; dims, kw...) @ Base reducedim.jl:1010
Arguments
  #self#::Core.Const(sum)
  a::ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}
Body::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}}
1 ─      nothing
│   %2 = Base.:(:)::Core.Const(Colon())
│   %3 = Core.NamedTuple()::Core.Const(NamedTuple())
│   %4 = Base.pairs(%3)::Core.Const(Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}())
│   %5 = Base.:(var"#sum#828")(%2, %4, #self#, a)::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}}
└──      return %5

julia> @code_warntype(f(sum(x)))
MethodInstance for f(::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}})
  from f(x::Number) @ Main REPL[3]:1
Arguments
  #self#::Core.Const(f)
  x::GPUArrays.GPUNumber{ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}}
Body::Float32
1 ─ %1 = (x + 2)::Float32
└──      return %1

pxl-th avatar Jul 08 '24 16:07 pxl-th

How much additional pressure does this put on the GC?

For the Flux model that I use for testing, the machine (which has a single AMD GPU) consistently hangs after several hundred training steps, with or without this change... So I'm not sure it should be a blocker for this PR.

But generally, I really do think that something needs to be done about GC and GPU memory, since this is the biggest blocker for Julia + DL right now, unless you have 2+ GPUs or use a headless server...

pxl-th avatar Jul 08 '24 16:07 pxl-th

IIUC <:Integer arguments have to be widened to <:Number, or similarly <:Real or <:Complex types would get lost.

You mean that the compiler wouldn't be able to infer the type?

I'm not thinking of type stability or inference, but of the ability to restrict method definitions (e.g. for dispatch purposes). If we now have foo(::Real) and foo(::Complex), we lose the ability to distinguish between the two when calling foo(::GPUNumber). Or when you have a method bar(::CuArray{T}, ::T) where {T <: Number} signalling that the (el)types need to match; this would become impossible too.
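A small illustration of that concern, with hypothetical methods (CUDA.jl assumed only for the signatures):

using CUDA

# Dispatch on the numeric subtype is lost: GPUNumber is only <:Number,
# so foo(sum(cu_x)) would need both signatures widened to ::Number.
foo(x::Real)    = "real path"
foo(x::Complex) = "complex path"

# A signature tying the scalar type to the array eltype no longer matches
# either, since sum(A) would now return a GPUNumber instead of a Float32:
bar(A::CuArray{T}, x::T) where {T <: Number} = A .+ x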

The fact that we're now indicating that mapreduce returns a Number also seems wrong to me. It's possible to reduce non-numbers, e.g., https://github.com/JuliaGPU/GPUArrays.jl/blob/80b82267d15a9ae10d90228bb3899b2be68d996e/src/host/mapreduce.jl#L111-L134. I take it this works because getproperty is forwarded to the inner value? That doesn't seem scalable though, as you can't forward everything to the inner value (and with arbitrary values being possible here, it's impossible to tell what to forward).

Not that I wouldn't want this to work, I just want to avoid breaking code. The fact that tests seem relatively OK is a good argument in favor though, and worst case we can always revert what's ultimately just an optimization.

maleadt avatar Jul 08 '24 18:07 maleadt

I take it this works because getproperty is forwarded to the inner value?

No, GPUNumber only inherits the Number interface. For everything else (like that reducer example) the user would need to explicitly transfer the value to the host by calling getindex, i.e. sum(x)[] (or AN.number, as in this PR).

The fact that we're now indicating that mapreduce returns a Number also seems wrong to me too.

Maybe we can have GPUNumber for numbers and some other container for non-Number types?

Or when you have a method bar(::CuArray{T}, ::T) where {T <: Number} signalling that the (el)types need to match; this would become impossible too.

Yeah, that's a consequence of making host transfers more explicit I guess... Maybe we can expose something like (as defined in this PR):

maybe_number(g::GPUNumber) = AN.number(g)
maybe_number(g) = g

So that if the user writes code for both CPU and GPU, they can use maybe_number (or some other name, like materialize) when needed to make sure the value is actually a scalar.
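A hypothetical usage sketch of that idea (maybe_number as defined above; the rest of the names are illustrative):

total = sum(x)               # GPUNumber on the GPU path, a plain Float32 on the CPU
y = x .- total               # broadcasting keeps the value on the device
loss = maybe_number(total)   # explicit host scalar where one is actually required
println("loss = ", loss)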

pxl-th avatar Jul 08 '24 20:07 pxl-th

For everything else (like that reducer example) the user would need to explicitly transfer the value to the host by calling getindex, i.e. sum(x)[]

But that's a breaking (and non-generic) change? So yeah, making this happen only for Numbers where we can be "sure" we implement the number interface transparently seems like the safer choice. But even then I'm still wary that this will break user code, so this will need thorough testing...

maleadt avatar Jul 09 '24 09:07 maleadt

So I've made it behave as usual when the eltype is not a Number, and return GPUNumber otherwise. I'll also do some more testing and benchmarking to see the impact.
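Roughly something like this, assuming the GPUNumber sketch from the description above (the helper name is hypothetical):

using GPUArrays: AbstractGPUArray

# Wrap the 1-element reduction result only when its eltype is a Number;
# otherwise fall back to the usual behavior of copying it to the host.
maybe_lazy(res::AbstractGPUArray{T}) where {T <: Number} = GPUNumber(res)
maybe_lazy(res::AbstractGPUArray) = first(Array(res))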

pxl-th avatar Jul 09 '24 12:07 pxl-th

Also, some bikeshedding: AsyncNumber might be clearer than the otherwise uninformative GPUNumber. Or AsyncValue, if we find a way to generalize this, which I'm not sure we will. It also sounds a bit like a Ref, but that term is pretty overloaded already.

maleadt avatar Jul 09 '24 14:07 maleadt

Here's also a timeline for the Flux.jl model with CUDA.jl, profiling 20 training steps and explicitly avoiding any host transfers (such as visualizing loss values). Before it took ~29 seconds, now ~19 seconds. This includes some other PRs that avoid host transfers:

  • https://github.com/JuliaDiff/ChainRules.jl/pull/801
  • https://github.com/JuliaGPU/GPUArrays.jl/issues/542 (although it still does bounds checking in ChainRules.jl so I disabled it manually)

When running outside of the profiler, one epoch takes ~35-40 minutes (down from 50-55 min), while PyTorch takes ~12 minutes per epoch. So there's still room for improvement (the timeline profile also looks much denser for PyTorch; there are no gaps between GPU kernels).

Timeline (images): before, now, and PyTorch.

pxl-th avatar Jul 11 '24 13:07 pxl-th

Remaining gaps could be GC pauses. Running profiling with GC logging enabled:

GC: pause 366.60ms. collected 5.253326MB. incr
GC: pause 112.91ms. collected 34.399342MB. full recollect
GC: pause 355.50ms. collected 0.455795MB. incr
GC: pause 114.99ms. collected 30.454090MB. full recollect
GC: pause 371.72ms. collected 0.405960MB. incr
GC: pause 118.76ms. collected 30.621845MB. full recollect
GC: pause 359.00ms. collected 5.808136MB. incr
GC: pause 124.97ms. collected 10.381500MB. full recollect
GC: pause 348.96ms. collected 5.253326MB. incr
GC: pause 112.76ms. collected 37.985279MB. full recollect
GC: pause 348.96ms. collected 0.455795MB. incr
GC: pause 111.61ms. collected 33.037785MB. full recollect
GC: pause 349.60ms. collected 0.405960MB. incr

Or they could be synchronizations during host -> device transfers. It could be that PyTorch's data loader prefetches data on a separate worker/stream to overlap with compute (I don't think it uses pinned memory by default).

pxl-th avatar Jul 11 '24 14:07 pxl-th

You could use NVTX.jl to visualize GC pauses, https://juliagpu.github.io/NVTX.jl/dev/#Instrumenting-Julia-internals, and maybe add a couple of annotations to key CUDA.jl functions.
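For reference, a minimal sketch of such instrumentation, assuming the NVTX.jl API described in the linked docs:

using NVTX

# Emit NVTX markers for Julia GC pauses so they show up on the timeline.
NVTX.enable_gc_hooks()

# Wrap a key region (e.g. one training step) in a named range.
NVTX.@range "train step" begin
    # ... forward / backward / optimizer update ...
end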

maleadt avatar Jul 12 '24 09:07 maleadt

You could use NVTX.jl to visualize GC pauses

Timeline (images): default GC threads vs. --gcthreads=4.

So it does look like these gaps are GC pauses. I've also logged maybe_collect, and it almost always corresponds to one of these GC pauses; the pressure when it is triggered is always > 0.99.

pxl-th avatar Jul 12 '24 13:07 pxl-th

Is the GC time spent marking/sweeping in Julia, or are the cuMemFreeAsync calls soaking up the time? I would assume the former, but I've also seen CUDA itself spending significant time in the allocator especially when close to OOM (presumably compacting/defragmenting its heap).

maleadt avatar Jul 12 '24 15:07 maleadt

So it does look like these gaps are GC pauses.

Can you set --gcthreads to be 4 or so?

vchuravy avatar Jul 12 '24 15:07 vchuravy

Is the GC time spent marking/sweeping in Julia, or are the cuMemFreeAsync calls soaking up the time?

The selected green region in the profiler screenshot is where cuMemFreeAsync happens, so I guess the bigger portion of the time is spent on marking/sweeping?


Can you set --gcthreads to be 4 or so?

@vchuravy I've updated the above post with the --gcthreads=4 option. The GC pause time decreased from ~500ms to ~300ms, but over the whole training epoch the improvement is maybe a couple of minutes (one epoch still takes ~40 min to run).

pxl-th avatar Jul 12 '24 16:07 pxl-th

@pxl-th what model are you training? I'm curious how much of the performance gain is due to AsyncNumber vs https://github.com/JuliaDiff/ChainRules.jl/pull/801

vpuri3 avatar Sep 07 '24 17:09 vpuri3

It was HiFi-GAN, and https://github.com/JuliaDiff/ChainRules.jl/pull/801 probably brings the bigger performance gain. I'm also not sure at this point how impactful this PR is in real-world use cases, because in the end the real bottleneck was GC calls when we ran out of memory, as described above: https://github.com/JuliaGPU/GPUArrays.jl/pull/550#issuecomment-2225915738.

pxl-th avatar Sep 07 '24 18:09 pxl-th

Why the close? Did this turn out to hurt GC too much?

maleadt avatar Nov 18 '24 08:11 maleadt

Ah... I accidentally removed my fork of GPUArrays and forgot it had this PR... We should reopen it.

pxl-th avatar Nov 18 '24 08:11 pxl-th

I've invited you to the AMDGPU team and given you access to GPUArrays.jl, so you can just host your branch here if you want.

maleadt avatar Nov 18 '24 09:11 maleadt

Thanks!

pxl-th avatar Nov 18 '24 13:11 pxl-th