
Out of dynamic GPU memory in EnsembleGPUKernel for higher numbers of threads when using ContinuousCallback

Open martin-abrudsky opened this issue 2 years ago • 6 comments

Hello, I was testing the new updates to terminate! with EnsembleGPUKernel. It works fine with DiscreteCallback; however, when using ContinuousCallback I still hit the problem: out of dynamic GPU memory in EnsembleGPUKernel for higher numbers of threads. I attach the code used:

using StaticArrays
using CUDA
using DiffEqGPU
using NPZ
using OrdinaryDiffEq
using Plots

"""
     pot_central(u,p,t)
     u=[x,dx,y,dy]
     p=[k,m]
"""
function pot_central(u,p,t)
      r3 = ( u[1]^2 + u[3]^2 )^(3/2)
     du1 = u[2]                           # u[2]= dx
     du2 =  -( p[1]*u[1] ) / ( p[2]*r3 )    
     du3 = u[4]                           # u[4]= dy
     du4 =  -( p[1]*u[3] ) / ( p[2]*r3 ) 

     return SVector{4}(du1,du2,du3,du4)
end

T = 100.0
k = 1.0
m = 1.0
trajectories = 5_000
u_rand = convert(Array{Float64}, npzread("IO_GPU/IO_u0.npy"))

u0    = @SVector [2.0, 2.0, 1.0, 1.5]
p     = @SVector [k, m]
tspan = (0.0, T)

prob = ODEProblem{false}(pot_central, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, u0 = SVector{4}(u_rand[i, :]) .* u0 + @SVector [1.0, 1.0, 1.0, 1.0])
Ensemble_Problem = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)


function condition(u, t, integrator)
    R2 = @SVector [4.5, 5_000.0]        # R2 = [Rmin2, Rmax2]
    r2 = u[1] * u[1] + u[3] * u[3]
    (R2[2] - r2) * (r2 - R2[1])         # positive inside the annulus, so the root marks a boundary crossing
end

affect!(integrator) = terminate!(integrator)
gpu_cb = ContinuousCallback(condition, affect!;
                            save_positions = (false, false), rootfind = true,
                            interp_points = 0, abstol = 1e-7, reltol = 0)
# gpu_cb = DiscreteCallback(condition, affect!; save_positions = (false, false))

CUDA.@time sol = solve(Ensemble_Problem,
                       GPUTsit5(),
                       #GPUVern7(),
                       #GPUVern9(),
                       EnsembleGPUKernel(),
                       trajectories = trajectories,
                       batch_size = 10_000,
                       adaptive = false,
                       dt = 0.01,
                       save_everystep = false,
                       callback = gpu_cb,
                       merge_callbacks = true)
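[Editor's note: as a purely illustrative aside, the sign convention of the `condition` above can be sanity-checked on the CPU; the helper below is hypothetical and not part of the original report.]

```julia
# Hypothetical CPU-side check of the callback's sign convention: the product
# (Rmax2 - r2) * (r2 - Rmin2) is positive inside the annulus Rmin2 < r2 < Rmax2
# and negative outside, so ContinuousCallback root-finds at a boundary crossing.
Rmin2, Rmax2 = 4.5, 5_000.0
cond(u) = (Rmax2 - (u[1]^2 + u[3]^2)) * ((u[1]^2 + u[3]^2) - Rmin2)

cond((2.0, 0.0, 2.0, 0.0))    # r2 = 8     -> positive (inside the annulus)
cond((1.0, 0.0, 1.0, 0.0))    # r2 = 2     -> negative (below Rmin2)
cond((80.0, 0.0, 10.0, 0.0))  # r2 = 6500  -> negative (above Rmax2)
```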

martin-abrudsky avatar Jan 09 '23 15:01 martin-abrudsky

What GPU? A100? Is it just the memory scaling? Is it fine with a higher dt?

ChrisRackauckas avatar Jan 09 '23 15:01 ChrisRackauckas

The GPU is an A30. This is the error that comes out for trajectories = 50_000 and dt = 0.1:

ERROR: Out of dynamic GPU memory (trying to allocate 912 bytes)
ERROR: Out of dynamic GPU memory (trying to allocate 912 bytes)
ERROR: Out of dynamic GPU memory (trying to allocate 912 bytes)
...
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
...
ERROR: a (null) was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
...
Excessive output truncated after 542774 bytes.
KernelException: exception thrown during kernel execution on device NVIDIA A30

Stacktrace:
  [1] check_exceptions()
    @ CUDA ~/.julia/packages/CUDA/Ey3w2/src/compiler/exceptions.jl:34
  [2] synchronize(stream::CuStream; blocking::Nothing)
    @ CUDA ~/.julia/packages/CUDA/Ey3w2/lib/cudadrv/stream.jl:134
  [3] synchronize
    @ ~/.julia/packages/CUDA/Ey3w2/lib/cudadrv/stream.jl:121 [inlined]
  [4] (::CUDA.var"#185#186"{SVector{4, Float64}, Matrix{SVector{4, Float64}}, Int64, CuArray{SVector{4, Float64}, 2, CUDA.Mem.DeviceBuffer}, Int64, Int64})()
    @ CUDA ~/.julia/packages/CUDA/Ey3w2/src/array.jl:420
  [5] #context!#63
    @ ~/.julia/packages/CUDA/Ey3w2/lib/cudadrv/state.jl:164 [inlined]
  [6] context!
    @ ~/.julia/packages/CUDA/Ey3w2/lib/cudadrv/state.jl:159 [inlined]
  [7] unsafe_copyto!(dest::Matrix{SVector{4, Float64}}, doffs::Int64, src::CuArray{SVector{4, Float64}, 2, CUDA.Mem.DeviceBuffer}, soffs::Int64, n::Int64)
    @ CUDA ~/.julia/packages/CUDA/Ey3w2/src/array.jl:406
  [8] copyto!
    @ ~/.julia/packages/CUDA/Ey3w2/src/array.jl:360 [inlined]
  [9] copyto!
    @ ~/.julia/packages/CUDA/Ey3w2/src/array.jl:364 [inlined]
 [10] copyto_axcheck!(dest::Matrix{SVector{4, Float64}}, src::CuArray{SVector{4, Float64}, 2, CUDA.Mem.DeviceBuffer})
    @ Base ./abstractarray.jl:1127
 [11] Array
    @ ./array.jl:626 [inlined]
...
    @ ~/.julia/packages/CUDA/Ey3w2/src/utilities.jl:25 [inlined]
 [18] top-level scope
    @ ~/.julia/packages/CUDA/Ey3w2/src/pool.jl:490 [inlined]
 [19] top-level scope
    @ ~/FAMAF/Beca_CIN_Trabajo_Final/skymap/GPU_Julia/pot_central_GPU_Float64.ipynb:0

martin-abrudsky avatar Jan 09 '23 16:01 martin-abrudsky

Smaller batches or higher dt? Did you calculate out the batch memory size requirement?
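[Editor's note: a rough, purely illustrative back-of-envelope sketch of that calculation; the 912-byte figure is taken from the error log above, and all variable names here are made up.]

```julia
# Hypothetical per-batch memory estimate (illustrative numbers only).
# The per-thread dynamic allocation of 912 bytes comes from the error
# message "trying to allocate 912 bytes"; one SVector{4,Float64} state
# occupies 4 * 8 = 32 bytes.
state_bytes      = 4 * 8        # bytes per saved state
batch_size       = 10_000
per_thread_alloc = 912          # bytes, from the error log

solution_buf_mib = batch_size * 2 * state_bytes / 2^20   # initial + final state
dynamic_pool_mib = batch_size * per_thread_alloc / 2^20

println("solution buffers ≈ ", round(solution_buf_mib, digits = 2), " MiB")
println("dynamic allocations ≈ ", round(dynamic_pool_mib, digits = 2), " MiB")
```

Both totals are tiny compared to an A30's memory, which may suggest the failure is about the limited device-side dynamic-allocation heap rather than total GPU memory.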

ChrisRackauckas avatar Jan 09 '23 16:01 ChrisRackauckas

For trajectories = 5_000 and dt = 0.1, the first time I ran the code it worked, but the second time I got the error.

Using DiscreteCallback, I tested it with trajectories = 10_000_000 and dt = 0.01 and it works fine. In version 1.24 of the library I had the same error.

martin-abrudsky avatar Jan 09 '23 17:01 martin-abrudsky

It also fails for trajectories = 5_000, dt = 0.1, and batch_size = 1_000.

martin-abrudsky avatar Jan 09 '23 18:01 martin-abrudsky

This happens due to an allocation within a kernel (with StaticArrays code this is typically due to escape analysis going wrong). You can spot it by prefixing the code that launches kernels with @device_code_llvm dump_module=true and looking for calls to @gpu_gc_pool_alloc or @gpu_malloc:

julia> @device_code_llvm dump_module=true solve(Ensemble_Problem,
                                       GPUTsit5(),
                                       #GPUVern7(),
                                       #GPUVern9(),
                                       EnsembleGPUKernel(),
                                       trajectories = trajectories,
                                       batch_size = 10_000,
                                       adaptive = false,
                                       dt = 0.01,
                                       save_everystep = false,
                                       callback = gpu_cb,
                                       merge_callbacks = true
                                       )
;  @ /home/tim/Julia/depot/packages/DiffEqGPU/JlHvl/src/perform_step/gpu_tsit5_perform_step.jl:85 within `tsit5_kernel`
; ┌ @ /home/tim/Julia/depot/packages/DiffEqGPU/JlHvl/src/integrators/types.jl:320 within `gputsit5_init`
; │┌ @ /home/tim/Julia/depot/packages/DiffEqGPU/JlHvl/src/integrators/types.jl:13 within `GPUTsit5Integrator`
    %31 = call fastcc {}* @gpu_gc_pool_alloc([1 x i64] %state, i64 912)

maleadt avatar Jan 09 '23 18:01 maleadt