DiffEqGPU.jl
EnsembleGPUArray performance vs EnsembleSerial
Hi! Using the Lorenz example in the README, EnsembleGPUArray seems to be running quite a bit slower than all other methods, including EnsembleSerial. On my machine I get:
using DiffEqGPU, OrdinaryDiffEq
function lorenz(du,u,p,t)
    du[1] = p[1]*(u[2]-u[1])
    du[2] = u[1]*(p[2]-u[3]) - u[2]
    du[3] = u[1]*u[2] - p[3]*u[3]
end
u0 = Float32[1.0;0.0;0.0]
tspan = (0.0f0,100.0f0)
p = [10.0f0,28.0f0,8/3f0]
prob = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,p=rand(Float32,3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy=false)
@time sol = solve(monteprob,Tsit5(),EnsembleSerial(),trajectories=10_000,saveat=1.0f0)
# 8.197300 seconds (21.42 M allocations: 1.551 GiB, 5.59% gc time)
@time sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10_000,saveat=1.0f0)
# 45.863792 seconds (118.46 M allocations: 7.534 GiB, 4.07% gc time, 8.85% compilation time)
Currently on DiffEqGPU v1.16.0 and OrdinaryDiffEq v6.6.6. The GPU is an NVIDIA Quadro T2000 with CUDA 11.6.
It looks like you might be hitting a lot of compilation time? I am not sure if @time counts GPUCompiler compilation time in its reporting of "8.85% compilation time".
Exchanging @time for @btime, I get:
using BenchmarkTools
@btime sol = solve(monteprob,Tsit5(),EnsembleSerial(),trajectories=10_000,saveat=1.0f0)
# 688.809 ms (1524808 allocations: 149.89 MiB)
@btime sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10_000,saveat=1.0f0)
# 434.094 ms (1304064 allocations: 880.95 MiB)
My GPU is a 2060.
I am not sure why @time is used in the README.
Looks like that accounts for a lot of it, though on my machine EnsembleGPUArray is still slower:
@btime sol = solve(monteprob,Tsit5(),EnsembleSerial(),trajectories=10_000,saveat=1.0f0)
# 1.086 s (2544793 allocations: 201.55 MiB)
@btime sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10_000,saveat=1.0f0)
# 1.449 s (1559235 allocations: 895.31 MiB)
For another comparison, I tried running the example multi-GPU script (with CUDA replacing CuArrays) on a machine with two GV100s and got the same kind of performance difference:
using DiffEqGPU, CUDA, OrdinaryDiffEq, Test, BenchmarkTools
CUDA.device!(0)
using Distributed
addprocs(2)
@everywhere using DiffEqGPU, CUDA, OrdinaryDiffEq, Test, Random
@everywhere begin
    function lorenz_distributed(du,u,p,t)
        du[1] = p[1]*(u[2]-u[1])
        du[2] = u[1]*(p[2]-u[3]) - u[2]
        du[3] = u[1]*u[2] - p[3]*u[3]
    end
    CUDA.allowscalar(false)
    u0 = Float32[1.0;0.0;0.0]
    tspan = (0.0f0,100.0f0)
    p = [10.0f0,28.0f0,8/3f0]
    Random.seed!(1)
    pre_p_distributed = [rand(Float32,3) for i in 1:100_000]
    function prob_func_distributed(prob,i,repeat)
        remake(prob,p=pre_p_distributed[i].*p)
    end
end
@sync begin
    @spawnat 2 begin
        CUDA.allowscalar(false)
        CUDA.device!(0)
    end
    @spawnat 3 begin
        CUDA.allowscalar(false)
        CUDA.device!(1)
    end
end
CUDA.allowscalar(false)
prob = ODEProblem(lorenz_distributed,u0,tspan,p)
monteprob = EnsembleProblem(prob, prob_func = prob_func_distributed)
@btime sol = solve(monteprob,Tsit5(),EnsembleSerial(),trajectories=100_000,batch_size=50_000,saveat=1.0f0)
# 14.605 s (26457532 allocations: 2.08 GiB)
@btime sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=100_000,batch_size=50_000,saveat=1.0f0)
# 104.737 s (189837890 allocations: 38.78 GiB)
> It looks like you might be hitting a lot of compilation time? I am not sure if @time counts GPUCompiler compilation time in its reporting of "8.85% compilation time".
It doesn't.
But note that the current setup isn't great, so we're building a new one that's better for non-stiff ODEs.
Got it, and thanks for the responses!
Also, I noticed that both EnsembleSerial and EnsembleGPUArray cause my GPU to jump to ~13% utilization. Is that normal? I would expect the utilization to be much higher for EnsembleGPUArray.
EnsembleSerial doesn't use the GPU unless you're using GPU arrays. Utilization depends on how packed the kernels are: with the current version, you want something like 100,000 trajectories and a big enough ODE to pack the kernels. That's why we're building a different one that's a lot less limited.
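As a rough illustration of what packing means here (the trajectory count below is hypothetical, not a measured benchmark): each EnsembleGPUArray batch launches one kernel over all trajectories at once, so larger ensembles give each launch more parallel work to fill the device.
# Hypothetical: more trajectories per batch means more parallel work per kernel launch.
@btime sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=100_000,saveat=1.0f0)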
EnsembleGPUKernel is a lot faster, so that's the one to make use of.
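For reference, here is a minimal EnsembleGPUKernel sketch along the lines of the DiffEqGPU docs. It requires an out-of-place, StaticArrays-based problem definition and the kernel-level solvers such as GPUTsit5; note that the exact constructor (e.g. whether EnsembleGPUKernel takes a backend argument like CUDA.CUDABackend()) varies across DiffEqGPU versions.
using DiffEqGPU, CUDA, OrdinaryDiffEq, StaticArrays

# EnsembleGPUKernel compiles the entire solver into one GPU kernel, so the ODE
# must be defined out-of-place with static arrays.
function lorenz_oop(u,p,t)
    du1 = p[1]*(u[2]-u[1])
    du2 = u[1]*(p[2]-u[3]) - u[2]
    du3 = u[1]*u[2] - p[3]*u[3]
    return SVector{3}(du1,du2,du3)
end

u0 = @SVector [1.0f0,0.0f0,0.0f0]
tspan = (0.0f0,100.0f0)
p = @SVector [10.0f0,28.0f0,8/3f0]
prob = ODEProblem{false}(lorenz_oop,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,p=(@SVector rand(Float32,3)).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

# GPUTsit5 is the kernel-level Tsit5; on recent DiffEqGPU versions the backend
# is passed explicitly, e.g. EnsembleGPUKernel(CUDA.CUDABackend()).
sol = solve(monteprob,GPUTsit5(),EnsembleGPUKernel(CUDA.CUDABackend()),
    trajectories=10_000,saveat=1.0f0)
Because the whole solve runs inside the kernel, the per-trajectory launch and batching overhead of the array-based approach goes away, which is why it fares much better on small ODEs like Lorenz.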