DiffEqGPU.jl GPU Parallel Ensemble Simulations with generic linear algebra

Hello,

I'm trying to solve and ODE with my GPU. I tested the example code present in the DiffEqGPU.js repository, and it worked fine.

I have the following function to implement in the ODE Problem

function drho_dt(rho, p, t)
    A, w_l= p
    return ( L + A * sin(w_l * t) * L_t ) * rho
end

or equivalently

function drho_dt(drho, rho, p, t)
    A, w_l = p
    mul!(drho, L + A * sin(w_l * t) * L_t, rho)
    return nothing
end

where A and w_l are two numbers, while L and L_t are two global matrices. Following the documentation, I used the following code to run a Parallel Ensemble:

A = 0.05
w_l = 1.0

p = [A, w_l]

tspan = (0.0, 10.0 / gam_q)

w_l_l = LinRange{Float64}(0.1, 1.9, 10)

function prob_func(prob,i,repeat)
  remake(prob,p=[prob.p[1], w_l_l[i]])
end

prob = ODEProblem(drho_dt, rho0_vec, tspan, p)
ensemble_prob = EnsembleProblem(prob, prob_func=prob_func, safetycopy=false)
@time sim = solve(ensemble_prob, BS3(), METHOD!!, trajectories=length(w_l_l))

where METHOD!! can be one of the methods, e.g. EnsembleSerial(), EnsembleThreads(), EnsembleCPUArray() or EnsembleGPUArray().

The problem is divided in two cases:

When L and L_t are sparse matrices. In this case all the methods work, except for EnsembleGPUArray(). In this case the error is

InvalidIRError: compiling kernel gpu_gpu_kernel_oop(Cassette.Context{nametype(CUDACtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_gpu_kernel_oop), typeof(drho_dt), CuDeviceMatrix{ComplexF64, 1}, CuDeviceMatrix{ComplexF64, 1}, CuDeviceMatrix{Float64, 1}, Float64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] overdub
   @ ./In[8]:62
 [2] macro expansion
   @ ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:25
 [3] overdub
   @ ~/.julia/packages/KernelAbstractions/Yy47c/src/macros.jl:80
 [4] overdub
   @ ~/.julia/packages/Cassette/N5kbV/src/overdub.jl:0
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] macro expansion
   @ ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:27
 [2] overdub
   @ ~/.julia/packages/KernelAbstractions/Yy47c/src/macros.jl:80
 [3] overdub
   @ ~/.julia/packages/Cassette/N5kbV/src/overdub.jl:0

Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(Cassette.overdub), Tuple{Cassette.Context{nametype(CUDACtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_gpu_kernel_oop), typeof(drho_dt), CuDeviceMatrix{ComplexF64, 1}, CuDeviceMatrix{ComplexF64, 1}, CuDeviceMatrix{Float64, 1}, Float64}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/wHjqZ/src/validation.jl:111
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/wHjqZ/src/driver.jl:319 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/ZQ0rt/src/TimerOutput.jl:236 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/wHjqZ/src/driver.jl:317 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/wHjqZ/src/utils.jl:62
  [6] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/02Kjq/src/compiler/execution.jl:317
  [7] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/wHjqZ/src/cache.jl:89
  [8] cufunction(f::typeof(Cassette.overdub), tt::Type{Tuple{Cassette.Context{nametype(CUDACtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_gpu_kernel_oop), typeof(drho_dt), CuDeviceMatrix{ComplexF64, 1}, CuDeviceMatrix{ComplexF64, 1}, CuDeviceMatrix{Float64, 1}, Float64}}; name::String, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/02Kjq/src/compiler/execution.jl:288
  [9] macro expansion
    @ ~/.julia/packages/CUDA/02Kjq/src/compiler/execution.jl:102 [inlined]
 [10] (::KernelAbstractions.Kernel{CUDAKernels.CUDADevice, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(DiffEqGPU.gpu_gpu_kernel_oop)})(::Function, ::Vararg{Any, N} where N; ndrange::Int64, dependencies::CUDAKernels.CudaEvent, workgroupsize::Int64, progress::Function)
    @ CUDAKernels ~/.julia/packages/CUDAKernels/8wtKq/src/CUDAKernels.jl:194
 [11] SciML/DifferentialEquations.jl#55
    @ ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:414 [inlined]
 [12] ODEFunction
    @ ~/.julia/packages/SciMLBase/oTP8b/src/scimlfunctions.jl:334 [inlined]
 [13] initialize!(integrator::OrdinaryDiffEq.ODEIntegrator{Tsit5, true, CuArray{ComplexF64, 2}, Nothing, Float64, CuArray{Float64, 2}, Float64, Float64, Float64, Float64, Vector{CuArray{ComplexF64, 2}}, ODESolution{ComplexF64, 3, Vector{CuArray{ComplexF64, 2}}, Nothing, Nothing, Vector{Float64}, Vector{Vector{CuArray{ComplexF64, 2}}}, ODEProblem{CuArray{ComplexF64, 2}, Tuple{Float64, Float64}, true, CuArray{Float64, 2}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, Tsit5, OrdinaryDiffEq.InterpolationData{ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Vector{CuArray{ComplexF64, 2}}, Vector{Float64}, Vector{Vector{CuArray{ComplexF64, 2}}}, OrdinaryDiffEq.Tsit5Cache{CuArray{ComplexF64, 2}, CuArray{ComplexF64, 2}, CuArray{ComplexF64, 2}, OrdinaryDiffEq.Tsit5ConstantCache{Float64, Float64}}}, DiffEqBase.DEStats}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, OrdinaryDiffEq.Tsit5Cache{CuArray{ComplexF64, 2}, CuArray{ComplexF64, 2}, CuArray{ComplexF64, 2}, OrdinaryDiffEq.Tsit5ConstantCache{Float64, Float64}}, OrdinaryDiffEq.DEOptions{Float64, Float64, Float64, Float64, PIController{Rational{Int64}}, typeof(DiffEqGPU.diffeqgpunorm), typeof(opnorm), Nothing, CallbackSet{Tuple{}, Tuple{}}, typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), DiffEqGPU.var"#12#18", DataStructures.BinaryMinHeap{Float64}, DataStructures.BinaryMinHeap{Float64}, Nothing, Nothing, Int64, Tuple{}, Tuple{}, Tuple{}}, CuArray{ComplexF64, 2}, ComplexF64, Nothing, OrdinaryDiffEq.DefaultInit}, cache::OrdinaryDiffEq.Tsit5Cache{CuArray{ComplexF64, 2}, CuArray{ComplexF64, 2}, CuArray{ComplexF64, 2}, OrdinaryDiffEq.Tsit5ConstantCache{Float64, Float64}})
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/a5qvy/src/perform_step/low_order_rk_perform_step.jl:623
 [14] __init(prob::ODEProblem{CuArray{ComplexF64, 2}, Tuple{Float64, Float64}, true, CuArray{Float64, 2}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, alg::Tsit5, timeseries_init::Tuple{}, ts_init::Tuple{}, ks_init::Tuple{}, recompile::Type{Val{true}}; saveat::Tuple{}, tstops::Tuple{}, d_discontinuities::Tuple{}, save_idxs::Nothing, save_everystep::Bool, save_on::Bool, save_start::Bool, save_end::Nothing, callback::Nothing, dense::Bool, calck::Bool, dt::Float64, dtmin::Nothing, dtmax::Float64, force_dtmin::Bool, adaptive::Bool, gamma::Rational{Int64}, abstol::Nothing, reltol::Nothing, qmin::Rational{Int64}, qmax::Int64, qsteady_min::Int64, qsteady_max::Int64, beta1::Nothing, beta2::Nothing, qoldinit::Rational{Int64}, controller::Nothing, fullnormalize::Bool, failfactor::Int64, maxiters::Int64, internalnorm::typeof(DiffEqGPU.diffeqgpunorm), internalopnorm::typeof(opnorm), isoutofdomain::typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), unstable_check::DiffEqGPU.var"#12#18", verbose::Bool, timeseries_errors::Bool, dense_errors::Bool, advance_to_tstop::Bool, stop_at_next_tstop::Bool, initialize_save::Bool, progress::Bool, progress_steps::Int64, progress_name::String, progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), userdata::Nothing, allow_extrapolation::Bool, initialize_integrator::Bool, alias_u0::Bool, alias_du0::Bool, initializealg::OrdinaryDiffEq.DefaultInit, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/a5qvy/src/solve.jl:456
 [15] #__solve#466
    @ ~/.julia/packages/OrdinaryDiffEq/a5qvy/src/solve.jl:4 [inlined]
 [16] #solve_call#58
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:61 [inlined]
 [17] solve_up(prob::ODEProblem{CuArray{ComplexF64, 2}, Tuple{Float64, Float64}, true, CuArray{Float64, 2}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, sensealg::Nothing, u0::CuArray{ComplexF64, 2}, p::CuArray{Float64, 2}, args::Tsit5; kwargs::Base.Iterators.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:unstable_check, :callback, :merge_callbacks, :internalnorm), Tuple{DiffEqGPU.var"#12#18", Nothing, Bool, typeof(DiffEqGPU.diffeqgpunorm)}}})
    @ DiffEqBase ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:82
 [18] #solve#59
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:70 [inlined]
 [19] batch_solve_up(ensembleprob::EnsembleProblem{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(prob_func), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, probs::Vector{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}}, alg::Tsit5, ensemblealg::EnsembleGPUArray, I::UnitRange{Int64}, u0::Matrix{ComplexF64}, p::Matrix{Float64}; kwargs::Base.Iterators.Pairs{Symbol, DiffEqGPU.var"#12#18", Tuple{Symbol}, NamedTuple{(:unstable_check,), Tuple{DiffEqGPU.var"#12#18"}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:319
 [20] batch_solve(ensembleprob::EnsembleProblem{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(prob_func), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::Tsit5, ensemblealg::EnsembleGPUArray, I::UnitRange{Int64}; kwargs::Base.Iterators.Pairs{Symbol, DiffEqGPU.var"#12#18", Tuple{Symbol}, NamedTuple{(:unstable_check,), Tuple{DiffEqGPU.var"#12#18"}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:284
 [21] macro expansion
    @ ./timing.jl:287 [inlined]
 [22] __solve(ensembleprob::EnsembleProblem{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(prob_func), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::Tsit5, ensemblealg::EnsembleGPUArray; trajectories::Int64, batch_size::Int64, unstable_check::Function, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:201
 [23] #solve#61
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:96 [inlined]
 [24] top-level scope
    @ ./timing.jl:210 [inlined]
 [25] top-level scope
    @ ./In[9]:0
 [26] eval
    @ ./boot.jl:360 [inlined]
 [27] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1094

When L and L_t are dense matrices. In this case not even EnsembleCPUArray() works, but the others yes. The error for the GPU method should be the same. While the error for the CPU one is

TaskFailedException

    nested task error: conversion to pointer not defined for Base.Experimental.Const{ComplexF64, 2}
    Stacktrace:
      [1] error(::String)
        @ ./error.jl:33 [inlined]
      [2] overdub
        @ ./error.jl:33 [inlined]
      [3] unsafe_convert(::Type{Ptr{ComplexF64}}, ::Base.Experimental.Const{ComplexF64, 2})
        @ ./pointer.jl:67 [inlined]
      [4] overdub
        @ ./pointer.jl:67 [inlined]
      [5] unsafe_convert(::Type{Ptr{ComplexF64}}, ::SubArray{ComplexF64, 1, Base.Experimental.Const{ComplexF64, 2}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})
        @ ./subarray.jl:427 [inlined]
      [6] overdub
        @ ./subarray.jl:427 [inlined]
      [7] overdub
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/blas.jl:704 [inlined]
      [8] overdub
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:544 [inlined]
      [9] mul!(::Vector{ComplexF64}, ::Matrix{ComplexF64}, ::SubArray{ComplexF64, 1, Base.Experimental.Const{ComplexF64, 2}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, ::Bool, ::Bool)
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:66 [inlined]
     [10] overdub
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:66 [inlined]
     [11] mul!(::Vector{ComplexF64}, ::Matrix{ComplexF64}, ::SubArray{ComplexF64, 1, Base.Experimental.Const{ComplexF64, 2}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:275 [inlined]
     [12] overdub
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:275 [inlined]
     [13] overdub
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:51 [inlined]
     [14] overdub(::Cassette.Context{nametype(CPUCtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.NoDynamicCheck, CartesianIndex{1}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, ::typeof(*), ::Matrix{ComplexF64}, ::SubArray{ComplexF64, 1, Base.Experimental.Const{ComplexF64, 2}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})
        @ Cassette ~/.julia/packages/Cassette/N5kbV/src/overdub.jl:0
     [15] overdub
        @ ./In[11]:62 [inlined]
     [16] macro expansion
        @ ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:25 [inlined]
     [17] overdub
        @ ~/.julia/packages/KernelAbstractions/Yy47c/src/macros.jl:265 [inlined]
     [18] __thread_run(tid::Int64, len::Int64, rem::Int64, obj::KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(DiffEqGPU.cpu_gpu_kernel_oop)}, ndrange::Tuple{Int64}, iterspace::KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}, args::Tuple{typeof(drho_dt), Matrix{ComplexF64}, Matrix{ComplexF64}, Matrix{Float64}, Float64}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck)
        @ KernelAbstractions ~/.julia/packages/KernelAbstractions/Yy47c/src/cpu.jl:157
     [19] __run(obj::KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(DiffEqGPU.cpu_gpu_kernel_oop)}, ndrange::Tuple{Int64}, iterspace::KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}, args::Tuple{typeof(drho_dt), Matrix{ComplexF64}, Matrix{ComplexF64}, Matrix{Float64}, Float64}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck)
        @ KernelAbstractions ~/.julia/packages/KernelAbstractions/Yy47c/src/cpu.jl:130
     [20] (::KernelAbstractions.var"#33#34"{Tuple{KernelAbstractions.NoneEvent}, Nothing, typeof(KernelAbstractions.__run), Tuple{KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(DiffEqGPU.cpu_gpu_kernel_oop)}, Tuple{Int64}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}, Tuple{typeof(drho_dt), Matrix{ComplexF64}, Matrix{ComplexF64}, Matrix{Float64}, Float64}, KernelAbstractions.NDIteration.NoDynamicCheck}})()
        @ KernelAbstractions ~/.julia/packages/KernelAbstractions/Yy47c/src/cpu.jl:22

Stacktrace:
  [1] wait
    @ ./task.jl:322 [inlined]
  [2] wait
    @ ~/.julia/packages/KernelAbstractions/Yy47c/src/cpu.jl:65 [inlined]
  [3] wait
    @ ~/.julia/packages/KernelAbstractions/Yy47c/src/cpu.jl:64 [inlined]
  [4] (::DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)})(du::Matrix{ComplexF64}, u::Matrix{ComplexF64}, p::Matrix{Float64}, t::Float64)
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:414
  [5] ODEFunction
    @ ~/.julia/packages/SciMLBase/oTP8b/src/scimlfunctions.jl:334 [inlined]
  [6] initialize!(integrator::OrdinaryDiffEq.ODEIntegrator{Tsit5, true, Matrix{ComplexF64}, Nothing, Float64, Matrix{Float64}, Float64, Float64, Float64, Float64, Vector{Matrix{ComplexF64}}, ODESolution{ComplexF64, 3, Vector{Matrix{ComplexF64}}, Nothing, Nothing, Vector{Float64}, Vector{Vector{Matrix{ComplexF64}}}, ODEProblem{Matrix{ComplexF64}, Tuple{Float64, Float64}, true, Matrix{Float64}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, Tsit5, OrdinaryDiffEq.InterpolationData{ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Vector{Matrix{ComplexF64}}, Vector{Float64}, Vector{Vector{Matrix{ComplexF64}}}, OrdinaryDiffEq.Tsit5Cache{Matrix{ComplexF64}, Matrix{ComplexF64}, Matrix{ComplexF64}, OrdinaryDiffEq.Tsit5ConstantCache{Float64, Float64}}}, DiffEqBase.DEStats}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, OrdinaryDiffEq.Tsit5Cache{Matrix{ComplexF64}, Matrix{ComplexF64}, Matrix{ComplexF64}, OrdinaryDiffEq.Tsit5ConstantCache{Float64, Float64}}, OrdinaryDiffEq.DEOptions{Float64, Float64, Float64, Float64, PIController{Rational{Int64}}, typeof(DiffEqGPU.diffeqgpunorm), typeof(opnorm), Nothing, CallbackSet{Tuple{}, Tuple{}}, typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), DiffEqGPU.var"#12#18", DataStructures.BinaryMinHeap{Float64}, DataStructures.BinaryMinHeap{Float64}, Nothing, Nothing, Int64, Tuple{}, Tuple{}, Tuple{}}, Matrix{ComplexF64}, ComplexF64, Nothing, OrdinaryDiffEq.DefaultInit}, cache::OrdinaryDiffEq.Tsit5Cache{Matrix{ComplexF64}, Matrix{ComplexF64}, Matrix{ComplexF64}, OrdinaryDiffEq.Tsit5ConstantCache{Float64, Float64}})
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/a5qvy/src/perform_step/low_order_rk_perform_step.jl:623
  [7] __init(prob::ODEProblem{Matrix{ComplexF64}, Tuple{Float64, Float64}, true, Matrix{Float64}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, alg::Tsit5, timeseries_init::Tuple{}, ts_init::Tuple{}, ks_init::Tuple{}, recompile::Type{Val{true}}; saveat::Tuple{}, tstops::Tuple{}, d_discontinuities::Tuple{}, save_idxs::Nothing, save_everystep::Bool, save_on::Bool, save_start::Bool, save_end::Nothing, callback::Nothing, dense::Bool, calck::Bool, dt::Float64, dtmin::Nothing, dtmax::Float64, force_dtmin::Bool, adaptive::Bool, gamma::Rational{Int64}, abstol::Nothing, reltol::Nothing, qmin::Rational{Int64}, qmax::Int64, qsteady_min::Int64, qsteady_max::Int64, beta1::Nothing, beta2::Nothing, qoldinit::Rational{Int64}, controller::Nothing, fullnormalize::Bool, failfactor::Int64, maxiters::Int64, internalnorm::typeof(DiffEqGPU.diffeqgpunorm), internalopnorm::typeof(opnorm), isoutofdomain::typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), unstable_check::DiffEqGPU.var"#12#18", verbose::Bool, timeseries_errors::Bool, dense_errors::Bool, advance_to_tstop::Bool, stop_at_next_tstop::Bool, initialize_save::Bool, progress::Bool, progress_steps::Int64, progress_name::String, progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), userdata::Nothing, allow_extrapolation::Bool, initialize_integrator::Bool, alias_u0::Bool, alias_du0::Bool, initializealg::OrdinaryDiffEq.DefaultInit, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/a5qvy/src/solve.jl:456
  [8] #__solve#466
    @ ~/.julia/packages/OrdinaryDiffEq/a5qvy/src/solve.jl:4 [inlined]
  [9] #solve_call#58
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:61 [inlined]
 [10] solve_up(prob::ODEProblem{Matrix{ComplexF64}, Tuple{Float64, Float64}, true, Matrix{Float64}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, sensealg::Nothing, u0::Matrix{ComplexF64}, p::Matrix{Float64}, args::Tsit5; kwargs::Base.Iterators.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:unstable_check, :callback, :merge_callbacks, :internalnorm), Tuple{DiffEqGPU.var"#12#18", Nothing, Bool, typeof(DiffEqGPU.diffeqgpunorm)}}})
    @ DiffEqBase ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:82
 [11] #solve#59
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:70 [inlined]
 [12] batch_solve_up(ensembleprob::EnsembleProblem{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(prob_func), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, probs::Vector{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}}, alg::Tsit5, ensemblealg::EnsembleCPUArray, I::UnitRange{Int64}, u0::Matrix{ComplexF64}, p::Matrix{Float64}; kwargs::Base.Iterators.Pairs{Symbol, DiffEqGPU.var"#12#18", Tuple{Symbol}, NamedTuple{(:unstable_check,), Tuple{DiffEqGPU.var"#12#18"}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:319
 [13] batch_solve(ensembleprob::EnsembleProblem{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(prob_func), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::Tsit5, ensemblealg::EnsembleCPUArray, I::UnitRange{Int64}; kwargs::Base.Iterators.Pairs{Symbol, DiffEqGPU.var"#12#18", Tuple{Symbol}, NamedTuple{(:unstable_check,), Tuple{DiffEqGPU.var"#12#18"}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:284
 [14] macro expansion
    @ ./timing.jl:287 [inlined]
 [15] __solve(ensembleprob::EnsembleProblem{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(prob_func), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::Tsit5, ensemblealg::EnsembleCPUArray; trajectories::Int64, batch_size::Int64, unstable_check::Function, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:201
 [16] #solve#61
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:96 [inlined]
 [17] top-level scope
    @ ./timing.jl:210 [inlined]
 [18] top-level scope
    @ ./In[11]:0
 [19] eval
    @ ./boot.jl:360 [inlined]
 [20] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1094

Jul 13 '21 01:07 albertomercurio

Yes, I don't think it can generate generic linear algebra calls.

Jul 13 '21 01:07 ChrisRackauckas

Did you try using StaticArrays?

Jul 13 '21 01:07 ChrisRackauckas

Did you try using StaticArrays?

I'm using very large matrices (usually 1500x1500), this is the reason i use sparse matrices. However, also with a 100x100 matrix it tooks a lot of time to convert L to a Static Array

SMatrix{N, N}(Matrix(L))

Jul 13 '21 11:07 albertomercurio

That makes no sense to solve with EnsembleGPUArray. Read the DiffEqGPU documentation on the two forms of GPU problems. You're using the form for 100 ODEs or less, not the one for 1000 ODEs or more.

https://github.com/SciML/DiffEqGPU.jl#within-method-gpu-parallelism-with-direct-cuarray-usage

That's more applicable here.

Jul 13 '21 11:07 ChrisRackauckas

Do you mean that I need to convert L and L_t into CuArray and than evolve each ODE serially with EnsembleSerial()? If it is the case, can i use EnsembleThreads() to parallelize that CUDA ODE again?

Jul 13 '21 13:07 albertomercurio

That would evolve the ODE in parallel on the GPU. You want to use EnsembleSerial unless you have multiple GPUs which you would want to use at the same time.

Jul 13 '21 18:07 ChrisRackauckas

Ok perfect. Thank you for your help.

However I have the following question. I'm yet able to perform this ODE directly on the GPU, using the differential function

L = CUDA.CUSPARSE.CuSparseMatrixCSR( L )
L_t = CUDA.CUSPARSE.CuSparseMatrixCSR( L_t )

function f(rho, p, t)
    A, w_l = p
    rho_tmp = similar(rho0_vec)
    return L * rho + CUDA.CUSPARSE.mv!('N', A * sin(w_l * t), L_t, rho, 0.0, rho_tmp, '0')
end

where this time L and L_t are CuSparse matrices, and they don't take up a lot of memory as the dense matrix form. This method is a little bit faster than the CPU sparse one, but of course slower if I parallelize the CPU method in the Ensemble Simulation. Whit this CUDA sparse method the GPU is loaded only up to the 10%, and so the memory. If I could be able to parallelize all the ODEs in the ensemble I think it would get a lot faster than the EnsembleThreads() method.

Why there is no way to implement this method for an EnsembleGPUArray() simulation?

Jul 14 '21 14:07 albertomercurio

To GPU it like this, you'd want to expand it out to its scalar form with something like ModelingToolkit, in order to then do the GPU codegen. EnsembleGPUArray is all about kernel code generation, so it's not using CUDA kernels like those from CUSPARSE since it's really made to generate nonlinear kernels.

Jul 19 '21 22:07 ChrisRackauckas

DiffEqGPU.jl DiffEqGPU.jl copied to clipboard

GPU Parallel Ensemble Simulations with generic linear algebra

DiffEqGPU.jl
DiffEqGPU.jl copied to clipboard