DiffEqGPU.jl
DiffEqGPU.jl copied to clipboard
`EnsembleGPUKernel `+ Texture Memory Support
@ChrisRackauckas @utkarsh530
Per our conversation here is an example demonstrating how texture memory interpolation could be use.
I would like to be able to leverage CUDA.jl's texture memory support for interpolation of data in the EOM and/or in a callback. A use case could be dropping a ball in a wind field with ground impact termination for a non-flat terrain. Here, one would want to interpolate the wind field as a function of state in the eom as a forcing term and an elevation map as a function of altitude.
Below is an initial prototype. This includes a CPU implementation that leverages DataInterpolations.jl to demonstrate the functionality desired using this data forecast.txt I also included an initial non-working prototype using texture memory.
No interpolation
Working model for CPU and GPU w/o interpolation
using CUDA, DiffEqGPU, OrdinaryDiffEq, Plots, Serialization, StaticArrays, Distributions, LinearAlgebra
import DataInterpolations
const DI = DataInterpolations
# Ballistic model
function ballistic(u, p, t, weather)
CdS, mass, g = p
vel = @view u[4:6]
wind, ρ = get_weather(weather, u[3])
airvelocity = vel - wind
airspeed = norm(airvelocity)
accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)
return SVector{6}(vel..., accel...)
end
# constant wind and airdensity.
function get_weather(weather, z)
wind = SVector{3}(1f0, 0f0, 0f0)
ρ = 1.27f0
wind, ρ
end
trajectories = 100
u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0]
tspan = (0.0f0, 50.0f0)
saveat = LinRange(tspan..., 100)
p = @SVector [25f0, 225f0, 9.807f0]
CdS_dist = Normal(0f0, 1f0)
eom_nowind = (u, p, t) -> ballistic(u,p,t,nothing)
prob = ODEProblem{false}(eom_nowind, u0, tspan, p)
sol = solve(prob, Tsit5())
prob_func = (prob, i, repeat) -> remake(prob, p = p + SVector{3}(rand(CdS_dist), 0f0, 0f0))
eprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
esol = solve(eprob, Tsit5(); trajectories, saveat)
esol_gpu = solve(eprob, GPUTsit5(), EnsembleGPUKernel(); trajectories, saveat)
DataInterpolations.jl CPU Example
This demonstrates the basic capability I would like to replicate in w/ EnsembleGPUKernel
using CUDA.CuTexture
# Ballistic w/ Interpolated winds via DataInterpolations.jl
data = deserialize("forecast.txt")
N = length(data.altitude)
weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ
SVector{4}(alt, wx, wy, ρ)
end
interp = DI.LinearInterpolation(weather_sa,data.altitude)
function get_weather(itp::DI.LinearInterpolation, z)
weather = itp(u[3])
wind = SVector{3}(weather[2], weather[3], 0f0)
ρ = weather[4]
wind, ρ
end
eom_di = (u,p,t) -> ballistic(u,p,t,interp)
prob = ODEProblem{false}(eom_di, u0, tspan, p)
sol = solve(prob, Tsit5())
prob_func = (prob, i, repeat) -> remake(prob, p = p + SVector{3}(rand(dist), 0f0, 0f0))
eprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
esol = solve(eprob, Tsit5(); trajectories, saveat)
GPU Texture Interpolation Validation
Demonstrate usage of CuTexture for interpolation. Note, here I index into the texture memory by broadcasting over a CuArray{Float32}
of indices via dst_gpu .= getindex.(Ref(texture), idx_tlu)
weather = map(weather_sa) do w
(w...,)
end
weather_TA = CuTextureArray(weather)
texture = CuTexture(weather_TA; address_mode = CUDA.ADDRESS_MODE_CLAMP, normalized_coordinates = true, interpolation = CUDA.LinearInterpolation())
## Test Texture interpolation
idx = LinRange(0f0, 1f0, 4000)
idx_gpu = CuArray(idx)
idx_tlu = (1f0-1f0/N)*idx_gpu .+ 0.5f0/N # normalized table lookup form https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
dst_gpu = CuArray{NTuple{4, Float32}}(undef, size(idx))
dst_gpu .= getindex.(Ref(texture), idx_tlu) # interpolation ℝ->ℝ⁴
begin
p1 = scatter(data.altitude, data.altitude, ylabel = "Altitude", label = "Raw", ms = 2, marker = :x)
plot!(idx*data.altitude[end], getindex.(dst_cpu, 1), label = "Texture",lw = 2)
p2 = scatter(data.altitude, data.windx, ylabel = "Wind X", label = "Raw", ms = 2, marker = :x, leg = false)
plot!(idx*data.altitude[end], getindex.(dst_cpu, 2), label = "Texture",lw = 2)
p3 = scatter(data.altitude, data.windy, ylabel = "Wind Y", label = "Raw", ms = 2, marker = :x, leg = false)
plot!(idx*data.altitude[end], getindex.(dst_cpu, 3), label = "Texture",lw = 2)
p4 = scatter(data.altitude, data.density, ylabel = "Density", label = "Raw", ms = 2, marker = :x, leg = false)
plot!(idx*data.altitude[end], getindex.(dst_cpu, 4), label = "Texture",lw = 2)
plot(p1, p2, p3, p4, link=:x, xlabel = "Altitude")
end
EnsembleGPUKernel
+ CuTexture
prototype
Non-working prototype. Note here the get_weather
function is indexing the texture at a single index for a single trajectory which isn't supported by CUDA.jl
. Although this is scalar indexing it should actually be occurring for each trajectory in the ensemble.
function get_weather(tex::CuTexture, z, zmax = data.altitude[end], N = length(data.altitude))
idx = (1f0-1f0/N)*z/zmax + 0.5f0/N # normalized input for table lookup based on https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
tex[idx]
end
eom_tex = (u,p,t) -> ballistic(u,p,t,texture)
prob = ODEProblem{false}(eom_tex, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = p + SVector{3}(rand(dist), 0f0, 0f0))
eprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
esol_gpu = solve(eprob, GPUTsit5(), EnsembleGPUKernel(); trajectories, saveat) #InvalidIR
ContinousCallback Prototype
The above example only does interpolation in the eom. However, interpolation could also occur in evaluating a callback. e.g. something like this
terrain_texture = CuTexture(...) # ℝ² -> ℝ, map xy position to ground elevation
condition(u,t,integrator) = u[3] - terrain_texture[@view u[1:2]]. # check height above elevation
cb = ContinuousCallback(condition, terminate!)
This looks great!
The texture interpolation results should like this

EnsembleGPUKernel
+ DataInterpolations.jl
As a first attempt to try out interpolation within GPU ODE solves, I got the MWE working with EnsembleGPUKernel
+ DataInterpolations.jl
.
using CUDA, DiffEqGPU, OrdinaryDiffEq, Plots, Serialization, StaticArrays, Distributions, LinearAlgebra
import DataInterpolations
const DI = DataInterpolations
## CPU Working Example
# Ballistic model
function ballistic(u, p, t, weather)
CdS, mass, g = p
vel = @view u[4:6]
wind, ρ = get_weather(weather, u[3])
airvelocity = vel - wind
airspeed = norm(airvelocity)
accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)
return SVector{6}(vel..., accel...)
end
trajectories = 100
u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0]
tspan = (0.0f0, 50.0f0)
saveat = LinRange(tspan..., 100)
p = @SVector [25f0, 225f0, 9.807f0]
CdS_dist = Normal(0f0, 1f0)
# Ballistic w/ Interpolated winds via DataInterpolations.jl
data = deserialize("forecast.txt")
N = length(data.altitude)
weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ
SVector{4}(alt, wx, wy, ρ)
end
weather_sa = SVector{length(weather_sa)}(weather_sa)
interp = DI.LinearInterpolation{true}(hcat(weather_sa...),data.altitude)
function get_weather(itp::DI.LinearInterpolation, z)
weather = itp(z)
wind = SVector{3}(weather[2], weather[3], 0f0)
ρ = weather[4]
wind, ρ
end
eom_di = (u,p,t) -> ballistic(u,p,t,interp)
prob = ODEProblem{false}(eom_di, u0, tspan, p)
sol = solve(prob, Tsit5())
prob_func = (prob, i, repeat) -> remake(prob, p = p + SVector{3}(rand(CdS_dist), 0f0, 0f0))
eprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
esol = solve(eprob, Tsit5(); trajectories, saveat)
### solving the ODE on GPU + Interpolation using DataInterpolations
function ballistic_gpu(u, p, t)
CdS, mass, g = p[1]
interp = p[2]
vel = @view u[4:6]
# tt = interp(0.1f0)[1]
# @cushow tt
# wind, ρ = get_weather(interp, u[3], zmax, N)
wind, ρ = get_weather(interp, u[3])
airvelocity = vel - wind
airspeed = norm(airvelocity)
accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)
return SVector{6}(vel..., accel...)
end
prob = ODEProblem(ballistic_gpu, u0, tspan, (p, interp))
prob_func = (prob, i, repeat) -> remake(prob, p = (p + SVector{3}(rand(CdS_dist), 0f0, 0f0), interp))
eprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
esol_gpu = solve(eprob, GPUTsit5(), EnsembleGPUKernel(); trajectories, saveat)
Workaround: Using the default LinearInterpolation
object caused some dynamic function invocation. Hence, I changed the interp.u
to be a matrix instead of a vector of SVector
. Also, I clubbed the interp
buffer with parameters as using the lambda function caused some dynamic function invocation.
Attempt to use Texture Memory
I tried to modify the problem definition and investigated where dynamic invocation was happening. I had to change the building of the problem, as mentioned above. I tried to isolate some pieces here:
using CUDA, DiffEqGPU, OrdinaryDiffEq, Plots, Serialization, StaticArrays, Distributions, LinearAlgebra
import DataInterpolations
const DI = DataInterpolations
trajectories = 100
u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0]
tspan = (0.0f0, 50.0f0)
saveat = LinRange(tspan..., 100)
p = @SVector [25f0, 225f0, 9.807f0]
CdS_dist = Normal(0f0, 1f0)
###
data = deserialize("forecast.txt")
N = length(data.altitude)
weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ
SVector{4}(alt, wx, wy, ρ)
end
weather = map(weather_sa) do w
(w...,)
end
weather_TA = CuTextureArray(weather)
texture = CuTexture(weather_TA; address_mode = CUDA.ADDRESS_MODE_CLAMP, normalized_coordinates = true, interpolation = CUDA.LinearInterpolation())
## Test Texture interpolation
idx = LinRange(0f0, 1f0, 4000)
idx_gpu = CuArray(idx)
idx_tlu = (1f0-1f0/N)*idx_gpu .+ 0.5f0/N # normalized table lookup form https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
dst_gpu = CuArray{NTuple{4, Float32}}(undef, size(idx))
dst_gpu .= getindex.(Ref(texture), idx_tlu) # interpolation ℝ->ℝ⁴
def_zmax = data.altitude[end]
N = length(data.altitude)
@inline function get_weather(tex, z, zmax, N)
idx = (1f0-1f0/N)*z/zmax + 0.5f0/N # normalized input for table lookup based on https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
weather = tex[idx]
wind = SVector{3}(weather[2], weather[3], 0f0)
ρ = weather[4]
wind, ρ
end
### Experimentation
function kernel(texture, idx)
tid = threadIdx().x + (blockIdx().x - 1) * blockDim().x
@cushow texture[idx[tid]][1]
return
end
@cuda kernel(texture, idx_tlu)
function ballistic_t(u, p, t)
CdS, mass, g = p[1]
interp = p[2]
zmax = p[3]
N = p[4]
vel = @view u[4:6]
tt = interp[zmax][1]
@cushow tt
wind, ρ = get_weather(interp, u[3], zmax, N)
airvelocity = vel - wind
airspeed = norm(airvelocity)
accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)
return SVector{6}(vel..., accel...)
end
prob = ODEProblem(ballistic_t, u0, tspan, (p, texture, def_zmax, N))
function kernel_prob(prob)
tt = prob.f(prob.u0, prob.p, 1.f0)
return
end
@cuda kernel_prob(prob) # Works
eprob = EnsembleProblem(prob, safetycopy = false)
esol_gpu = solve(eprob, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, adptive = false, dt = 0.001f0) #InvalidIR
After performing the solve, I get this error:
ERROR: InvalidIRError: compiling kernel #tsit5_kernel(CuDeviceVector{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{6, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Int64, Nothing, Val{true}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to (f::ODEFunction)(args...) in SciMLBase at /home/gridsan/utkarsh/.julia/packages/SciMLBase/QqtZA/src/scimlfunctions.jl:2096)
Stacktrace:
[1] tsit5_kernel
@ ~/.julia/dev/DiffEqGPU/src/perform_step/gpu_tsit5_perform_step.jl:81
This seems a bit strange to me; a simple kernel having prob
as a parameter just worked. So I'm not sure what's breaking here?. The dispatch to solve essentially is cu(probs)
, which is roughly mentioned here: https://docs.sciml.ai/DiffEqGPU/stable/tutorials/lower_level_api/. I can isolate more pieces here, but I'll need a different set of eyes to have a look at it.
Maybe @maleadt can you look at this and help figure out how to use Texture Memory with EnsembleGPUKernel
?
Inspecting this kernel with Cthulhu (using Cthulthu
, @device_code_warntype main()
, press h
to hide non-problematic code and navigate to the problematic code) reveals that the dynamic IR comes from the following call:
• %419 = invoke ODEFunction(::SVector{6, Float32},::Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64},::Float32)::Union{}
That immediately shows the problem: your kernels still contain CPU datastructures (CuTexture, CuTextureArray) which first need to be converted to their GPU counterparts (CuDeviceTextire). Normally this conversion happens automatically when passing such objects to a kernel. In the case of structs containing GPU objects you need to define Adapt rules. It seems that ODEProblem already does so, because the kernel_prob
invocation does already convert ODEProblem-contained GPU objects:
#kernel_prob#23(ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuDeviceTexture{NTuple{4, Float32}, 1, CUDA.ArrayMemory, true, CUDA.LinearInterpolation}, Float32, Int64}, ODEFunction{false, true, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem})
However, the problematic atsit5 invocation below passes a GPU vector of problems, and the CPU-to-GPU object conversion does not happen automatically for array elements (because it would otherwise require a download from GPU->CPU, perform the conversion there, allocate a new array, upload again; making GPU kernel launches unacceptably expensive):
#atsit5_kernel(CuDeviceVector{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, 1}, CuDeviceMatrix{SVector{6, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Float32, Float32, Nothing, Val{false})
The simplest solution here would be to pass a tuple of ODEProblems, because for tuples we can do the conversion efficiently. I'm not sure where that change would be needed, but @utkarsh530 or @ChrisRackauckas probably know where this comes from.
Hmm, passing these ODE problems as a tuple isn't going to work because of their size:
using Adapt
# HACK: force a GPU array of ODE problems to be passed as a tuple
Adapt.adapt_structure(to::CUDA.Adaptor, x::CuArray{<:ODEProblem}) =
tuple(adapt.(Ref(to), Array(x))...)
... yields uses too much parameter space (0x22c4 bytes, 0x1100 max)
. So we can keep the array, but as I mentioned that conversion is expensive:
function Adapt.adapt_structure(to::CUDA.Adaptor, x::CuArray{<:ODEProblem})
# first convert the contained ODE problems
y = CuArray(adapt.(Ref(to), Array(x)))
# continue doing what the default method does
Base.unsafe_convert(CuDeviceArray{eltype(y),ndims(y),CUDA.AS.Global}, y)
end
And with that, the kernel launches and runs 🙂
The array of problems are built here: https://github.com/SciML/DiffEqGPU.jl/blob/70b0820e84c94e94fd9dcdf5c9b81c72be1e62ea/src/DiffEqGPU.jl#L598-L606, and the cu(probs)
dispatch is here: https://github.com/SciML/DiffEqGPU.jl/blob/70b0820e84c94e94fd9dcdf5c9b81c72be1e62ea/src/DiffEqGPU.jl#L690-L696
But yes, I understood your point as to why that wasn't working. We can definitely try to figure out how "expensive" it is, but it works fine currently
Here's the comparison with interp
as DataInterpolations.jl vs CUDA texture memory:
Script
using CUDA, DiffEqGPU, OrdinaryDiffEq, Plots, Serialization, StaticArrays, Distributions, LinearAlgebra
import DataInterpolations
const DI = DataInterpolations
trajectories = 100
u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0]
tspan = (0.0f0, 50.0f0)
saveat = LinRange(tspan..., 100)
p = @SVector [25f0, 225f0, 9.807f0]
CdS_dist = Normal(0f0, 1f0)
###
data = deserialize("forecast.txt")
N = length(data.altitude)
weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ
SVector{4}(alt, wx, wy, ρ)
end
weather = map(weather_sa) do w
(w...,)
end
weather_TA = CuTextureArray(weather)
texture = CuTexture(weather_TA; address_mode = CUDA.ADDRESS_MODE_CLAMP, normalized_coordinates = true, interpolation = CUDA.LinearInterpolation())
## Test Texture interpolation
idx = LinRange(0f0, 1f0, 4000)
idx_gpu = CuArray(idx)
idx_tlu = (1f0-1f0/N)*idx_gpu .+ 0.5f0/N # normalized table lookup form https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
dst_gpu = CuArray{NTuple{4, Float32}}(undef, size(idx))
dst_gpu .= getindex.(Ref(texture), idx_tlu) # interpolation ℝ->ℝ⁴
def_zmax = data.altitude[end]
N = length(data.altitude)
@inline function get_weather(tex, z, zmax, N)
idx = (1f0-1f0/N)*z/zmax + 0.5f0/N # normalized input for table lookup based on https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
weather = tex[idx]
wind = SVector{3}(weather[2], weather[3], 0f0)
ρ = weather[4]
wind, ρ
end
### Experimentation
function ballistic_t(u, p, t)
CdS, mass, g = p[1]
interp = p[2]
zmax = p[3]
N = p[4]
vel = @view u[4:6]
wind, ρ = get_weather(interp, u[3], zmax, N)
airvelocity = vel - wind
airspeed = norm(airvelocity)
accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)
return SVector{6}(vel..., accel...)
end
prob = ODEProblem(ballistic_t, u0, tspan, (p, texture, def_zmax, N))
using Adapt
function Adapt.adapt_structure(to::CUDA.Adaptor, x::CuArray{<:ODEProblem})
# first convert the contained ODE problems
y = CuArray(adapt.(Ref(to), Array(x)))
# continue doing what the default method does
Base.unsafe_convert(CuDeviceArray{eltype(y),ndims(y),CUDA.AS.Global}, y)
end
prob_func = (prob, i, repeat) -> remake(prob, p = (p + SVector{3}(rand(CdS_dist), 0f0, 0f0), texture, def_zmax, N))
eprob_texture = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
esol_gpu = solve(eprob_texture, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
## CPU Working Example
# Ballistic model
trajectories = 100
u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0]
tspan = (0.0f0, 50.0f0)
saveat = LinRange(tspan..., 100)
p = @SVector [25f0, 225f0, 9.807f0]
CdS_dist = Normal(0f0, 1f0)
# Ballistic w/ Interpolated winds via DataInterpolations.jl
data = deserialize("forecast.txt")
N = length(data.altitude)
weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ
SVector{4}(alt, wx, wy, ρ)
end
weather_sa = SVector{length(weather_sa)}(weather_sa)
interp = DI.LinearInterpolation{true}(hcat(weather_sa...),data.altitude)
function get_weather(itp::DI.LinearInterpolation, z)
weather = itp(z)
wind = SVector{3}(weather[2], weather[3], 0f0)
ρ = weather[4]
wind, ρ
end
### solving the ODE on GPU + Interpolation using DataInterpolations
function ballistic_gpu(u, p, t)
CdS, mass, g = p[1]
interp = p[2]
vel = @view u[4:6]
wind, ρ = get_weather(interp, u[3])
airvelocity = vel - wind
airspeed = norm(airvelocity)
accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)
return SVector{6}(vel..., accel...)
end
prob_interp = ODEProblem(ballistic_gpu, u0, tspan, (p, interp))
prob_func = (prob, i, repeat) -> remake(prob_interp, p = (p + SVector{3}(rand(CdS_dist), 0f0, 0f0), interp))
eprob_interp = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
esol_gpu = solve(eprob_interp, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
Benchmarking:
julia> @benchmark @CUDA.sync esol_gpu = solve(eprob_texture, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
BenchmarkTools.Trial: 705 samples with 1 evaluation.
Range (min … max): 5.657 ms … 34.047 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.443 ms ┊ GC (median): 0.00%
Time (mean ± σ): 7.090 ms ± 3.021 ms ┊ GC (mean ± σ): 0.81% ± 3.84%
▃▇█▇▄▁
███████▆▆▅▆▇▅▅▄▄▄▁▁▁▁▁▁▄▁▁▁▄▁▁▄▁▁▄▁▄▁▁▁▁▁▄▁▄▁▄▄▆▁▁▁▁▁▄▁▁▅▅ ▇
5.66 ms Histogram: log(frequency) by time 23.9 ms <
Memory estimate: 850.06 KiB, allocs estimate: 12149.
julia> @benchmark @CUDA.sync esol_gpu = solve(eprob_interp, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
BenchmarkTools.Trial: 393 samples with 1 evaluation.
Range (min … max): 9.966 ms … 48.706 ms ┊ GC (min … max): 0.00% … 48.42%
Time (median): 11.320 ms ┊ GC (median): 0.00%
Time (mean ± σ): 12.722 ms ± 5.161 ms ┊ GC (mean ± σ): 3.32% ± 8.46%
▅▇▇█▆
█████▅▇▅▇▆▄▄▁▁▅▅▄▁▁▄▆▄▅▁▁▁▄▁▄▁▁▄▁▁▁▄▄▄▄▄▁▄▄▄▄▆▄▄▁▁▁▄▄▁▁▁▁▄▅ ▆
9.97 ms Histogram: log(frequency) by time 35.7 ms <
Memory estimate: 4.84 MiB, allocs estimate: 39854.
@utkarsh530 is that the right script? There is no eprob_texture
in the script you included.
Sorry, I just updated it.
I am trying to test this on an A100 and I get the following on all the solves.
julia> esol_gpu = solve(eprob_interp, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
ERROR: UndefKeywordError: keyword argument dt not assigned
Stacktrace:
[1] batch_solve_up_kernel(ensembleprob::EnsembleProblem{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#15#16", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, probs::Vector{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, DataInterpolations.LinearInterpolation{SMatrix{4, 64, Float32, 256}, LinRange{Float32, Int64}, true, Float32}}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_gpu), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}}, alg::GPUTsit5, ensemblealg::EnsembleGPUKernel, I::UnitRange{Int64}, adaptive::Bool; kwargs::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:unstable_check, :saveat), Tuple{DiffEqGPU.var"#13#19", LinRange{Float32, Int64}}}})
@ DiffEqGPU ~/.julia/packages/DiffEqGPU/CiiCq/src/DiffEqGPU.jl:382
[2] batch_solve(ensembleprob::EnsembleProblem{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#15#16", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::GPUTsit5, ensemblealg::EnsembleGPUKernel, I::UnitRange{Int64}, adaptive::Bool; kwargs::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:unstable_check, :saveat), Tuple{DiffEqGPU.var"#13#19", LinRange{Float32, Int64}}}})
@ DiffEqGPU ~/.julia/packages/DiffEqGPU/CiiCq/src/DiffEqGPU.jl:345
[3] macro expansion
@ ./timing.jl:382 [inlined]
[4] __solve(ensembleprob::EnsembleProblem{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#15#16", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::GPUTsit5, ensemblealg::EnsembleGPUKernel; trajectories::Int64, batch_size::Int64, unstable_check::Function, adaptive::Bool, kwargs::Base.Pairs{Symbol, LinRange{Float32, Int64}, Tuple{Symbol}, NamedTuple{(:saveat,), Tuple{LinRange{Float32, Int64}}}})
@ DiffEqGPU ~/.julia/packages/DiffEqGPU/CiiCq/src/DiffEqGPU.jl:254
[5] #solve#33
@ ~/.julia/packages/DiffEqBase/Lq1gG/src/solve.jl:851 [inlined]
[6] top-level scope
@ ~/GPUODEBenchmarks/GPU_ODE_SciML/Texture/wind.jl:132
nvm, up
took care of the issue
A100 results for comparison.
julia> @benchmark @CUDA.sync esol_gpu = solve(eprob_texture, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
BenchmarkTools.Trial: 849 samples with 1 evaluation.
Range (min … max): 5.573 ms … 33.261 ms ┊ GC (min … max): 0.00% … 69.47%
Time (median): 5.712 ms ┊ GC (median): 0.00%
Time (mean ± σ): 5.887 ms ± 1.619 ms ┊ GC (mean ± σ): 1.40% ± 4.25%
▃█▇▅▁ ▁▁▁
▃▆█████▆▃▃▄▃▅▅▅▇▇███▆▄▃▄▃▃▃▂▂▁▁▁▁▂▁▁▁▁▁▁▁▂▁▂▃▂▂▃▂▃▂▂▂▃▂▂▁▂ ▃
5.57 ms Histogram: frequency by time 6.53 ms <
Memory estimate: 848.19 KiB, allocs estimate: 12097.
julia> @benchmark @CUDA.sync esol_gpu = solve(eprob_interp, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
BenchmarkTools.Trial: 379 samples with 1 evaluation.
Range (min … max): 11.605 ms … 47.908 ms ┊ GC (min … max): 0.00% … 70.28%
Time (median): 12.987 ms ┊ GC (median): 0.00%
Time (mean ± σ): 13.203 ms ± 4.656 ms ┊ GC (mean ± σ): 4.67% ± 9.64%
▇ █
█▇█▄▄▁▁▄▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▅ ▆
11.6 ms Histogram: log(frequency) by time 47.1 ms <
Memory estimate: 4.84 MiB, allocs estimate: 39802.
@utkarsh530 I am trying to benchmark the Texture memory for different number of trajectories (above only uses 100) trajectories. However, I am noticing when doing GPUTsit5() for large numbers, e.g. 8388608. I have essentially 0% GPU usage according to watch -n 0.2 nvidia-smi
with random blips to ~25%, however I essentially have 100% CPU usage all the time. Is it possible that it is spending almost all its time constructing the solution on the CPU? If so, can that be multi-threaded?
Generally, that's the case with the EnsembleGPUKernel
that competes to solve on GPU is faster than converting back to CPU arrays + building SciMLSolution
. Can you try to use https://docs.sciml.ai/DiffEqGPU/dev/tutorials/lower_level_api/ ? It does not perform any expensive operations required on the CPU.
script
using Pkg; cd(@__DIR__); Pkg.activate(".")
using CUDA, DiffEqGPU, OrdinaryDiffEq, Plots, Serialization, StaticArrays, Distributions, LinearAlgebra, Adapt
import DataInterpolations
const DI = DataInterpolations
function ballistic(u, p, t)
CdS, mass, g = p[1]
interp = p[2]
zmax = p[3]
N = p[4]
vel = @view u[4:6]
wind, ρ = get_weather(interp, u[3], zmax, N)
airvelocity = vel - wind
airspeed = norm(airvelocity)
accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)
return SVector{6}(vel..., accel...)
end
@inline function get_weather(tex, z, zmax, N)
idx = (1f0-1f0/N)*z/zmax + 0.5f0/N # normalized input for table lookup based on https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
weather = tex[idx]
wind = SVector{3}(weather[2], weather[3], 0f0)
ρ = weather[4]
wind, ρ
end
@inline function get_weather(itp::DI.LinearInterpolation, z, zmax, N)
weather = itp(z)
wind = SVector{3}(weather[2], weather[3], 0f0)
ρ = weather[4]
wind, ρ
end
function Adapt.adapt_structure(to::CUDA.Adaptor, x::CuArray{<:ODEProblem})
# first convert the contained ODE problems
y = CuArray(adapt.(Ref(to), Array(x)))
# continue doing what the default method does
Base.unsafe_convert(CuDeviceArray{eltype(y),ndims(y),CUDA.AS.Global}, y)
end
# build interpolants
data = deserialize(joinpath(@__DIR__,"data","forecast.txt"))
N = length(data.altitude)
def_zmax = data.altitude[end]
weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ
SVector{4}(alt, wx, wy, ρ)
end
weather_sa = SVector{length(weather_sa)}(weather_sa)
interp = DI.LinearInterpolation{true}(hcat(weather_sa...),data.altitude)
weather = map(weather_sa) do w
(w...,)
end
weather_TA = CuTextureArray(weather)
texture = CuTexture(weather_TA; address_mode = CUDA.ADDRESS_MODE_CLAMP, normalized_coordinates = true, interpolation = CUDA.LinearInterpolation())
### Simulation parameters
trajectories = 10_000
u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0]
tspan = (0.0f0, 40.0f0)
saveat = LinRange(tspan..., 100)
p = @SVector [25f0, 225f0, 9.807f0]
p_tx = (p, texture, def_zmax, N)
p_di = (p, interp, def_zmax, N)
CdS_dist = Normal(0f0, 1f0)
prob_func = (prob, i, repeat) -> remake(prob, p = (p + SVector{3}(rand(CdS_dist), 0f0, 0f0), prob.p[2:end]...))
### Texture Solve Test
# High Level
prob_tx = ODEProblem(ballistic, u0, tspan, p_tx)
eprob_tx = EnsembleProblem(prob_tx; prob_func, safetycopy = false)
@time esol_gpu = solve(eprob_tx, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
# Low Level
@time begin
probs_tex = map(1:trajectories) do i
prob_func(prob_tx, i, false)
end |> cu
ts,us = DiffEqGPU.vectorized_asolve(probs_tex, prob_tx, GPUTsit5(); saveat)
end
### DI Solve Test
# High Level
prob_di = ODEProblem(ballistic, u0, tspan, p_di)
eprob_di = EnsembleProblem(prob_di; prob_func, safetycopy = false)
@time esol_di = solve(eprob_di, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
# Low Level
@time begin
probs_di = map(1:trajectories) do i
prob_func(prob_di, i, false)
end |> cu
ts,us = DiffEqGPU.vectorized_asolve(probs_di, prob_di, GPUTsit5(); saveat)
end
# trajs = 8*4 .^(0:9)
using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.samples = 3
trajs = 8*4 .^(0:8)
times = map(trajs[end]) do traj
@show traj
tx_lo = @benchmark @CUDA.sync begin
probs_tex = map(1:$traj) do i
prob_func(prob_tx, i, false)
end |> cu
ts,us = DiffEqGPU.vectorized_asolve(probs_tex, prob_tx, GPUTsit5(); $saveat)
end
display(tx_lo)
di_lo = @benchmark @CUDA.sync begin
probs_di = map(1:$traj) do i
prob_func(prob_di, i, false)
end |> cu
ts,us = DiffEqGPU.vectorized_asolve(probs_di, prob_di, GPUTsit5(); $saveat)
end
display(di_lo)
tx = @benchmark @CUDA.sync esol_gpu = solve(eprob_tx, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories = $traj, saveat = $saveat)
display(tx)
di = @benchmark @CUDA.sync esol_gpu = solve(eprob_di , GPUTsit5(), EnsembleGPUKernel(0.0); trajectories = $traj, saveat = $saveat)
display(di)
dicpu = @benchmark esol_cpu = solve(eprob_di , Tsit5(), EnsembleThreads(); trajectories = $traj, saveat = $saveat)
display(dicpu)
(tx = minimum(tx.times) / 1e6,
di = minimum(di.times) / 1e6,
dicpu = minimum(dicpu.times) / 1e6,
tx_lo = minimum(tx_lo.times) / 1e6,
di_lo = minimum(di_lo.times) / 1e6)
end
using Plots
begin
plt = plot(trajs, getindex.(times, :tx), label = "Texture", marker = :utriangle, legend = :topleft, yaxis = :log, xaxis=:log)
plot!(trajs, getindex.(times, :di), label = "Software", marker = :ltriangle)
plot!(trajs, getindex.(times, :dicpu), label = "CPU", marker = :square)
plot!(trajs, getindex.(times, :tx_lo), label = "Texture Low-Level", marker = :dtriangle, legend = :topleft)
plot!(trajs, getindex.(times, :di_lo), label = "Software Low-Level", marker = :rtriangle)
xlabel!("Trajectories")
ylabel!("(ms)")
plt
end

The low-level times includes the time to create and copy the probs. CPU times is using EnsembleThreads()
on a 12 core machine. GPU is A100 40GB.
For say 524288 trajectories, if I time just the solve with the low level interface, I get
@time ts,us = DiffEqGPU.vectorized_asolve(probs_tex, prob_tx, GPUTsit5(); saveat);
14.335337 seconds (51.38 M allocations: 1.992 GiB)
During this low-level solve I get 100% CPU usage on a single core and 0% usage on the GPU until right before the solve completes where it blips to ~30%.
However, the problematic atsit5 invocation below passes a GPU vector of problems, and the CPU-to-GPU object conversion does not happen automatically for array elements (because it would otherwise require a download from GPU->CPU, perform the conversion there, allocate a new array, upload again; making GPU kernel launches unacceptably expensive):
I am not sure, but the 100% CPU usage could be due to expensive GPU kernel launches requiring multiple uploads to GPU and CPU due to the reason Tim pointed out earlier. Even the scaling of the plot with trajectories seems a bit off, compared to plain ODE solves benchmarks. IMHO, we should only compare the time spent on solving the ODE rather than setup (which is generating GPU arrays of ODEProblem
being parallelized before the solve, similar to just generating an array of parameters before and uploading them on GPU. (A common setup in other programs)).
I think @maleadt might be able to comment better here.
After reviewing @maleadt's earlier comments, I wonder if makes sense to have a way to pass some parameters that are "global" to all ensembles or as Refs, so that the conversion needs to only happen once.
So, I've been experimenting with different options for passing the CuTexture
w/o incurring the huge conversion overhead and I am at a bit of a loss for what I am observing. Because the texture is constant for all problems in the ensemble, I would think that I should be able to just write an ode in the form eom(u, p, t, texture)
and do a closure over the texture. However, I still get a dynamic invocation with CuTexture
instead of CuDeviceTexture
.
NOTE: I am trying to avoid using the Adapt rule above due to the huge conversion overhead.
Here is a MWE demonstrating the issue
First, create a texture that is just 0 everywhere and verify that it can be used in a closure w/ proper conversions. This works as expected.
using CUDA, DiffEqGPU, OrdinaryDiffEq, StaticArrays, Adapt
data = map(1:5000) do w
(zeros(Float32, 4)...,)
end
texture = CuTexture( CuTextureArray(data); address_mode = CUDA.ADDRESS_MODE_CLAMP,
normalized_coordinates = true, interpolation = CUDA.LinearInterpolation())
idx_gpu = CuArray(LinRange(0f0, 1f0, 4000))
dst_gpu = CuArray{NTuple{4, Float32}}(undef, size(idx_gpu))
function interp!(dst, idx, tex)
dst .= getindex.(Ref(tex), idx)
end
cl_let = let tex = texture
(d,i) -> interp!(d, i, tex)
end
cl_let(dst_gpu, idx_gpu);
Next, lets define a simple ODE that accepts a texture as an argument but does nothing with it and solves over a closure of tex
function eom(u, p, t, tex)
return @SVector zeros(Float32, 4)
end
trajectories = 8192
u0 = @SVector zeros(Float32, 4)
tspan = (0.0f0, 40.0f0)
saveat = LinRange(tspan..., 100)
prob_func = (prob, i, repeat) -> remake(prob, u0 = @SVector randn(Float32, 4))
cl = let tex = texture
(x,p,t)->eom(x, p, t, tex)
end
prob = ODEProblem(cl, u0, tspan)
eprob = EnsembleProblem(prob; prob_func, safetycopy = false)
esol = solve(eprob, GPUTsit5(), EnsembleGPUKernel(0.0); adaptive= true, trajectories, saveat)
This solves with no issue. Note: no adapt rule was defined.
Next, lets solve similarly but with an ODE that uses the texture
function eom_tex(u, p, t, tex)
@inbounds w = tex[0.5] #NTuple of zeros
return SVector(w...)
end
cl_tex = let tex = texture
(x,p,t)->eom_tex(x, p, t, tex)
end
prob_tex = ODEProblem(cl_tex, u0, tspan)
eprob_tex = EnsembleProblem(prob_tex; prob_func, safetycopy = false)
esol_tex = solve(eprob_tex, GPUTsit5(), EnsembleGPUKernel(0.0); adaptive= true, trajectories, saveat)
leading to
julia> esol_tex = solve(eprob_tex, GPUTsit5(), EnsembleGPUKernel(0.0); adaptive= true, trajectories, saveat)
ERROR: InvalidIRError: compiling kernel #atsit5_kernel(CuDeviceVector{ODEProblem{false,SVector{4, Float32},Tuple{Float32, Float32},…}, 1}, CuDeviceMatrix{SVector{4, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Float32, Float32, CuDeviceVector{Float32, 1}, Val{false}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to (f::ODEFunction)(args...) in SciMLBase
On inspection I see we still have a CuTexture
in the function call
• %2 = invoke eom_tex(::SVector{4, Float32},::SciMLBase.NullParameters,::Float32,::CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}})::Union{}
Why? It seems like the proper conversion occurred in the other examples using a closure over the texture. If I add the adapt rule, then it works. However it spends almost all the time converting. I don't understand why it is needed though if the texture is not an explicit parameter and is used in a closure instead.
Sorry for the slow response, I needed some time to catch up with my GH notifications 🙂
Why? It seems like the proper conversion occurred in the other examples using a closure over the texture.
That's because in those cases you were invoking a closure directly, and Adapt.jl has rules (Adapt.adapt_structure
methods) to recursively convert the captures of a closure. Those rules are then used by CUDA.jl to convert the function (closure) you're invoking, as well as any of its arguments, to a GPU-compatible representation (CuTexture->CuDeviceTexture, etc).
Here, however, a kernel is invoked with an EnsembleProblem argument, which contains an ODEProblem, which contains an ODEFunction, which contains a closure that captures a CuTexture and a CuTextureArray. Although CUDA will use Adapt to try and convert such an argument to a device-compatible representation, there are no Adapt rules defined for these types of objects, so the conversion is a no-op.
So basically, there would need to be Adapt rules for each of these types so that kernel conversion recurses into the objects when the EnsembleProblem is passed to a kernel. Alternatively, the code constructing an EnsembleProblem could manually call cudaconvert
on the fields that need to be converted, but depending on how that is done it would still need rules for all but the outermost object type (and also breaks platform portability).
One simple way to add Adapt rules is to use @adapt_structure
, but that relies on default constructors being present, and that's not the case for ODEFunction:
julia> Adapt.@adapt_structure EnsembleProblem
julia> Adapt.@adapt_structure ODEProblem
julia> Adapt.@adapt_structure ODEFunction
julia> cudaconvert(eprob_tex)
ERROR: MethodError: no method matching ODEFunction(::var"#14#15"{CuDeviceTexture{NTuple{4, Float32}, 1, CUDA.ArrayMemory, true, CUDA.LinearInterpolation}}, ::LinearAlgebra.UniformScaling{Bool}, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::typeof(SciMLBase.DEFAULT_OBSERVED), ::Nothing, ::Nothing)