Support pre-allocating result storage
Watching memory usage climb and hoping I don't run out has been a common experience during long-running simulations. Quality-of-life issues aside, GPU support will pretty much require that memory is allocated upfront to some extent. The low-hanging fruit is the case where dense=false and saveat is specified, since that trivially determines the total storage requirement. I don't have a good API proposal for the more general case, but perhaps we can get a discussion started.
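For example, with the existing common-interface keywords, the saveat case fixes the number of outputs before the solve begins (a minimal illustration):

using OrdinaryDiffEq

prob = ODEProblem((u, p, t) -> -u, 1.0, (0.0, 1.0))
# With saveat, the number of save points (here 11) is known before solving,
# so the total storage requirement is determined upfront.
sol = solve(prob, Tsit5(); saveat = 0.0:0.1:1.0)
@assert length(sol.t) == 11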
Note that when saveat is specified, dense=false is automatically set.
But yes, this is something we need to re-design. There is currently something in place for this:
https://github.com/JuliaDiffEq/OrdinaryDiffEq.jl/blob/master/src/solve.jl#L13-L15
It's undocumented and not officially part of the API. Essentially what you want to pre-allocate is u, t, and k. u and t are self-explanatory; k is the data for the interpolations, which is algorithm-dependent and undocumented. I wonder if it should all instead just have some cache-generation function and a single type to throw around. It should definitely be a keyword argument.
A cache-generation function and something like preallocate=false as a keyword argument to the common solver interface would be great. I don't know anything about the interpolation part, but at least in the saveat case, where interpolation is irrelevant, this should be a pretty easy add, right? Are there any other complexities I'm glossing over?
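For discussion's sake, one possible shape, purely hypothetical (generate_cache and the cache keyword are made-up names, not existing API):

# Hypothetical API sketch only -- `generate_cache` and the `cache` keyword
# do not exist; this just illustrates the proposal above.
cache = generate_cache(prob, Tsit5(); saveat = 0.0:0.1:1.0) # preallocates u, t, k
sol = solve(prob, Tsit5(); saveat = 0.0:0.1:1.0, cache = cache)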
Nope, that's it.
What's the state of this now, @ChrisRackauckas? I find that saving via dense or saveat can be an annoying source of allocations and unpredictable GC pauses when solving many ODEs. It seems that saving always allocates a new vector for u at every step.
Is there a smarter strategy? Would it be possible to allocate batches of, say, 100 vectors in a single operation? Or to let users preallocate vectors to hold the solutions?
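For reference, a single allocation can already back a batch of state vectors in plain Julia, e.g. one matrix whose columns act as the batch (no solver involvement assumed here):

# One allocation backing a batch of 100 length-200 vectors: the matrix owns
# the memory, and the views are lightweight column handles into it.
block = Matrix{Float64}(undef, 200, 100)
slots = [view(block, :, j) for j in 1:100]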
It is possible to emulate something like this using callbacks, e.g.:
using OrdinaryDiffEq, BenchmarkTools, Test

# 1) allocate inside the solver at every time step
function f(du, u, p, t)
    for i in eachindex(u)
        du[i] = u[i]
    end
end

u0 = [Float64(i) for i in 1:200] # [1.0, 2.0, ..., 200.0]
tspan = (0.0, 1.0)
abstol = reltol = 1e-20 # force many steps
prob1 = ODEProblem(f, u0, tspan)

# Solve once outside the benchmark so sol1 is available for sizing below
# (variables assigned inside @benchmark are local to the benchmark).
sol1 = solve(prob1, Tsit5(); abstol, reltol, dense = false)

b1 = @benchmark begin
    solve(prob1, Tsit5(); abstol, reltol, dense = false)
end seconds = 10

# 2) preallocate the output once before solving
condition = (u, p, t) -> true # trigger at every time step
save! = integrator -> begin
    # copy the current state into the preallocated column for this step
    out = only(integrator.p)
    @inbounds out[:, integrator.iter] .= integrator.u
end
callback = DiscreteCallback(condition, save!)

out = zeros(length(u0), length(sol1) - 1) # all steps except the initial time
prob2 = ODEProblem(f, u0, tspan, (out,))

b2 = @benchmark begin
    solve(prob2, Tsit5(); abstol, reltol, callback,
          save_start = false, save_end = false,
          save_everystep = false, save_on = false, dense = false)
end seconds = 10

#@test isequal(sol1[:, 2:end][:, :], out) # passes
Case 1 (allocating at every step):
BenchmarkTools.Trial: 243 samples with 1 evaluation per sample.
Range (min … max): 25.540 ms … 98.649 ms ┊ GC (min … max): 0.00% … 70.57%
Time (median): 40.364 ms ┊ GC (median): 14.39%
Time (mean ± σ): 41.183 ms ± 9.586 ms ┊ GC (mean ± σ): 14.67% ± 13.46%
[histogram: frequency by time, 25.5 ms … 76.7 ms]
Memory estimate: 44.37 MiB, allocs estimate: 54955.
Case 2 (preallocated output):
BenchmarkTools.Trial: 365 samples with 1 evaluation per sample.
Range (min … max): 24.730 ms … 38.665 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 26.738 ms ┊ GC (median): 0.00%
Time (mean ± σ): 27.428 ms ± 2.262 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
[histogram: frequency by time, 24.7 ms … 36.7 ms]
Memory estimate: 28.19 KiB, allocs estimate: 63.
These timings don't include the one big preallocation in the second case. For some applications that makes sense: preallocate once, then solve (multiple) ODEs. Done this way it is a bit faster, and the timings are more consistent because the GC is never invoked.
We should extend the new alias argument struct to have these pieces, and that is the form that should get documented.
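A purely hypothetical sketch of what that documented form might look like (the storage field names here are invented to illustrate the shape, not existing API):

# Hypothetical only: `u_storage` and `t_storage` are made-up fields showing
# how preallocated output could ride along with the alias specifier struct.
us = [similar(u0) for _ in 1:1000]
ts = Vector{Float64}(undef, 1000)
sol = solve(prob, Tsit5();
            alias = ODEAliasSpecifier(u_storage = us, t_storage = ts))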