ExpmV.jl
Optimize using in-place math; small cleanups
I was recently using ExpmV.jl for solving a Lindblad master equation for a time-independent system of coupled harmonic oscillators, i.e. a very sparse A that is constant over many expmv iterations. In this use case, the loop part of expmv_fun was definitely the bottleneck for my whole application, given that I could pre-compute the select_taylor_degree result.
This PR changes the code to explicitly allocate some copies of vectors before the loop and then use in-place operations to avoid excessive GC usage (according to @time, my program churned through something like 2 TB of GC memory in a couple of minutes). It also replaces the calculation of the infinity norm with the iamax BLAS function. All in all, this leads to a several-fold performance gain in my use case.
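For what it's worth, the pattern is roughly the following (only a minimal sketch with made-up names, not the actual expmv_fun code; the real algorithm also does scaling and tolerance-based early termination):
using LinearAlgebra

# Hypothetical sketch of the allocation-hoisting idea: allocate the work
# vectors once, then reuse them with in-place operations, so the Taylor loop
# itself allocates almost nothing. Works for sparse or dense A.
function taylor_apply!(f, b, tmp, A, t, m)
    copyto!(f, b)              # f accumulates the series, starting from b
    for k in 1:m
        mul!(tmp, A, b)        # tmp = A*b without allocating a fresh vector
        b, tmp = tmp, b        # swap the two scratch buffers instead of copying
        rmul!(b, t / k)        # scale in place: b = (t/k)*b
        f .+= b                # accumulate in place
    end
    return f
end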
There is likely still quite a bit more to optimize here, both in terms of performance and code style. I'm new to Julia, so I'd appreciate any pointers.
In fact, my simulation is so limited by the sparse-matrix-dense-vector multiplication inside the expmv_fun loop that I hacked together a patch to use the SuiteSparse/CHOLMOD implementation of that for another small speed gain (see my fork). It's really ugly at this point, though, and should really be fixed in Base instead.
Thanks for the contribution! I'll look at the code a bit more carefully before merging, but it looks interesting.
I also have a branch where I properly implement 1-norm estimation, but I am running into what appear to be numerical instabilities for sufficiently large dimensions and densities; once that is sorted out, it should lead to big performance gains.
Going to submit the iamax optimization to Base.
Regarding the 1-norm estimation: This would definitely be useful for me in the future, but since I can cache the select_taylor_degree call for a number of iterations right now, it is currently not super important performance-wise.
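Concretely, the caching I mean is nothing more than this pattern (A, x, dt and nsteps are placeholder names):
M = ExpmV.select_taylor_degree(A, 1)   # expensive, but A is constant, so compute it once
for step in 1:nsteps
    x = ExpmV.expmv(dt, A, x, M=M)     # reuse the cached degree selection every step
end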
D'oh, forgot that the notion of the absolute value of a complex number in BLAS is abs(re) + abs(im) instead of the mathematical definition. The iamax optimization is thus not valid for complex numbers. :/
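For the record, a small example of the discrepancy (written with the current LinearAlgebra stdlib names; the module paths differed back then):
using LinearAlgebra

v = [3.0 + 3.0im, 4.5 + 0.0im]
BLAS.iamax(v)      # -> 1, because izamax ranks elements by |re| + |im| = 6.0 vs 4.5
argmax(abs.(v))    # -> 2, because the true moduli are ≈4.24 vs 4.5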
Is there any reason why this PR stalled?
Lack of time.
@ChrisRackauckas: Your points are valid, but changing the API is a whole separate story from the transparent optimisations I did last year.
I've updated the benchmarking code, and it is not entirely clear if the changes you made make much of a difference. That being said, the cleanups are probably improvements, so I will look at it a bit more carefully, and will likely merge your changes.
This was a long time ago, so I'm a bit hazy on the details. IIRC you should see a massive reduction in the number of GC allocations, at least for precomputed select_taylor_degree.
For context, below is some code I quickly threw together last year while prototyping the physics code I mentioned. (Yes, the code quality is quite horrible, but it was a quick hack trying to figure out how to translate the physics into reasonably performant Julia before writing the actual thing.) The Qu* stuff is just a thin wrapper around complex Arrays; in the end, times is 2001 (i.e. 2000 iterations with the same select_taylor_degree), and l is a 40000x40000 complex matrix with a sparsity of about 1e-4.
using ExpmV
using QuBase
using QuDynamics
super_pre_post(pre, post) = kron(transpose(post), pre)
dissipator(a) = super_pre_post(a, a') - 0.5 * super_pre_post(a'a, speye(size(a)...)) - 0.5 * super_pre_post(speye(size(a)...), a'a)
super_hamiltonian(h) = -1im * (super_pre_post(h, speye(size(h)...)) - super_pre_post(speye(size(h)...), h))
function lindblad_op(h, collapse_ops::Vector)
    super = super_hamiltonian(coeffs(h))
    for c_op in collapse_ops
        super += dissipator(coeffs(c_op))
    end
    super
end
function sim(n)
    Δ = 1e-5 * 2π
    ϵ = 0.5 * 1e-2 #2π
    κ = 15 * 2π
    k = -0.7κ/(2 * sqrt(3))
    times = 0.0:10:20000
    ψi = complex(statevec(2, FiniteBasis(n)))
    a = lowerop(n)
    h = Δ * a'a + ϵ * (a' + a) # + k * (a' * a' + a * a)
    l = lindblad_op(h, [sqrt(κ) * a])
    ρ = ψi * ψi'
    dims_ρ = size(ρ)
    Type_ρ = QuBase.similar_type(ρ)
    bases_ρ = bases(ρ)
    ρvec = coeffs(vec(full(ρ)))
    nbars = Array(Float64, 0)
    @time precalc = ExpmV.select_taylor_degree(l, 1)
    for (t0, t1) in zip(times[1:end-1], times[2:end])
        ρvec = ExpmV.expmv(t1 - t0, l, ρvec, M=precalc)
        ρ = Type_ρ(reshape(ρvec, dims_ρ), bases_ρ)
        push!(nbars, real(trace(ρ * a'a)))
    end
    return times[2:end], nbars, ρ
end
@time times, nbars, ρfin = sim(200)
(typeof(l),size(l),nnz(l)) = (SparseMatrixCSC{Complex{Float64},Int64},(40000,40000),199000)
Thanks, that helps quite a bit. Under the new benchmarking I can see your pull request reducing median memory usage by ~50% and median runtime by ~10%, so it does seem to lead to significant improvements.
I just had a quick glance over the benchmarks, and it seems like you don't have any for the precomputed case? It's been a year since I looked at the time profiles, but IIRC select_taylor_degree would be quite expensive for the above use case if it weren't for the 2000 iterations.
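For reference, a precomputed-case benchmark could look roughly like this (just a sketch; the matrix size and sparsity are arbitrary placeholders):
using BenchmarkTools, ExpmV, SparseArrays

A = sprandn(1000, 1000, 1e-3) + 1im * sprandn(1000, 1000, 1e-3)
b = randn(ComplexF64, 1000)
M = ExpmV.select_taylor_degree(A, 1)        # precompute once, as in the caching use case

@benchmark ExpmV.expmv(1.0, $A, $b)         # baseline: degree selection inside every call
@benchmark ExpmV.expmv(1.0, $A, $b, M=$M)   # precomputed case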
That's right, I haven't gotten around to the precomputed case yet, but there is already some gain.
Ah, sure, just wanted to make sure it was clear in which scenario the "several-fold speedup" remark from my original post applies.