
Optimize using in-place math; small cleanups

Open dnadlinger opened this pull request 9 years ago • 13 comments

I was recently using ExpmV.jl to solve a Lindblad master equation for a time-independent system of coupled harmonic oscillators, i.e. a very sparse A that stays constant over many expmv iterations. In this use case, the loop part of expmv_fun was definitely the bottleneck for my whole application, given that I could pre-compute the select_taylor_degree result.

This PR changes the code to explicitly allocate a few copies of the vectors before the loop and then use in-place operations, to avoid excessive GC pressure (according to @time, my program churned through something like 2 TB of GC memory in a couple of minutes). It also replaces the calculation of the infinity norm with the iamax BLAS function. All in all, this yields a several-fold performance gain in my use case.
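For illustration, here is a minimal sketch of that pattern (written in present-day Julia syntax, with made-up names; this is not the PR's actual diff): allocate the work vectors once before the Taylor loop, then use in-place sparse mat-vec and broadcasted updates so that the loop body itself allocates nothing.

```julia
using LinearAlgebra, SparseArrays

# Truncated Taylor approximation of exp(t*A)*b with pre-allocated buffers.
# `u` and `v` are allocated once and swapped, so the loop allocates no new arrays.
function taylor_expmv!(f::AbstractVector, t, A::SparseMatrixCSC, b::AbstractVector, m::Integer)
    copyto!(f, b)
    u = copy(b)        # current Taylor term
    v = similar(b)     # scratch buffer for the next term
    for j in 1:m
        mul!(v, A, u)  # v = A*u in place, no fresh allocation per step
        v .*= t / j    # j-th term of the series: (t*A)^j * b / j!
        f .+= v        # accumulate into the result
        u, v = v, u    # swap buffers instead of copying
    end
    return f
end

# Usage: f = similar(b); taylor_expmv!(f, 0.1, sprand(1_000, 1_000, 1e-3), rand(1_000), 20)
```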

There is likely still quite a bit more to optimize here, both in terms of performance and code style. I'm new to Julia, so I'd appreciate any pointers.

dnadlinger avatar Oct 22 '15 19:10 dnadlinger

In fact, my simulation is so limited by the sparse-matrix–dense-vector multiplication inside the expmv_fun loop that I hacked together a patch to use the SuiteSparse/CHOLMOD implementation of that product, for another small speed gain (see my fork). It's really ugly at this point, though, and should properly be fixed in Base instead.

dnadlinger avatar Oct 22 '15 19:10 dnadlinger

Thanks for the contribution! I'll look at the code a bit more carefully before merging, but it looks interesting.

I also have a branch where I properly implement 1-norm estimation, but I am running into what appear to be numerical instabilities for sufficiently large dimensions and densities; once that is sorted out, it should lead to big performance gains.
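For anyone following along, the underlying idea is Hager-style 1-norm estimation: approximate ‖A‖₁ from a handful of products with A and A' instead of scanning the whole matrix. A minimal real-valued sketch (hypothetical names; not the branch's code, which also has to handle the complex case):

```julia
using LinearAlgebra, SparseArrays

# Hager-style lower bound on opnorm(A, 1), using only mat-vec products with
# A and A', so it also works when A is available only as an operator.
function est_opnorm1(A; maxiter=5)
    n = size(A, 2)
    x = fill(1.0 / n, n)
    est = 0.0
    for _ in 1:maxiter
        y = A * x
        est = norm(y, 1)
        z = A' * sign.(y)
        j = argmax(abs.(z))
        # Stop once no coordinate direction can improve the estimate.
        abs(z[j]) <= dot(z, x) && break
        fill!(x, 0.0)
        x[j] = 1.0
    end
    return est
end

# est_opnorm1(sprand(500, 500, 0.01))  # a lower bound on opnorm(Matrix(A), 1), usually close
```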

marcusps avatar Oct 23 '15 20:10 marcusps

Going to submit the iamax optimization to Base.

Regarding the 1-norm estimation: This would definitely be useful for me in the future, but since I can cache the select_taylor_degree call for a number of iterations right now, it is currently not super important performance-wise.

dnadlinger avatar Oct 27 '15 08:10 dnadlinger

D'oh, forgot that the notion of the absolute value of a complex number in BLAS is abs(re) + abs(im) instead of the mathematical definition. The iamax optimization is thus not valid for complex numbers. :/
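A quick illustration of the mismatch (made-up example, just to make the caveat concrete):

```julia
using LinearAlgebra

z = [3.0 + 4.0im, 6.0 + 0.0im]

abs.(z)        # [5.0, 6.0]  -> mathematically, the second entry is the largest
BLAS.iamax(z)  # 1           -> BLAS ranks by |re| + |im|, i.e. [7.0, 6.0]

# So abs(z[BLAS.iamax(z)]) does not give the infinity norm for complex vectors,
# whereas for real vectors the shortcut is fine.
```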

dnadlinger avatar Oct 30 '15 23:10 dnadlinger

Is there any reason why this PR stalled?

ChrisRackauckas avatar Nov 19 '16 01:11 ChrisRackauckas

Lack of time.

marcusps avatar Nov 19 '16 03:11 marcusps

@ChrisRackauckas: Your points are valid, but changing the API is a whole separate story from the transparent optimisations I did last year.

dnadlinger avatar Nov 19 '16 18:11 dnadlinger

I've updated the benchmarking code, and it is not entirely clear whether the changes you made make much of a difference. That said, the cleanups are probably improvements, so I will look at it a bit more carefully and will likely merge your changes.

marcusps avatar Nov 26 '16 20:11 marcusps

This was a long time ago, so I'm a bit hazy on the details. IIRC you should see a massive reduction in the number of GC allocations, at least for a precomputed select_taylor_degree.

For context, below is some code I quickly threw together last year while prototyping the physics code I mentioned. (Yes, the code quality is quite horrible, but it was a quick hack to figure out how to translate the physics into reasonably performant Julia before writing the actual thing.) The Qu* stuff is just a thin wrapper around complex Arrays; in the end, times has 2001 entries (i.e. 2000 iterations with the same select_taylor_degree), and l is a 40000×40000 complex matrix with a sparsity of about 1e-4.
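As a quick reminder of why the Lindblad construction below is all Kronecker products: with column-major vectorization, $\operatorname{vec}(B\rho C) = (C^{\mathsf T} \otimes B)\operatorname{vec}(\rho)$, so the master equation $\dot\rho = -i[H,\rho] + \sum_k \big(a_k\rho a_k^\dagger - \tfrac12\{a_k^\dagger a_k,\rho\}\big)$ turns into a linear ODE $\dot{\vec\rho} = L\vec\rho$ with solution $\vec\rho(t+\Delta t) = e^{L\,\Delta t}\vec\rho(t)$, which is exactly what the expmv call in the loop computes.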

```julia
using ExpmV
using QuBase
using QuDynamics

# superoperator for ρ ↦ pre * ρ * post, acting on vec(ρ) (column-major vectorization)
super_pre_post(pre, post) = kron(transpose(post), pre)

# Lindblad dissipator D[a] as a superoperator
dissipator(a) = super_pre_post(a, a') - 0.5 * super_pre_post(a'a, speye(size(a)...)) - 0.5 * super_pre_post(speye(size(a)...), a'a)

# -1im * [h, ·] as a superoperator
super_hamiltonian(h) = -1im * (super_pre_post(h, speye(size(h)...)) - super_pre_post(speye(size(h)...), h))

function lindblad_op(h, collapse_ops::Vector)
    super = super_hamiltonian(coeffs(h))
    for c_op in collapse_ops
       super += dissipator(coeffs(c_op))
    end
    super
end

function sim(n)
    Δ = 1e-5 * 2π
    ϵ = 0.5 * 1e-2 #2π
    κ = 15 * 2π
    k = -0.7κ/(2 * sqrt(3))
    times = 0.0:10:20000

    ψi = complex(statevec(2, FiniteBasis(n)))
    a = lowerop(n)
    h = Δ * a'a + ϵ * (a' + a) # + k * (a' * a' + a * a)
    l = lindblad_op(h, [sqrt(κ) * a])
    
    ρ = ψi * ψi'
    dims_ρ = size(ρ)
    Type_ρ = QuBase.similar_type(ρ)
    bases_ρ = bases(ρ)
    ρvec = coeffs(vec(full(ρ)))

    nbars = Array(Float64, 0)
    @time precalc = ExpmV.select_taylor_degree(l, 1) # computed once, reused for every step below
    for (t0, t1) in zip(times[1 : end-1], times[2:end])
        ρvec = ExpmV.expmv(t1 - t0, l, ρvec, M=precalc)
        ρ = Type_ρ(reshape(ρvec, dims_ρ), bases_ρ)
        push!(nbars, real(trace(ρ * a'a))) # record the mean photon number ⟨a'a⟩
    end

    return times[2:end], nbars, ρ
end

@time times, nbars, ρfin = sim(200)

(typeof(l),size(l),nnz(l)) = (SparseMatrixCSC{Complex{Float64},Int64},(40000,40000),199000)
```

dnadlinger avatar Nov 26 '16 21:11 dnadlinger

Thanks, that helps quite a bit. Under the new benchmarking I can see your pull request reducing median memory usage by ~50% and median runtime by ~10%, so it does seem to lead to significant improvements.


marcusps avatar Nov 27 '16 03:11 marcusps

I just had a quick glance over the benchmarks, and it seems like you don't have any for the precomputed case? It's been a year since I looked at the time profiles, but IIRC select_taylor_degree would be quite expensive for the above use case if it weren't for the 2000 iterations.
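In case it helps, something along these lines might cover it, i.e. timing expmv with and without a reused select_taylor_degree result (made-up problem size; call forms as used in the snippet above, BenchmarkTools assumed):

```julia
using BenchmarkTools, SparseArrays
using ExpmV

A = sprand(ComplexF64, 2_000, 2_000, 1e-3)
b = rand(ComplexF64, 2_000)
M = ExpmV.select_taylor_degree(A, 1)

@btime ExpmV.expmv(0.1, $A, $b)        # degree selection runs inside every call
@btime ExpmV.expmv(0.1, $A, $b, M=$M)  # precomputed case: M reused across calls
```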

dnadlinger avatar Nov 27 '16 03:11 dnadlinger

That's right, I haven't gotten around to the precomputed case yet, but there is already some gain.


marcusps avatar Nov 27 '16 20:11 marcusps

Ah, sure – just wanted to make sure it was clear in which scenario the several-fold speedup from my original post applies.

dnadlinger avatar Nov 27 '16 20:11 dnadlinger