Implement the OrdinaryDiffEq interface for Kets
TODO

- [x] enough broadcasting for `StateVector`s and Schroedinger equations to work naively with `OrdinaryDiffEq`
- [x] same for density matrices and Lindblad
- [ ] same for simple Monte Carlo
- [ ] semiclassical
- [ ] stochastic
- [ ] for all of them, the speed and allocations should be the same or better
- [ ] sparse states?
Actually converting the current solvers to use this simpler "no-recast" interface is out of scope for this pull request.
Old message:
The following now works:

```julia
using QuantumOptics
using DifferentialEquations

ℋ = SpinBasis(1//2)
σx = sigmax(ℋ)
↓ = s = spindown(ℋ)

schrod(ψ,p,t) = im * σx * ψ

t₀, t₁ = (0.0, pi)
Δt = 0.1
prob = ODEProblem(schrod, ↓, (t₀, t₁))
sol = solve(prob, Tsit5())
```
It works for Bras as well. It also works for in-place operations; however, there are spurious allocations due to inefficient broadcasting that ruin the performance.
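For reference, a Bra evolution might look like this (an illustrative sketch, not taken from the PR; since σx is Hermitian, the adjoint of dψ/dt = i σx ψ is d⟨ψ|/dt = -i ⟨ψ| σx):

```julia
ψbra = dagger(↓)                  # the Bra ⟨↓|
schrod_bra(ψ,p,t) = -im * ψ * σx  # adjoint equation for the Bra
prob_bra = ODEProblem(schrod_bra, ψbra, (t₀, t₁))
sol_bra = solve(prob_bra, Tsit5())
```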
These changes are self-contained and let you define `ODEProblem`s directly on the `StateVector` objects. However, the current in-place version actually makes more allocations than the naive out-of-place version... Something is quite wrong.
Here is a quick example:
```julia
t₀, t₁ = (0.0, pi)
ℋ = SpinBasis(1//2)
σx = sigmax(ℋ)
↓ = spindown(ℋ)
iσx = im * σx

schrod!(dψ,ψ,p,t) = mul!(dψ, iσx, ψ)
prob = ODEProblem(schrod!, ↓, (t₀, t₁))
@benchmark sol = solve(prob,DP5(),save_everystep=false) # allocates ~600 KiB

schrod(ψ,p,t) = iσx*ψ
prob = ODEProblem(schrod, ↓, (t₀, t₁))
@benchmark sol = solve(prob,DP5(),save_everystep=false) # allocates ~170 KiB, less than schrod! !?!?

@benchmark timeevolution.schroedinger([t₀,t₁], ↓, σx) # allocates ~12 KiB
```
Help with this would be appreciated :) I am fairly sure it is not due to the current broadcasting implementation (which does have some overhead, but it is a constant overhead, not something that scales with size). Any idea how to find what allocates so much?
Thanks for digging into this!
I think it's the broadcasting that ruins the performance here, though. Replacing the in-place broadcasting method at https://github.com/qojulia/QuantumOpticsBase.jl/blob/201b6305440b456b4b1936356e82735cfe6b2f90/src/states.jl#L242 with
```julia
@inline function Base.copyto!(dest::Ket{B}, bc::Broadcast.Broadcasted{Style,Axes,F,Args}) where {B<:Basis,Style<:KetStyle{B},Axes,F,Args}
    axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
    bc′ = Base.Broadcast.preprocess(dest, bc)
    dest′ = dest.data
    @inbounds @simd for I in eachindex(bc′)
        dest′[I] = bc′[I]
    end
    return dest
end

Base.getindex(st::StateVector, idx) = getindex(st.data, idx)
```
which is essentially the example Chris Rackauckas posted in your discourse thread: https://github.com/SciML/OrdinaryDiffEq.jl/blob/4a110ca8761845b3aa667a109583775c4c718132/test/interface/noindex_tests.jl#L23
Then I get
```
julia> @benchmark sol = solve($prob,$alg,save_everystep=false)
BenchmarkTools.Trial:
  memory estimate:  9.92 KiB
  allocs estimate:  133
  --------------
  minimum time:     14.402 μs (0.00% GC)
  median time:      16.446 μs (0.00% GC)
  mean time:        17.666 μs (3.53% GC)
  maximum time:     6.282 ms (99.16% GC)
  --------------
  samples:          10000
  evals/sample:     1
```
for the in-place version of your code. This is much better, but still not as good as `timeevolution.schroedinger`, and the difference scales with the problem size. So I guess we'd really need a cleaner and more performant implementation of broadcasting here.
FYI, you're also benchmarking in global scope and using some non-constant globals, which you should avoid in order to get proper results. Either put all the benchmarking inside a function, or make the globals constant (in this case you need `const iσx = im * σx` for `schrod!` to be type-stable) and interpolate global variables such as `prob` with a `$` when calling `@benchmark`.
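Concretely, the suggested setup would look something like this (a sketch restating the advice above, not code from the thread):

```julia
const iσx = im * σx                     # const global ⇒ schrod! is type-stable
schrod!(dψ,ψ,p,t) = mul!(dψ, iσx, ψ)
prob = ODEProblem(schrod!, ↓, (t₀, t₁))
@benchmark solve($prob, DP5(), save_everystep=false)  # $ interpolates the global prob
```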
Actually, the amount of additional allocations of the in-place version is constant, in that the "allocs estimate" is the same regardless of the problem size. The memory estimate increases, though, and is always higher than with `timeevolution.schroedinger`. What's odd is that the in-place version is actually faster. Changing to `ℋ = SpinBasis(20//2)`, I get for the in-place solve:
```
BenchmarkTools.Trial:
  memory estimate:  15.67 KiB
  allocs estimate:  145
  --------------
  minimum time:     101.143 μs (0.00% GC)
  median time:      109.095 μs (0.00% GC)
  mean time:        111.976 μs (0.94% GC)
  maximum time:     5.413 ms (97.68% GC)
  --------------
  samples:          10000
  evals/sample:     1
```
and for `timeevolution.schroedinger`:
```
BenchmarkTools.Trial:
  memory estimate:  13.62 KiB
  allocs estimate:  73
  --------------
  minimum time:     155.181 μs (0.00% GC)
  median time:      168.481 μs (0.00% GC)
  mean time:        170.973 μs (0.30% GC)
  maximum time:     5.366 ms (96.15% GC)
  --------------
  samples:          10000
  evals/sample:     1
```
Another chunk of performance is lost in the out-of-place broadcasting method. Using the following:
```julia
@inline function Base.copy(bc::Broadcast.Broadcasted{Style,Axes,F,Args}) where {B<:Basis,Style<:KetStyle{B},Axes,F,Args<:Tuple}
    bcf = Broadcast.flatten(bc)
    b = find_basis(bcf)
    T = find_dType(bcf)
    data = zeros(T, length(b))
    @inbounds @simd for I in eachindex(bcf)
        data[I] = bcf[I]
    end
    return Ket{B}(b, data)
end

for f ∈ [:find_basis, :find_dType]
    @eval ($f)(bc::Broadcast.Broadcasted) = ($f)(bc.args)
    @eval ($f)(args::Tuple) = ($f)(($f)(args[1]), Base.tail(args))
    @eval ($f)(x) = x
    @eval ($f)(::Any, rest) = ($f)(rest)
end
find_basis(a::StateVector, rest) = a.basis
find_dType(a::StateVector, rest) = eltype(a)
```
puts me really close in speed and allocations to your original example. The allocations also seem to scale a bit better: for my spin-20//2 example, I get 13.95 KiB vs. the 13.62 KiB from the `timeevolution` call.
I am a bit lost as to what is happening here. Could you explain what `Broadcast.preprocess` does? Also, what does indexing a `Broadcasted` object like `bcf[I]` do? I did not find documentation on these. Lastly, how come this still works with scalars? (I had to implement a `getdata` method that is just `x -> x` for scalars.)
I updated the pull request with the changes you explained and also added some extra tests. In some cases, it is curiously faster than the `timeevolution.schroedinger` method.
```julia
ℋ = SpinBasis(20//1)
↓ = spindown(ℋ)
t₀, t₁ = (0.0, pi)
const σx = sigmax(ℋ)
const iσx = im * σx
schrod!(dψ,ψ,p,t) = mul!(dψ, iσx, ψ)
prob! = ODEProblem(schrod!, ↓, (t₀, t₁))
```
```
julia> @benchmark sol = solve($prob!,DP5(),save_everystep=false)
BenchmarkTools.Trial:
  memory estimate:  22.67 KiB
  allocs estimate:  178
  --------------
  minimum time:     374.463 μs (0.00% GC)
  median time:      397.327 μs (0.00% GC)
  mean time:        406.738 μs (0.37% GC)
  maximum time:     4.386 ms (89.76% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark timeevolution.schroedinger([$t₀,$t₁], $↓, $σx)
BenchmarkTools.Trial:
  memory estimate:  23.34 KiB
  allocs estimate:  161
  --------------
  minimum time:     748.106 μs (0.00% GC)
  median time:      774.601 μs (0.00% GC)
  mean time:        786.933 μs (0.14% GC)
  maximum time:     4.459 ms (80.46% GC)
  --------------
  samples:          6350
  evals/sample:     1
```
Codecov Report
Merging #16 (4ca6500) into master (201b630) will increase coverage by 0.14%. The diff coverage is 77.19%.
```diff
@@            Coverage Diff             @@
##           master      #16      +/-   ##
==========================================
+ Coverage   93.61%   93.76%   +0.14%
==========================================
  Files          24       24
  Lines        2475     2486      +11
==========================================
+ Hits         2317     2331      +14
+ Misses        158      155       -3
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/superoperators.jl | 82.65% <ø> (ø) | |
| src/operators_dense.jl | 89.15% <70.00%> (-2.64%) | :arrow_down: |
| src/states.jl | 76.92% <81.08%> (+7.58%) | :arrow_up: |
I also updated the issue description with a short TODO. Does the current list sound reasonable to you? Next weekend I might have some time to make the listed changes, but feel free to jump in if you are impatient ;)
> I am a bit lost what is happening here. Could you explain what `Broadcast.preprocess` does? Also, what does indexing a `Broadcasted` object like `bcf[I]` do?
I got them from Chris Rackauckas' example, but had to look them up myself. If I understand correctly, the `preprocess` function does two things: it unaliases the destination from the source, and it precomputes the size of the output when broadcasting with arrays, which is sometimes beneficial for performance. Indexing `bcf[I]` propagates the index through the arguments and calls the broadcasting function on them, so that the element corresponding to `dest[I]` is computed.
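The unaliasing part can be illustrated with plain Base types (a minimal sketch, not from the PR):

```julia
x = [1.0, 2.0, 3.0]
# the reversed view aliases x, which is also the destination
bc = Broadcast.instantiate(Broadcast.broadcasted(+, x, view(x, 3:-1:1)))
bc′ = Base.Broadcast.preprocess(x, bc)
# preprocess copies the aliasing view out, so writing into x while reading
# from bc′ cannot corrupt the result; x itself (dest === arg) is left alone
```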
> Lastly, how come this still works with scalars (I had to implement a `getdata` method that is just `x -> x` for scalars)?
I'm not 100% sure what you're asking here, but I'll try to answer anyway. The `getdata` method was there because broadcasting was done on the `.data` fields, so the arguments were unwrapped before the broadcast was evaluated. This was quite bad for performance, so now we're just indexing a `StateVector` directly with the new `getindex` method. When indexing a broadcasted object like `bcf[I]`, the index simply gets ignored for scalars.
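For example, with plain Base types (a minimal sketch):

```julia
# a lazy, fused representation of [1.0, 2.0, 3.0] .* 2.0
bc = Broadcast.instantiate(Broadcast.broadcasted(*, [1.0, 2.0, 3.0], 2.0))
bc[3]  # == 6.0 — index 3 is propagated to the array argument, ignored for the scalar
```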
> I updated the pull request with the changes you explained and also added some extra tests. In some cases, it is curiously faster than the `timeevolution.schroedinger` method.
Well, we're actually comparing two different things. `timeevolution.schroedinger` creates the `ODEProblem` and then calls `solve` (with different tolerances and a `SavingCallback`), whereas we only look at `solve`. The real benchmark we want to compare against is a `solve` with pure Julia types:
```julia
const ix = iσx.data
schrod_data!(dψ,ψ,p,t) = mul!(dψ, ix, ψ)
u0 = (↓).data
prob_data! = ODEProblem(schrod_data!, u0, (t₀, t₁))
alg = DP5()
@benchmark solve($prob_data!, $alg, save_everystep=false)
```
Matching the performance of this (up to a constant overhead) is the goal.
> I also updated the issue description with a short TODO. Does the current list sound reasonable to you?
Those sound good! You could also split the broadcasting for `StateVector`s and `Operator`s into two separate PRs since they are self-contained, but you already started with the latter, so whichever you prefer is fine.
Finally, just for reference, we can in principle do the same for the `semiclassical` and `stochastic` modules, but that is definitely for another time.
@david-pl, I looked through the rest of the TODOs, but it seems most of that work is to be done in QuantumOptics, not in QOBase. Does it make sense to merge this PR now? I would then make an issue listing the rest of the TODOs and start working on them in a QuantumOptics pull request.
An exception to this is figuring out what is necessary in order to work with sparse state vectors, but that might be out of scope.
@Krastanov Before moving on to the other time evolution functions, we should make sure the ones that already work are performant. This needs to be done in QOBase. Currently, there are allocations and slowdowns happening when using `Ket`s and `Operator`s in `ODEProblem`s. I attempted a couple of different things to achieve the same performance as with pure Julia vectors, but didn't succeed so far. I still think that it's due to the broadcasting, but I'm not sure.
> An exception to this is figuring out what is necessary in order to work with sparse state vectors, but that might be out of scope.
I'd say this is definitely out of scope for now.
This has been stale for a long time. Is there any update on this?
@david-pl @Krastanov Additional allocations come from the fact that `recursivecopy` in RecursiveArrayTools calls `deepcopy` by default (https://github.com/SciML/RecursiveArrayTools.jl/blob/e0f1b1ddc98671ade78fb4fd89339224e145fb95/src/utils.jl#L9), which can be problematic in our case. So, if you write

```julia
RecursiveArrayTools.recursivecopy(x::Ket{B,A}) where {B,A} = typeof(x)(x.basis, copy(x.data))
RecursiveArrayTools.recursivecopy(x::Bra{B,A}) where {B,A} = typeof(x)(x.basis, copy(x.data))
```

and now run the example, we get
```julia
using QuantumOptics, DifferentialEquations, BenchmarkTools

ℋ = SpinBasis(20//1)
↓ = spindown(ℋ)
t₀, t₁ = (0.0, pi)
const σx = sigmax(ℋ)
const iσx = im*σx
schrod!(dψ, ψ, p, t) = QuantumOptics.mul!(dψ, iσx, ψ)
prob! = ODEProblem(schrod!, ↓, (t₀, t₁))
```
```
julia> @benchmark sol = solve($prob!,DP5(),save_everystep=false)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  161.750 μs … 202.166 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     165.834 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   166.360 μs ±   2.665 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

            ▃▅▇▄██▅▆▅▃▁
  ▁▁▁▁▂▃▃▄▅▇█████████████▇▆▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  162 μs          Histogram: frequency by time          176 μs <

 Memory estimate: 13.98 KiB, allocs estimate: 58.

julia> @benchmark timeevolution.schroedinger([$t₀,$t₁], $↓, $σx)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  283.708 μs … 622.208 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     290.334 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   293.181 μs ±  11.737 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▂▄▆▇███▇▆▆▅▄▃▃▃▂▂▂▂▂▁▁ ▁                                     ▃
  ▆██████████████████████████████████▇▇▆▇▆▇▅▇▅▆▆▆▅▅▄▆▄▄▄▄▄▅▄▂▄▅ █
  284 μs        Histogram: log(frequency) by time        340 μs <

 Memory estimate: 18.31 KiB, allocs estimate: 88.
```
This leads to a pretty significant performance improvement.
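To see why the `deepcopy` default was so costly, here is a minimal check (a sketch, not from the PR): `deepcopy` duplicates even the basis object on every internal copy the solver makes, while the methods above share the basis and copy only the data.

```julia
ψ = spindown(SpinBasis(20//1))
deepcopy(ψ).basis === ψ.basis                            # false — the basis was duplicated
RecursiveArrayTools.recursivecopy(ψ).basis === ψ.basis   # true — with the methods above
```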
It's also now pretty comparable to using `solve` with pure Julia types:
```julia
const ix = iσx.data
schrod_data!(dψ,ψ,p,t) = QuantumOptics.mul!(dψ, ix, ψ)
u0 = (↓).data
prob_data! = ODEProblem(schrod_data!, u0, (t₀, t₁))
```

```
julia> @benchmark solve($prob_data!, DP5(), save_everystep=false)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  87.125 μs … 199.208 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     90.333 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   91.042 μs ±   3.810 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▁▁▄▅▇█▇▅▂▁
  ▂▂▃▄▆███████████▆▆▆▆▅▅▅▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▁▂▂▂▂▂ ▄
  87.1 μs        Histogram: frequency by time          104 μs <

 Memory estimate: 13.84 KiB, allocs estimate: 47.
```
@apkille thanks for taking a look at this! The allocations from `deepcopy` are a great find.
However, I don't think we're quite there yet. The first benchmark you mention is again comparing `timeevolution.schroedinger` with `DiffEq.solve`. The former does a lot of extra steps to build the `ODEProblem`, which is done repeatedly in a benchmark, so that's not really a fair comparison.

Now for the second one: this is the benchmark we need to look at, since it really measures the impact of using QO.jl types.
> It's also now pretty comparable to using `solve` with pure Julia types
Well, if I read the timings correctly, the time it takes almost doubles. The additional time cost may be okay if it's a constant offset, but I previously saw that it scales with system size, meaning you'd be significantly slowing down time evolution computations here. Could you maybe check the behaviour with increasing system size again? Also, as @Krastanov mentioned in an earlier comment, it should be possible to do this in Julia with zero cost. It just seems to be a bit tricky to actually get there.
@david-pl If we increase the problem size with `ℋ = SpinBasis(100//1)`, for QO.jl types we get
```
julia> @benchmark sol = solve($prob!,DP5(),save_everystep=false)
BenchmarkTools.Trial: 2008 samples with 1 evaluation.
 Range (min … max):  2.450 ms …  2.684 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.478 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.487 ms ± 29.602 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▅█▆▄▃▂
  ▂▄█████████▆▆▆▄▄▅▆▅▅▅▅▆▆▆▆▅▅▅▃▄▃▃▃▂▂▂▂▁▂▂▂▁▂▁▂▁▁▁▂▁▂▁▁▁▁▂▁ ▃
  2.45 ms        Histogram: frequency by time        2.58 ms <

 Memory estimate: 53.83 KiB, allocs estimate: 58.
```
and for pure Julia types we get
```
julia> @benchmark solve($prob_data!, DP5(), save_everystep=false)
BenchmarkTools.Trial: 2354 samples with 1 evaluation.
 Range (min … max):  2.101 ms …  2.433 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.119 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.123 ms ± 19.607 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁▆█▇██▆▆▇▅▂▁
  ▂▃▅█████████████▅▅▄▄▃▃▃▂▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▂▃▂▂▂▂▂▃▂▂▂▁▂▂▂▂▂▂▂ ▄
  2.1 ms         Histogram: frequency by time        2.21 ms <

 Memory estimate: 56.34 KiB, allocs estimate: 47.
```
So the allocations are the same, but there is still a time difference, although the time no longer almost doubles, as was the case for `ℋ = SpinBasis(20//1)`. Also, surprisingly, the memory estimate for QO types is smaller than for Julia types.

Anyway, I was mainly trying to find the big culprit behind the additional allocations in the above comment about `deepcopy`. I can do a deeper search for smaller allocations and try to whittle it down further so that using DiffEq with QO types gives zero cost.
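One option for that deeper search (a sketch; the allocation profiler requires Julia ≥ 1.8) is to record a stack trace for every allocation during a solve:

```julia
using Profile
Profile.Allocs.@profile sample_rate=1 solve(prob!, DP5(), save_everystep=false)
results = Profile.Allocs.fetch()  # inspect, e.g., with PProf.jl via PProf.Allocs.pprof(results)
```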
@apkille brilliant, thanks for digging into this. I'd love to merge this so we have native DiffEq support ;)
@david-pl Here's another decent improvement in allocations and memory, which comes from also defining a `recursivecopy` method for abstract arrays of state vectors:

```julia
RecursiveArrayTools.recursivecopy(x::AbstractArray{T}) where {T<:StateVector} = copy(x)
```
Using the example from above with `ℋ = SpinBasis(100//1)`, we get the following with QO types:
```
julia> @benchmark sol = solve($prob!,DP5(),save_everystep=false)
BenchmarkTools.Trial: 2012 samples with 1 evaluation.
 Range (min … max):  2.447 ms …  2.696 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.474 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.485 ms ± 29.522 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▆█▇██▆▃▂
  ▂▄█████████▇▆▇▆▆▆▇▆▇▆██▇▇█▇▇▅▅▃▄▃▃▃▄▃▃▂▃▂▃▃▃▂▂▃▂▂▃▂▂▂▁▂▂▂▂ ▄
  2.45 ms        Histogram: frequency by time        2.59 ms <

 Memory estimate: 49.86 KiB, allocs estimate: 49.
```
So the allocations are now pretty much the same, and the memory is much better for QO types, although the mean time is greater by ~0.3 ms. There's probably one or two allocations in `Base.copy(bc::Broadcast.Broadcasted{Style,Axes,F,Args})` when we create the zeros array, but besides that I don't think the broadcasting has any effect on performance.
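A quick way to check that suspicion in isolation (a sketch using BenchmarkTools; assumes the out-of-place broadcasting for `Ket`s from this PR):

```julia
using BenchmarkTools
ψ = spindown(SpinBasis(100//1))
@ballocated $ψ .+ $ψ  # allocations of a single out-of-place broadcast,
                      # including the zeros(T, length(b)) inside Base.copy
```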
> @apkille brilliant, thanks for digging into this. I'd love to merge this so we have native DiffEq support ;)
Me too. I'm happy to continue this work for other modules in QO.jl as well. I think a lot of users would find it quite useful, and I geek out on all of the SciML features :)
> There's probably one or two allocations in `Base.copy(bc::Broadcast.Broadcasted{Style,Axes,F,Args})` when we create the zeros array, but besides that I don't think the broadcasting has any effect on performance.
Yeah, there's probably some limitation on what we can do with broadcasting while keeping track of the bases of the state.
Anyway, I thought a bit more about this, and maybe your version is already more efficient when used in `timeevolution_base.jl`. While the comparison is now fair and we'd like to minimize the difference, there's still some extra stuff we do internally which slightly differs from the pure Julia implementation. E.g., we use `recast!` to switch between QO types and vectors. The performance impact shouldn't be large, but it may be worth checking whether you can get a speedup of `timeevolution.X` when using the QO-types version in `solve` without `recast!` and such.
> Yeah, there's probably some limitation on what we can do with broadcasting while keeping track of the bases of the state.
I agree. Comparing QO types against pure Julia types is certainly the right approach, but I would be surprised if we ever get zero performance difference between the two, simply because of the additional (yet necessary) steps for working with QO types. That said, I'm all for making things efficient.
> Anyway, I thought a bit more about this, and maybe your version is already more efficient when used in `timeevolution_base.jl`. While the comparison is now fair and we'd like to minimize the difference, there's still some extra stuff we do internally which slightly differs from the pure Julia implementation. E.g., we use `recast!` to switch between QO types and vectors. The performance impact shouldn't be large, but it may be worth checking whether you can get a speedup of `timeevolution.X` when using the QO-types version in `solve` without `recast!` and such.
I'll make a new PR that looks into this.
> but I would be surprised if we ever get zero performance difference between the two
You're probably right, but I still think it should be possible in principle to get the same performance here. I mean, a `Ket` is just a vector that keeps track of the basis in which it is defined. Why shouldn't we get the same speed in a time evolution as you'd get with just a vector?

But yes, there are a bunch of technical details that cause an overhead here, I suppose.
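Schematically, that is all a `Ket` carries (simplified; see src/states.jl for the actual definition):

```julia
struct Ket{B<:Basis, T<:AbstractVector} <: StateVector{B, T}
    basis::B  # the only extra information on top of a plain vector
    data::T
end
```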
> I'll make a new PR that looks into this.
Awesome, thanks!
Closed in favor of #172.