Significant overhead when using GPU and ITensors
Hi, I am simulating a quantum dynamical system using the great ITensorMPS.jl package (https://github.com/ITensor/ITensorMPS.jl). Without getting into the details of this package or the specific computation, I want to point out a seemingly significant computational overhead when running a subroutine of the one-site TDVP (1TDVP) algorithm that uses "exponentiate" with a GPU backend (it is called exactly here: https://github.com/ITensor/ITensorMPS.jl/blob/main/src/solvers/tdvp.jl).
This algorithm solves the (time-dependent) Schrödinger equation by projecting the Hamiltonian onto a single MPS site. The "exponentiate" function is then used to solve the resulting single-site effective equation. In principle, any ODE solver could handle the effective equation, but in practice it is usually solved with exponentiate.
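For context, the core of that step is essentially a Krylov exponentiation of the projected operator. Here is a minimal sketch of the kind of call involved (the dense H_eff below is only a stand-in for the projected Hamiltonian, not the actual ITensorMPS code):

using KrylovKit: exponentiate
using LinearAlgebra

d = 64                        # dimension of the single-site effective problem
A = randn(ComplexF64, d, d)
H_eff = (A + A') / 2          # Hermitian stand-in for the projected Hamiltonian
v0 = randn(ComplexF64, d)     # current site tensor, flattened to a vector
dt = 0.1

# exponentiate only needs the action of the operator on a vector, so the same
# call works when that action is a tensor contraction acting on GPU data.
vt, info = exponentiate(x -> H_eff * x, -im * dt, v0; tol=1e-12)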
Is the significant overhead with the GPU backend compared with the CPU backend expected or known?
Hi!
It would be really interesting if you could share a profiler run of what you are trying to achieve. In principle, KrylovKit should be compatible with GPU backends, with the caveat that the linear algebra on the small matrices characterizing the Krylov subspace is still performed on the CPU. Depending on the interplay of these packages, it might be that a large number of accidental transfers between CPU and GPU memory is hurting performance, that something else is going wrong, or that there really is some overhead that cannot be avoided. Without further information, it's hard to give a better answer.
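In the meantime, CUDA.jl's own timing and profiling macros can already give a first impression of GPU allocations and host-device copies. A rough sketch (run_simulation is just a placeholder for whatever call you want to inspect):

using CUDA

# Placeholder for the actual computation, e.g. a single tdvp time step.
run_simulation() = sum(CUDA.rand(1024, 1024) * CUDA.rand(1024, 1024))

run_simulation()                # run once to compile
CUDA.@time run_simulation()     # reports GPU allocations alongside the usual CPU numbers
CUDA.@profile run_simulation()  # in recent CUDA.jl versions: summary of kernels and memory copies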
Thanks for the answer; it made me check my statement more closely.
At the end of this post you can find code built on example no. 1 from ITensorMPS.jl (https://github.com/ITensor/ITensorMPS.jl/blob/main/examples/solvers/01_tdvp.jl). The code performs imaginary time evolution of a 1D chain of spins governed by a Heisenberg Hamiltonian. Whether the CPU or the GPU backend is favorable depends on the maximal bond dimension, which in turn depends on the number of spins (n in the code). For larger problems (larger bond dimensions) the GPU performs better, as expected.
I have not tried to run a profile yet, but I will get to that soon.
It would be interesting to hear an opinion from someone more experienced with these considerations: for a given problem size, say n=20 and maxdim=100, does the GPU backend show the expected speedup over the CPU? On my laptop, a GPU time step took roughly 0.75 of the CPU time (about 3 s compared with 4 s).
Here's the code:
using ITensorMPS: MPS, MPO, OpSum, dmrg, inner, random_mps, siteinds, tdvp, normalize!
using CUDA: cu

function main(; n=4, device=identity, maxdim=30, updater_kwargs=(;))
    # This function simulates imaginary time evolution of the Heisenberg Hamiltonian.
    s = siteinds("S=1/2", n)
    t_stop = 10.0
    t_step = 1.0
    if device == cu
        t_stop = Float32(t_stop)
        t_step = Float32(t_step)
    end
    function heisenberg(n)
        os = OpSum()
        for j in 1:(n - 1)
            os += 0.5, "S+", j, "S-", j + 1
            os += 0.5, "S-", j, "S+", j + 1
            os += "Sz", j, "Sz", j + 1
        end
        return os
    end
    H = device(MPO(heisenberg(n), s))
    ψ = device(MPS(s, [k % 2 == 0 ? "↑" : "↓" for k in 1:n]))
    e_init = inner(ψ', H, ψ) / inner(ψ, ψ)
    @show e_init
    ϕ = tdvp(
        H,
        -t_stop,
        ψ;
        time_step=-t_step,
        maxdim=maxdim,
        cutoff=1e-17,
        normalize=true,
        reverse_step=false,
        outputlevel=1,
        updater_backend="exponentiate",
        updater_kwargs=updater_kwargs,
    )
    e_tdvp = inner(ϕ', H, ϕ) / inner(ϕ, ϕ)
    @show e_tdvp
    e_dmrg, _ = dmrg(H, ψ; nsweeps=10, maxdim=20, cutoff=1e-10, outputlevel=0)
    @show e_dmrg
    return nothing
end

n = 4 # number of spins
maxdim = 30 # maximal bond dimension of the MPS during tdvp sweeps
updater_backend = "CPU" # compute device for the simulation, choose between "GPU" and "CPU"
device = updater_backend == "CPU" ? identity : cu
@time main(; n=n, device=device, maxdim=maxdim, updater_kwargs=(;))
Independently of this issue, @XingyuZhang2018 has also been looking at the performance of KrylovKit with CUDA / GPU data and found it not to meet expectations. This probably has nothing to do with ITensors, and I would like to reorient this issue towards discussing GPU performance.
A first attempted explanation for the possibly suboptimal performance was the observation that dot has terrible performance on nested arrays, as indicated by the following timings:
julia> @testset "dot $atype" for atype in [Array, CuArray]
Random.seed!(100)
N = 10^2
A = atype(rand(ComplexF64, N,N))
B = [A]
C = [B]
D = [C]
@btime CUDA.@sync dot($A, $A)
@btime CUDA.@sync dot($B, $B)
@btime CUDA.@sync dot($C, $C)
@btime CUDA.@sync dot($D, $D)
end
2.876 μs (0 allocations: 0 bytes)
5.111 μs (0 allocations: 0 bytes)
9.600 μs (0 allocations: 0 bytes)
18.579 μs (0 allocations: 0 bytes)
Test Summary: | Total Time
dot Array | 0 21.6s
16.030 μs (18 allocations: 304 bytes)
36.399 μs (36 allocations: 608 bytes)
75.128 μs (72 allocations: 1.19 KiB)
161.297 μs (144 allocations: 2.38 KiB)
Test Summary: | Total Time
dot CuArray | 0 29.9s
However, that behaviour is the same for Array and CuArray alike, and furthermore it cannot be the cause, since KrylovKit.jl uses VectorInterface.inner. inner calls dot when acting on DenseArray{<:BlasFloat} types, but has a custom definition for all other arrays, which does not seem to suffer from this "bug" in dot:
julia> using VectorInterface
julia> @testset "dot $atype" for atype in [Array, CuArray]
Random.seed!(100)
N = 10^2
A = atype(rand(ComplexF64, N,N))
B = [A]
C = [B]
D = [C]
@btime CUDA.@sync inner($A, $A)
@btime CUDA.@sync inner($B, $B)
@btime CUDA.@sync inner($C, $C)
@btime CUDA.@sync inner($D, $D)
end
3.021 μs (0 allocations: 0 bytes)
3.073 μs (2 allocations: 80 bytes)
3.106 μs (4 allocations: 160 bytes)
3.134 μs (6 allocations: 240 bytes)
Test Summary: | Total Time
dot Array | 0 15.0s
16.381 μs (18 allocations: 304 bytes)
15.587 μs (20 allocations: 384 bytes)
15.949 μs (22 allocations: 464 bytes)
15.470 μs (24 allocations: 544 bytes)
Test Summary: | Total Time
dot CuArray | 0 27.2s
2-element Vector{Any}:
Test.DefaultTestSet("dot Array", Any[], 0, false, false, true, 1.732188137110498e9, 1.732188152078698e9, false, "REPL[11]")
Test.DefaultTestSet("dot CuArray", Any[], 0, false, false, true, 1.732188152078942e9, 1.732188179328664e9, false, "REPL[11]")
Actually, it would still be interesting to see where the allocations are coming from in those cases, but they do not seem to affect timings significantly.
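One way to chase those allocations down is Julia's allocation profiler (a sketch, assuming Julia 1.8 or newer; the resulting profile can be inspected with PProf or the VS Code profile viewer):

using Profile, CUDA, VectorInterface

B = [CuArray(rand(ComplexF64, 100, 100))]

inner(B, B)  # run once first, so compilation allocations do not dominate the profile
Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=1 inner(B, B)
results = Profile.Allocs.fetch()  # each entry in results.allocs carries a type, size and stack trace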
I found this presentation which is potentially interesting: https://cerfacs.fr/wp-content/uploads/2016/03/tomas.pdf
This discusses the case where BLAS level-2 operations are used to orthogonalise the Krylov vectors. KrylovKit.jl doesn't even use BLAS level 2 but only level 1, because of how it stores vectors in general, but it might be useful to specialise to different behaviour for the case where the vectors are of type DenseArray{<:BlasFloat}. However, this probably needs to be done together with supporting linear operators that act in place, and would require a significant undertaking.
However, as indicated in these slides, even the BLAS level 2 operations are suboptimal. It might be that we want to write some custom kernels for CuArray vectors, but that is not something I have experience with.
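To make the distinction concrete, the two styles of orthogonalisation look roughly as follows (a sketch on plain dense arrays, not KrylovKit's actual internals; V holds the current Krylov vectors as columns):

using LinearAlgebra

# BLAS level-1 style: one dot and one axpy! per existing Krylov vector.
function orthogonalize_level1!(w, V, k)
    for i in 1:k
        c = dot(view(V, :, i), w)
        axpy!(-c, view(V, :, i), w)
    end
    return w
end

# BLAS level-2 style (classical Gram-Schmidt): two gemv calls over the whole
# block, which maps much better onto a GPU.
function orthogonalize_level2!(w, V, k)
    Vk = view(V, :, 1:k)
    c = Vk' * w            # gemv
    mul!(w, Vk, c, -1, 1)  # w .-= Vk * c, fused via the 5-argument mul!
    return w
end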
The problem can be seen directly in this easy example:
@testset "eigsolve $atype" for atype in [Array, CuArray]
Random.seed!(100)
N = 10^3
A = [atype(rand(ComplexF64, N, N)) for i in 1:4]
v0 = [atype(rand(ComplexF64, N)) for i in 1:4]
linearmap(v) = A .* v
@btime CUDA.@sync inner($v0, $v0)
@btime CUDA.@sync $linearmap($v0)
@btime CUDA.@sync λs, vs = eigsolve(v -> $linearmap(v), $v0, 1, :LM)
end
855.385 ns (2 allocations: 128 bytes)
613.000 μs (14 allocations: 62.84 KiB)
23.762 ms (3015 allocations: 4.40 MiB)
Test Summary: | Total Time
eigsolve Array | 0 25.3s
142.500 μs (58 allocations: 1.06 KiB)
72.000 μs (94 allocations: 2.03 KiB)
4.707 s (124214 allocations: 2.22 MiB)
Test Summary: | Total Time
eigsolve CuArray | 0 47.3s
The linearmap is faster on the GPU, but inner and eigsolve are much slower.
I guess that inner is also not that easy to implement performantly on a GPU, since it requires a reduction rather than a lot of parallel multiplications. In that sense, I would guess there is more to be gained from a specialized kernel than from the simple mapreduce call currently in VectorInterface.
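As a small illustration of that gap, one can compare the fused CUBLAS reduction against a generic broadcast-plus-sum reduction on the same data (illustrative only, not VectorInterface's actual fallback):

using CUDA, LinearAlgebra, BenchmarkTools

x = CuArray(rand(ComplexF64, 10^3))
y = CuArray(rand(ComplexF64, 10^3))

# Single fused library call (CUBLAS dotc).
@btime CUDA.@sync dot($x, $y)

# Generic reduction: an elementwise kernel plus a temporary plus a sum kernel,
# which is the kind of overhead a specialized kernel would avoid.
@btime CUDA.@sync sum(conj.($x) .* $y)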
That is a great test case to focus our attention on!
Is the fact that you have several vectors wrapped in a list essential to this, or is it already manifest with just a single CuArray vector and CuArray matrix?
Yes, a single CuArray vector shows the same behavior, even for a block CuArray; I think the main overhead comes from the allocations of dot on the GPU.
I think the reason is that a vector-vector inner product with N ~ 10^3 is simply too small a computation to benefit from the GPU, so the CPU can be faster in these cases; it is not that "Array of Array is slow".
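A quick way to see that latency floor, and where the CPU/GPU crossover sits, is to scan N directly. A rough sketch:

using CUDA, LinearAlgebra, BenchmarkTools

for N in (10^3, 10^4, 10^5, 10^6)
    x_cpu = rand(ComplexF64, N)
    x_gpu = CuArray(x_cpu)
    t_cpu = @belapsed dot($x_cpu, $x_cpu)
    t_gpu = @belapsed CUDA.@sync dot($x_gpu, $x_gpu)
    println("N = $N: CPU $(round(t_cpu * 1e6; digits=2)) μs, GPU $(round(t_gpu * 1e6; digits=2)) μs")
end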
In my benchmark, A * v takes nearly the same time as [A] .* [v].
julia> @testset "bare eigsolve $atype" for atype in [Array, CuArray]
Random.seed!(100)
N = 10^3
A = atype(rand(ComplexF64, N, N))
v0 = atype(rand(ComplexF64, N))
linearmap(v) = A * v
@btime CUDA.@sync inner($v0, $v0)
@btime CUDA.@sync $linearmap($v0)
@btime CUDA.@sync λs, vs = eigsolve(v -> $linearmap(v), $v0, 1, :LM)
end
1.003 μs (0 allocations: 0 bytes)
45.169 μs (3 allocations: 15.69 KiB)
2.713 ms (352 allocations: 1.06 MiB)
Test Summary: | Total Time
bare eigsolve Array | 0 25.9s
15.950 μs (18 allocations: 304 bytes)
28.550 μs (26 allocations: 544 bytes)
27.160 ms (35397 allocations: 692.83 KiB)
Test Summary: | Total Time
bare eigsolve CuArray | 0 44.2s
2-element Vector{Any}:
Test.DefaultTestSet("bare eigsolve Array", Any[], 0, false, false, true, 1.736115066765964e9, 1.736115092664603e9, false, "REPL[3]")
Test.DefaultTestSet("bare eigsolve CuArray", Any[], 0, false, false, true, 1.736115092672029e9, 1.736115136886287e9, false, "REPL[3]")
julia> @testset "eigsolve $atype" for atype in [Array, CuArray]
Random.seed!(100)
N = 10^3
A = [atype(rand(ComplexF64, N, N)) for i in 1:1]
v0 = [atype(rand(ComplexF64, N)) for i in 1:1]
linearmap(v) = A .* v
@btime CUDA.@sync inner($v0, $v0)
@btime CUDA.@sync $linearmap($v0)
@btime CUDA.@sync λs, vs = eigsolve(v -> $linearmap(v), $v0, 1, :LM)
end
1.290 μs (2 allocations: 80 bytes)
45.370 μs (5 allocations: 15.75 KiB)
2.848 ms (2369 allocations: 1.13 MiB)
Test Summary: | Total Time
eigsolve Array | 0 21.3s
16.080 μs (20 allocations: 384 bytes)
28.660 μs (28 allocations: 608 bytes)
28.375 ms (37781 allocations: 776.12 KiB)
My suggestion is to consider preallocating U, the matrix whose columns span the Krylov space, initialized with zeros, and updating it in place, which would eliminate all of the allocations. When performing Gram-Schmidt, one can simply compute v = v - U*(U'*v), which uses BLAS level 2 instead of level 1 and is much more GPU friendly.
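In code, that suggestion would look roughly like this (a sketch, not KrylovKit's actual implementation; U stores the Krylov vectors as columns and is preallocated once):

using LinearAlgebra, CUDA

n, m = 10^3, 30                   # vector length and maximal Krylov dimension
U = CUDA.zeros(ComplexF64, n, m)  # preallocated Krylov basis, columns filled during the iteration
c = CUDA.zeros(ComplexF64, m)     # preallocated coefficient buffer

# Orthogonalize v in place against the first k columns of U with two gemv calls
# (BLAS level 2) instead of k separate dot/axpy pairs (level 1).
function orthogonalize!(v, U, c, k)
    Uk = view(U, :, 1:k)
    ck = view(c, 1:k)
    mul!(ck, Uk', v)        # ck = Uk' * v
    mul!(v, Uk, ck, -1, 1)  # v .-= Uk * ck, via the non-allocating 5-argument mul!
    return v
end

v = CuArray(rand(ComplexF64, n))
U[:, 1] .= v ./ norm(v)           # pretend one Krylov vector has already been stored
orthogonalize!(v, U, c, 1)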
The eigsolve result specifically is rather confusing to me, let me profile it and see what pops up.