Significant overhead when using GPU and ITensors
Hi, I am simulating a quantum dynamical system using the great ITensorMPS.jl package (https://github.com/ITensor/ITensorMPS.jl). Without getting into the details of this package or the specific computation, I want to point out a seemingly significant computational overhead when running a subroutine of the one-site TDVP (1TDVP) algorithm that uses "exponentiate" with a GPU backend (it is called exactly here: https://github.com/ITensor/ITensorMPS.jl/blob/main/src/solvers/tdvp.jl).
This algorithm solves the (time-dependent) Schrödinger equation by projecting the Hamiltonian onto a single MPS site. The "exponentiate" function is then used to solve the resulting single-site effective equation. In principle, any ODE solver could handle the effective equation, but in practice it is usually solved with exponentiate.
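For context, the core of that step is essentially a Krylov exponentiation of the projected operator. Here is a minimal sketch of the kind of call involved (the dense H_eff below is only a stand-in for the projected Hamiltonian, not the actual ITensorMPS code):

using KrylovKit: exponentiate
using LinearAlgebra

d = 64                        # dimension of the single-site effective problem
A = randn(ComplexF64, d, d)
H_eff = (A + A') / 2          # Hermitian stand-in for the projected Hamiltonian
v0 = randn(ComplexF64, d)     # current site tensor, flattened to a vector
dt = 0.1

# exponentiate only needs the action of the operator on a vector, so the same
# call works when that action is a tensor contraction acting on GPU data.
vt, info = exponentiate(x -> H_eff * x, -im * dt, v0; tol=1e-12)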
Is the significant overhead with the GPU backend compared with the CPU backend expected or known?
Hi!
It would be really interesting if you could share a profiler run of what you are trying to achieve. In principle, KrylovKit should be compatible with GPU backends, with the caveat that the linear algebra on the small matrices characterizing the Krylov subspace is still performed on the CPU. Depending on the interplay of these packages, it might be that a large number of accidental transfers between CPU and GPU memory is hurting performance, that something else is going wrong, or that there really is some overhead that cannot be avoided. Without further information, it's hard to give a better answer.
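In the meantime, CUDA.jl's own timing and profiling macros can already give a first impression of GPU allocations and host-device copies. A rough sketch (run_simulation is just a placeholder for whatever call you want to inspect):

using CUDA

# Placeholder for the actual computation, e.g. a single tdvp time step.
run_simulation() = sum(CUDA.rand(1024, 1024) * CUDA.rand(1024, 1024))

run_simulation()                # run once to compile
CUDA.@time run_simulation()     # reports GPU allocations alongside the usual CPU numbers
CUDA.@profile run_simulation()  # in recent CUDA.jl versions: summary of kernels and memory copies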
Thanks for the answer; it made me check my statement more closely.
At the end of this post you can find code built on example no. 1 from ITensorMPS.jl (https://github.com/ITensor/ITensorMPS.jl/blob/main/examples/solvers/01_tdvp.jl). The code performs imaginary time evolution of a 1D chain of spins governed by a Heisenberg Hamiltonian. Whether the CPU or the GPU backend is favorable depends on the maximal bond dimension, which in turn depends on the number of spins (n in the code). For larger problems (larger bond dimensions) the GPU performs better, as expected.
I have not tried to run a profile yet, but I will get to that soon.
It would be interesting to hear an opinion from someone more experienced with these considerations: for a given problem size, say n=20 and maxdim=100, does the GPU backend show the expected speedup over the CPU? On my laptop, a GPU time step took roughly 0.75 of the CPU time (about 3 s compared with 4 s).
Here's the code:
using ITensorMPS: MPS, MPO, OpSum, dmrg, inner, random_mps, siteinds, tdvp, normalize!
using CUDA: cu

function main(; n=4, device=identity, maxdim=30, updater_kwargs=(;))
    # This function simulates imaginary time evolution of the Heisenberg Hamiltonian.
    s = siteinds("S=1/2", n)
    t_stop = 10.0
    t_step = 1.0
    if device == cu
        t_stop = Float32(t_stop)
        t_step = Float32(t_step)
    end
    function heisenberg(n)
        os = OpSum()
        for j in 1:(n - 1)
            os += 0.5, "S+", j, "S-", j + 1
            os += 0.5, "S-", j, "S+", j + 1
            os += "Sz", j, "Sz", j + 1
        end
        return os
    end
    H = device(MPO(heisenberg(n), s))
    ψ = device(MPS(s, [k % 2 == 0 ? "↑" : "↓" for k in 1:n]))
    e_init = inner(ψ', H, ψ) / inner(ψ, ψ)
    @show e_init
    ϕ = tdvp(
        H,
        -t_stop,
        ψ;
        time_step=-t_step,
        maxdim=maxdim,
        cutoff=1e-17,
        normalize=true,
        reverse_step=false,
        outputlevel=1,
        updater_backend="exponentiate",
        updater_kwargs=updater_kwargs,
    )
    e_tdvp = inner(ϕ', H, ϕ) / inner(ϕ, ϕ)
    @show e_tdvp
    e_dmrg, _ = dmrg(H, ψ; nsweeps=10, maxdim=20, cutoff=1e-10, outputlevel=0)
    @show e_dmrg
    return nothing
end

n = 4 # number of spins
maxdim = 30 # maximal bond dimension of the MPS during tdvp sweeps
updater_backend = "CPU" # compute device for the simulation, choose between "GPU" and "CPU"
device = updater_backend == "CPU" ? identity : cu
@time main(; n=n, device=device, maxdim=maxdim, updater_kwargs=(;))
Independently of this issue, @XingyuZhang2018 has also been looking at the performance of KrylovKit with CUDA / GPU data and found it not to meet expectations. This probably has nothing to do with ITensors, and I would like to reorient this issue towards discussing GPU performance.
A first attempted explanation for the possibly suboptimal performance was the observation that dot has terrible performance on nested arrays, as indicated by the following timings:
julia> @testset "dot $atype" for atype in [Array, CuArray]
Random.seed!(100)
N = 10^2
A = atype(rand(ComplexF64, N,N))
B = [A]
C = [B]
D = [C]
@btime CUDA.@sync dot($A, $A)
@btime CUDA.@sync dot($B, $B)
@btime CUDA.@sync dot($C, $C)
@btime CUDA.@sync dot($D, $D)
end
2.876 μs (0 allocations: 0 bytes)
5.111 μs (0 allocations: 0 bytes)
9.600 μs (0 allocations: 0 bytes)
18.579 μs (0 allocations: 0 bytes)
Test Summary: | Total Time
dot Array | 0 21.6s
16.030 μs (18 allocations: 304 bytes)
36.399 μs (36 allocations: 608 bytes)
75.128 μs (72 allocations: 1.19 KiB)
161.297 μs (144 allocations: 2.38 KiB)
Test Summary: | Total Time
dot CuArray | 0 29.9s
However, that behaviour is the same for Array and CuArray alike, and furthermore it cannot be the cause, since KrylovKit.jl uses VectorInterface.inner. inner calls dot when acting on DenseArray{<:BlasFloat} types, but has a custom definition for all other arrays, which does not seem to suffer from this "bug" in dot:
julia> using VectorInterface
julia> @testset "dot $atype" for atype in [Array, CuArray]
Random.seed!(100)
N = 10^2
A = atype(rand(ComplexF64, N,N))
B = [A]
C = [B]
D = [C]
@btime CUDA.@sync inner($A, $A)
@btime CUDA.@sync inner($B, $B)
@btime CUDA.@sync inner($C, $C)
@btime CUDA.@sync inner($D, $D)
end
3.021 μs (0 allocations: 0 bytes)
3.073 μs (2 allocations: 80 bytes)
3.106 μs (4 allocations: 160 bytes)
3.134 μs (6 allocations: 240 bytes)
Test Summary: | Total Time
dot Array | 0 15.0s
16.381 μs (18 allocations: 304 bytes)
15.587 μs (20 allocations: 384 bytes)
15.949 μs (22 allocations: 464 bytes)
15.470 μs (24 allocations: 544 bytes)
Test Summary: | Total Time
dot CuArray | 0 27.2s
2-element Vector{Any}:
Test.DefaultTestSet("dot Array", Any[], 0, false, false, true, 1.732188137110498e9, 1.732188152078698e9, false, "REPL[11]")
Test.DefaultTestSet("dot CuArray", Any[], 0, false, false, true, 1.732188152078942e9, 1.732188179328664e9, false, "REPL[11]")
Actually, it would still be interesting to see where the allocations are coming from in those cases, but they do not seem to affect timings significantly.
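One way to chase those allocations down is Julia's allocation profiler (a sketch, assuming Julia 1.8 or newer; the resulting profile can be inspected with PProf or the VS Code profile viewer):

using Profile, CUDA, VectorInterface

B = [CuArray(rand(ComplexF64, 100, 100))]

inner(B, B)  # run once first, so compilation allocations do not dominate the profile
Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=1 inner(B, B)
results = Profile.Allocs.fetch()  # each entry in results.allocs carries a type, size and stack trace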
I found this presentation which is potentially interesting: https://cerfacs.fr/wp-content/uploads/2016/03/tomas.pdf
This discusses the case where BLAS level-2 operations are used to orthogonalise the Krylov vectors. KrylovKit.jl doesn't even use BLAS level 2 but only level 1, because of how it stores vectors in general, but it might be useful to specialise to different behaviour for the case where the vectors are of type DenseArray{<:BlasFloat}. However, this probably needs to be done together with supporting linear operators that act in place, and would require a significant undertaking.
However, as indicated in these slides, even the BLAS level 2 operations are suboptimal. It might be that we want to write some custom kernels for CuArray vectors, but that is not something I have experience with.
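To make the distinction concrete, the two styles of orthogonalisation look roughly as follows (a sketch on plain dense arrays, not KrylovKit's actual internals; V holds the current Krylov vectors as columns):

using LinearAlgebra

# BLAS level-1 style: one dot and one axpy! per existing Krylov vector.
function orthogonalize_level1!(w, V, k)
    for i in 1:k
        c = dot(view(V, :, i), w)
        axpy!(-c, view(V, :, i), w)
    end
    return w
end

# BLAS level-2 style (classical Gram-Schmidt): two gemv calls over the whole
# block, which maps much better onto a GPU.
function orthogonalize_level2!(w, V, k)
    Vk = view(V, :, 1:k)
    c = Vk' * w            # gemv
    mul!(w, Vk, c, -1, 1)  # w .-= Vk * c, fused via the 5-argument mul!
    return w
end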
The problem can be seen directly in this easy example:
@testset "eigsolve $atype" for atype in [Array, CuArray]
Random.seed!(100)
N = 10^3
A = [atype(rand(ComplexF64, N, N)) for i in 1:4]
v0 = [atype(rand(ComplexF64, N)) for i in 1:4]
linearmap(v) = A .* v
@btime CUDA.@sync inner($v0, $v0)
@btime CUDA.@sync $linearmap($v0)
@btime CUDA.@sync λs, vs = eigsolve(v -> $linearmap(v), $v0, 1, :LM)
end
855.385 ns (2 allocations: 128 bytes)
613.000 μs (14 allocations: 62.84 KiB)
23.762 ms (3015 allocations: 4.40 MiB)
Test Summary: | Total Time
eigsolve Array | 0 25.3s
142.500 μs (58 allocations: 1.06 KiB)
72.000 μs (94 allocations: 2.03 KiB)
4.707 s (124214 allocations: 2.22 MiB)
Test Summary: | Total Time
eigsolve CuArray | 0 47.3s
The linearmap is faster on the GPU, but inner and eigsolve are much slower.
I guess that inner is also not that easy to implement performantly on a GPU, since it requires a reduction rather than a lot of parallel multiplications. In that sense, I would guess there is more to be gained from a specialized kernel than from the simple mapreduce call currently in VectorInterface.
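As a small illustration of that gap, one can compare the fused CUBLAS reduction against a generic broadcast-plus-sum reduction on the same data (illustrative only, not VectorInterface's actual fallback):

using CUDA, LinearAlgebra, BenchmarkTools

x = CuArray(rand(ComplexF64, 10^3))
y = CuArray(rand(ComplexF64, 10^3))

# Single fused library call (CUBLAS dotc).
@btime CUDA.@sync dot($x, $y)

# Generic reduction: an elementwise kernel plus a temporary plus a sum kernel,
# which is the kind of overhead a specialized kernel would avoid.
@btime CUDA.@sync sum(conj.($x) .* $y)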
That is a great test case to focus our attention on!
Is the fact that you have several vectors wrapped in a list essential to this, or is it already manifest with just a single CuArray vector and CuArray matrix?
Yes, a single CuArray vector shows the same behavior, even for a block CuArray; I think the main overhead comes from the allocations of dot on the GPU.
I think the reason is that a vector-vector inner product with N ~ 10^3 is simply too small a computation to benefit from the GPU, so the CPU can be faster in these cases; it is not that "Array of Array is slow".
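A quick way to see that latency floor, and where the CPU/GPU crossover sits, is to scan N directly. A rough sketch:

using CUDA, LinearAlgebra, BenchmarkTools

for N in (10^3, 10^4, 10^5, 10^6)
    x_cpu = rand(ComplexF64, N)
    x_gpu = CuArray(x_cpu)
    t_cpu = @belapsed dot($x_cpu, $x_cpu)
    t_gpu = @belapsed CUDA.@sync dot($x_gpu, $x_gpu)
    println("N = $N: CPU $(round(t_cpu * 1e6; digits=2)) μs, GPU $(round(t_gpu * 1e6; digits=2)) μs")
end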
In my benchmark, A * v takes nearly the same time as [A] .* [v].
julia> @testset "bare eigsolve $atype" for atype in [Array, CuArray]
Random.seed!(100)
N = 10^3
A = atype(rand(ComplexF64, N, N))
v0 = atype(rand(ComplexF64, N))
linearmap(v) = A * v
@btime CUDA.@sync inner($v0, $v0)
@btime CUDA.@sync $linearmap($v0)
@btime CUDA.@sync λs, vs = eigsolve(v -> $linearmap(v), $v0, 1, :LM)
end
1.003 μs (0 allocations: 0 bytes)
45.169 μs (3 allocations: 15.69 KiB)
2.713 ms (352 allocations: 1.06 MiB)
Test Summary: | Total Time
bare eigsolve Array | 0 25.9s
15.950 μs (18 allocations: 304 bytes)
28.550 μs (26 allocations: 544 bytes)
27.160 ms (35397 allocations: 692.83 KiB)
Test Summary: | Total Time
bare eigsolve CuArray | 0 44.2s
2-element Vector{Any}:
Test.DefaultTestSet("bare eigsolve Array", Any[], 0, false, false, true, 1.736115066765964e9, 1.736115092664603e9, false, "REPL[3]")
Test.DefaultTestSet("bare eigsolve CuArray", Any[], 0, false, false, true, 1.736115092672029e9, 1.736115136886287e9, false, "REPL[3]")
julia> @testset "eigsolve $atype" for atype in [Array, CuArray]
Random.seed!(100)
N = 10^3
A = [atype(rand(ComplexF64, N, N)) for i in 1:1]
v0 = [atype(rand(ComplexF64, N)) for i in 1:1]
linearmap(v) = A .* v
@btime CUDA.@sync inner($v0, $v0)
@btime CUDA.@sync $linearmap($v0)
@btime CUDA.@sync λs, vs = eigsolve(v -> $linearmap(v), $v0, 1, :LM)
end
1.290 μs (2 allocations: 80 bytes)
45.370 μs (5 allocations: 15.75 KiB)
2.848 ms (2369 allocations: 1.13 MiB)
Test Summary: | Total Time
eigsolve Array | 0 21.3s
16.080 μs (20 allocations: 384 bytes)
28.660 μs (28 allocations: 608 bytes)
28.375 ms (37781 allocations: 776.12 KiB)
My suggestion is to consider preallocating U, the matrix whose columns span the Krylov space, initialized with zeros, and updating it in place, which would eliminate all of the allocations. When performing Gram-Schmidt, one can simply compute v = v - U*(U'*v), which uses BLAS level 2 instead of level 1 and is much more GPU friendly.
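In code, that suggestion would look roughly like this (a sketch, not KrylovKit's actual implementation; U stores the Krylov vectors as columns and is preallocated once):

using LinearAlgebra, CUDA

n, m = 10^3, 30                   # vector length and maximal Krylov dimension
U = CUDA.zeros(ComplexF64, n, m)  # preallocated Krylov basis, columns filled during the iteration
c = CUDA.zeros(ComplexF64, m)     # preallocated coefficient buffer

# Orthogonalize v in place against the first k columns of U with two gemv calls
# (BLAS level 2) instead of k separate dot/axpy pairs (level 1).
function orthogonalize!(v, U, c, k)
    Uk = view(U, :, 1:k)
    ck = view(c, 1:k)
    mul!(ck, Uk', v)        # ck = Uk' * v
    mul!(v, Uk, ck, -1, 1)  # v .-= Uk * ck, via the non-allocating 5-argument mul!
    return v
end

v = CuArray(rand(ComplexF64, n))
U[:, 1] .= v ./ norm(v)           # pretend one Krylov vector has already been stored
orthogonalize!(v, U, c, 1)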
The eigsolve result specifically is rather confusing to me, let me profile it and see what pops up.