KrylovKit.jl

Commented out the multithreaded function to support CUDA arrays

Open · albertomercurio opened this issue 3 years ago · 2 comments

Hello,

I'm searching for a fast GPU implementation of the exponentiate function. If I run this code, it does not work:

using LinearAlgebra
using SparseArrays
using KrylovKit
using CUDA
using CUDA.CUSPARSE
CUDA.allowscalar(false)  # forbid slow scalar indexing on the GPU

N = 400
a = sprand(ComplexF64, N, N, 1 / N)  # random sparse matrix on the CPU
A = CuSparseMatrixCSR(a)             # move it to the GPU in CSR format
ω = rand(ComplexF64, N)
Ω = CuArray(ω)                       # dense starting vector on the GPU

exponentiate(A, 0.1, Ω)  # compute exp(0.1 * A) * Ω

As can be seen from the commit, I removed the multithreaded function to support GPU arrays.

I think the best option would be to uncomment that part again and insert an if condition, but I don't know what kind of condition covers both CUDA sparse and dense matrices.

albertomercurio avatar Sep 26 '22 23:09 albertomercurio

A possible solution could be to start using the API provided by ArrayInterface.jl
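
For instance, ArrayInterface.jl exposes a fast_scalar_indexing trait that is false for GPU-backed arrays. A sketch of the dispatch condition (assuming the GPU glue packages define the trait for the array types of interest):

using ArrayInterface

# Sketch: only take the multithreaded scalar-indexing path when the array
# actually supports fast scalar indexing (false for GPU-backed arrays).
if ArrayInterface.fast_scalar_indexing(typeof(y)) && get_num_threads() > 1
    return unproject_linear_multithreaded!(y, b, x, α, β, r)
end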

Jutho avatar Sep 27 '22 10:09 Jutho

How do you do that?

Another possible solution could be to use the AbstractGPUArray type from the GPUArrays.jl package. Indeed, we could write an if condition, for example:

if !(y isa AbstractGPUArray) && y isa AbstractArray && IndexStyle(y) isa IndexLinear && get_num_threads() > 1
    return unproject_linear_multithreaded!(y, b, x, α, β, r)
end

However, it will work only for dense CuArrays, because for the moment the CuSparseMatrix types are not subtypes of AbstractGPUArray. I will open an issue for that.
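
A quick REPL check illustrates the limitation (with the CUDA.jl and GPUArrays.jl versions current at the time of writing):

using CUDA, CUDA.CUSPARSE, GPUArrays

CuArray{ComplexF64, 1} <: AbstractGPUArray         # true: dense GPU arrays are covered
CuSparseMatrixCSR{ComplexF64} <: AbstractGPUArray  # false: the sparse wrappers are not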

albertomercurio avatar Sep 27 '22 13:09 albertomercurio

OK, I should have fixed it by introducing the AbstractGPUArray type from GPUArrays.jl.

However, the first time I run the solver on the GPU it returns an out-of-memory error, pointing to these lines:

https://github.com/Jutho/KrylovKit.jl/blob/dee29fe34c724e4c9532ea22dfb6e0464263b8c1/src/orthonormal.jl#L313-L323

But after that I am able to run it every time, so I think it is related to the multithreading. It happens only on the first run; afterwards it works perfectly.
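
A quick way to see whether the CUDA memory pool is involved is to inspect and reclaim it between runs:

CUDA.memory_status()  # print current GPU memory usage and pool state
GC.gc()               # let Julia release unreferenced GPU buffers
CUDA.reclaim()        # return cached pool memory to the driver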

albertomercurio avatar Oct 17 '22 18:10 albertomercurio

I have not had any time for coding or package maintenance in the last few weeks. I will try to find some time over the next couple of days.

Jutho avatar Oct 18 '22 09:10 Jutho

Hi @albertomercurio , my apologies for the late response. I think your solution using AbstractGPUArray is currently the best one. However, I removed Manifest.toml from the PR. (I actually did not realise that I could immediately modify your fork of KrylovKit; I thought that would show up as a PR to your fork).

Jutho avatar Oct 28 '22 21:10 Jutho

Codecov Report

Merging #61 (a8274d2) into master (dee29fe) will increase coverage by 0.12%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #61      +/-   ##
==========================================
+ Coverage   82.45%   82.58%   +0.12%     
==========================================
  Files          27       27              
  Lines        2753     2750       -3     
==========================================
+ Hits         2270     2271       +1     
+ Misses        483      479       -4     
Impacted Files            Coverage Δ
src/KrylovKit.jl          71.92% <ø> (-0.49%) ↓
src/orthonormal.jl        87.08% <100.00%> (+1.10%) ↑
src/dense/givens.jl       47.36% <0.00%> (-2.64%) ↓
src/adrules/linsolve.jl   52.85% <0.00%> (+0.74%) ↑


codecov[bot] avatar Oct 28 '22 22:10 codecov[bot]

> OK, I should have fixed it by introducing the AbstractGPUArray type from GPUArrays.jl. However, the first time I run the solver on the GPU it returns an out-of-memory error [...] It happens only on the first run; afterwards it works perfectly.

Do you have a minimal working example for this? Are you working with a Krylov dimension such that you are almost maxing out the GPU memory? The problem with this method is that it indeed temporarily duplicates the Krylov vectors that you need to store.
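
Schematically, the peak-memory pattern looks like this (a sketch only, not the actual KrylovKit code; N and krylovdim are placeholders):

using CUDA

B = [CuArray(rand(ComplexF64, N)) for _ in 1:krylovdim]  # basis as separate GPU vectors
M = reduce(hcat, B)  # second, matrix-shaped copy; both are alive here, so peak usage ≈ 2×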

Jutho avatar Oct 28 '22 23:10 Jutho

> Do you have a minimal working example for this?

It seems that I no longer have this memory saturation issue. However, I used a Krylov dimension of 40, which should not come close to saturating the GPU memory.

using LinearAlgebra
using SparseArrays
using IncompleteLU
using LinearMaps
using Krylov # The master branch from GitHub
using KrylovKit
using CUDA
using CUDA.CUSPARSE
CUDA.allowscalar(false)

# Wrapper around an in-place ILU(0) factorization stored in a CUSPARSE matrix.
struct ILU0gpu{MT<:AbstractCuSparseMatrix}
    ILU0::MT
end

# Apply the preconditioner: two triangular solves with the L and U factors.
# For CSR the L factor has a unit diagonal; for CSC the roles are swapped.
function LinearAlgebra.ldiv!(y::CuArray{T}, P::ILU0gpu{CuSparseMatrixCSR{T, M}}, x::CuArray{T}) where {T,M}
    A = P.ILU0
    copyto!(y, x)
    sv2!('N', 'L', 'U', one(T), A, y, 'O')
    sv2!('N', 'U', 'N', one(T), A, y, 'O')
    return y
end
function LinearAlgebra.ldiv!(y::CuArray{T}, P::ILU0gpu{CuSparseMatrixCSC{T, M}}, x::CuArray{T}) where {T,M}
    A = P.ILU0
    copyto!(y, x)
    sv2!('N', 'L', 'N', one(T), A, y, 'O')
    sv2!('N', 'U', 'U', one(T), A, y, 'O')
    return y
end

N = 1000
elty = ComplexF64

# Build a dense matrix with known eigenvalues 1:N, then sparsify it.
vals0 = Array(1:N)
vecs0 = rand(elty, N, N)
A = vecs0 \ diagm(vals0) * vecs0
A = sparse(A)
# droptol!(A, 1000)
v0 = rand(elty, N)

μ = 0 # SHIFT-INVERSE: look for eigenvalues of A closest to μ
A_s = A - μ * I

A_s2 = CuSparseMatrixCSR(A_s)
v02 = CuArray(v0)
P2 = ILU0gpu(ilu02(A_s2, 'O'))  # ILU(0) preconditioner on the GPU
solver = GmresSolver(size(A_s2)..., 20, typeof(v02))
# The linear map applies (A - μI)^(-1) by solving a linear system with preconditioned GMRES.
Lmap = LinearMap{eltype(v02)}((y, x) -> copyto!(y, gmres!(solver, A_s2, x, M = P2, ldiv = true, restart = true).x), size(A_s2, 1))
vals, vecs, info = eigsolve(Lmap, v02, 6, krylovdim = 20)
show(info)
vals = (1 .+ μ * vals) ./ vals  # map eigenvalues of (A - μI)^(-1) back to eigenvalues of A
abs.(vals)
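
For reference, the last two lines undo the shift-invert transformation: if ν is an eigenvalue of (A − μI)⁻¹, the corresponding eigenvalue of A is λ = μ + 1/ν = (1 + μν)/ν, which is exactly what `(1 .+ μ * vals) ./ vals` computes.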

albertomercurio avatar Oct 29 '22 10:10 albertomercurio