
[Feature Request] Determinant Function

Open MasonProtter opened this issue 6 years ago • 10 comments

We currently don't have a way to take determinants of matrices on the GPU as far as I know. It'd be really nice if we could do that.

MasonProtter avatar Nov 14 '19 02:11 MasonProtter

Came up here again: https://discourse.julialang.org/t/determinant-of-cuda-matrix/77245/3

I mentioned

julia> M = rand(10,10);

julia> Mgpu = cu(M);

julia> det(M) # cpu
-0.10243723110926993

julia> prod(diag(LinearAlgebra.LAPACK.getrf!(M)[1])) # cpu
-0.10243723110926993

julia> prod(diag(CUSOLVER.getrf!(Mgpu)[1])) # gpu
-0.10243723f0

But I think the sign is uncertain here!
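
For context, getrf! computes P*A = L*U, so det(A) = det(P) * prod(diag(U)), where det(P) = (-1)^(number of row swaps recorded in ipiv). A quick CPU check of that bookkeeping (variable names here are just for illustration):

julia> using LinearAlgebra

julia> M = rand(10,10);

julia> F, ipiv, _ = LinearAlgebra.LAPACK.getrf!(copy(M));

julia> swaps = count(i -> ipiv[i] != i, eachindex(ipiv));

julia> det(M) ≈ (-1)^swaps * prod(diag(F))
true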

carstenbauer avatar Mar 01 '22 15:03 carstenbauer

Indeed, there is a sign issue. I get a negative determinant for a PSD matrix:

LinearAlgebra.det(m::CuMatrix) = prod(diag(CUSOLVER.getrf!(m)[1])) #Shameless type piracy
A = rand(10,10)
S = A*A'
S_d = cu(S)
det(S) ≈ -det(S_d) # true (up to Float32 precision): the sign is flipped

Since I only need it for PSD matrices, I can work around it by taking the abs. I can't help more though; I don't have a strong background in LA.

HenriDeh avatar Mar 01 '22 17:03 HenriDeh

But I think the sign is uncertain here!

You have to multiply by the sign of the permutation. Something like:

function det!(A::CuMatOrAdj)
    A, ipiv = CUSOLVER.getrf!(A)
    return prod(diag(A)) * (isodd(count(((k,i),) -> k != i, enumerate(ipiv))) ? -1 : 1)
end
LinearAlgebra.det(A::CuMatOrAdj) = det!(copy(A))
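
For reference, a quick usage check once that method is defined (a sketch; exact equality won't hold across Float64/Float32):

julia> using CUDA, LinearAlgebra

julia> M = rand(10,10);

julia> det(cu(M)) ≈ det(M)  # should hold up to Float32 tolerance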

PS. This is not "type piracy" because the CuMatOrAdj type is specific to this package.

stevengj avatar Mar 01 '22 17:03 stevengj

Note that you'll also want to provide logdet and logabsdet, which can be computed with much the same code (you just take the sum of the logs of the |diagonals| instead of the product, and keep track of the signs).
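
A minimal sketch along those lines for logabsdet (untested; assumes the same CuMatOrAdj alias and CUSOLVER.getrf! as above, and copies ipiv to the host for the sign count):

function logabsdet!(A::CuMatOrAdj)
    A, ipiv = CUSOLVER.getrf!(A)
    d = diag(A)
    logabs = sum(log ∘ abs, d)  # sum of log|diagonal|, reduced on the device
    sgn = prod(sign, d)         # ±1 (or a complex phase) from the diagonal entries
    swaps = count(((k, i),) -> k != i, enumerate(Array(ipiv)))
    return logabs, isodd(swaps) ? -sgn : sgn
end
LinearAlgebra.logabsdet(A::CuMatOrAdj) = logabsdet!(copy(A))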

stevengj avatar Mar 01 '22 17:03 stevengj

Was going to suggest the same as Steven, i.e. multiplying by the sign of the permutation:

My draft implementation was

function det!(m::CuMatOrAdj)
   X_d, ipiv_d = CUSOLVER.getrf!(m)
   diags = Array(diag(X_d))
   ipiv = Array(ipiv_d)
   det = one(eltype(diags))
   @inbounds for i in 1:length(ipiv)
       det *= diags[i]
       if i != ipiv[i]
           det = -det
       end
   end
   return det
end
LinearAlgebra.det(A::CuMatOrAdj) = det!(copy(A))

I guess the only question is how to most efficiently implement this.
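
One way to avoid the host transfer entirely might be to count the swaps on the device (an untested sketch; det_device! is just an illustrative name):

function det_device!(m::CuMatOrAdj)
    X_d, ipiv_d = CUSOLVER.getrf!(m)
    # count pivots that differ from their row index, reduced on the GPU
    swaps = sum(CuArray(1:length(ipiv_d)) .!= ipiv_d)
    return prod(diag(X_d)) * (isodd(swaps) ? -1 : 1)
end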

BTW, do we need a CUDA.@sync to be safe?

carstenbauer avatar Mar 01 '22 17:03 carstenbauer

PS. This is not "type piracy" because the CuMatOrAdj type is specific to this package.

I guess he meant that this is type piracy in his code (i.e. outside of CUDA.jl).

carstenbauer avatar Mar 01 '22 18:03 carstenbauer

PS. This is not "type piracy" because the CuMatOrAdj type is specific to this package.

I guess he meant that this is type piracy in his code (i.e. outside of CUDA.jl).

Yes, it's a copy-paste from my code, where it is piracy.

HenriDeh avatar Mar 01 '22 19:03 HenriDeh

I fear all of the implementations proposed here are much slower than a CPU equivalent. I made this benchmark where I compute the logdet without needing the LU decomposition (I assume it has already been done):

using CUDA,LinearAlgebra,BenchmarkTools

L = LowerTriangular(rand(32,32)) 
L_d = L |> cu

#Given L or U, compute logdet of L*U
function mylogdetLorU(LorU::Union{LowerTriangular{<:Number, <:CuArray}, UpperTriangular{<:Number, <:CuArray}})
    return 2*mapreduce(log, +, diag(LorU), init = 0f0) #fastest I could find
end

@btime mylogdetLorU($L_d) #85.6 μs
@btime 2*logdet($L) #186.18 ns

I find identical performance with the above implementations of det! (from which I stripped the getrf! call, since in my case the decomposition is already done).

I think they allocate too much (the CPU implementation does not allocate at all).
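
To quantify that, one could compare host-side allocations (a sketch; note that Base.@allocated only measures CPU allocations, so device buffers from diag aren't counted):

julia> @allocated mylogdetLorU(L_d)

julia> @allocated 2*logdet(L)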

HenriDeh avatar Mar 29 '22 09:03 HenriDeh

I have a logdet implementation inspired by @carstenbauer's code above, and it's faster than the CPU implementation for the case I care about:

function logdet!(m)
   X_d, ipiv_d = CUSOLVER.getrf!(m)
   ipiv = Array(ipiv_d)

   ldet = sum(log, diag(X_d))
   ldet += im * π * get_count(ipiv)
   
   return real(ldet) + im*(imag(ldet) % 2π)
end

function get_count(ipiv)
    # number of pivots that differ from their row index, i.e. the row swaps
    return @inbounds count(i -> ipiv[i] != i, 1:length(ipiv))
end

my_logdet(A) = logdet!(copy(A))

Timing it:

julia> A = randn(ComplexF64, 1024, 1024);

julia> A_d = CuArray(A);

julia> @belapsed logdet($A)
0.03181127

julia> @belapsed my_logdet($A_d)
0.007738641

julia> isapprox(logdet(A), my_logdet(A_d); rtol = 1e-8)
true

get_count is non-allocating, but we could do it with a mapreduce or similar. You should be able to do the same thing for det and determine the sign similarly. My only concern is that we still need to pull the ipiv array to the CPU; is there a nice built-in way to do it on the GPU?
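
One device-side option might be a broadcast-plus-reduction over the pivot vector, replacing the Array(ipiv_d) copy (a sketch, untested; same idea as the fused variant sketched earlier in the thread):

ldet += im * π * sum(CuArray(1:length(ipiv_d)) .!= ipiv_d)  # row swaps counted on the GPU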

SBuercklin avatar May 17 '22 21:05 SBuercklin

I guess the overhead of diag allocating on the GPU is insignificant when one gets the GPU speed-up for the decomposition. My use case is probably too specific to deserve an investigation here.

HenriDeh avatar May 18 '22 09:05 HenriDeh