LuxurySparse.jl
Sparse Kronecker sums
Right now, I'm working on in-place, CPU versions of kronsum(A,B) = kron(A, oneunit(B)) + kron(oneunit(A), B)
So far, this works for dense arrays:
function kronsum!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
    # Note: kronsum is only defined for square A and B (oneunit requires
    # square matrices), so mA == nA and mB == nB in the stride arithmetic below.
    Base.require_one_based_indexing(A, B)
    (mA, nA) = size(A); (mB, nB) = size(B); (mC, nC) = size(C)
    @boundscheck (mC, nC) == (mA * mB, nA * nB) ||
        throw(DimensionMismatch("expect C to be a $(mA * mB)x$(nA * nB) matrix, got size $(mC)x$(nC)"))
    C .= 0
    # First pass: write kron(A, I), i.e. block (k,j) of C gets A[k,j] on its diagonal.
    m = 1
    @inbounds for j = 1:nA
        for l = 1:mB
            for k = 1:mA
                C[m] = A[k,j]
                m += nB
            end
            m += 1
        end
        m -= nB
    end
    # Second pass: add kron(I, B), i.e. accumulate B into the diagonal blocks of C.
    m = 1
    @inbounds for j = 1:nA
        for k = 1:nB
            for l = 1:mB
                C[m] += B[l,k]
                m += 1
            end
            m += (nA - 1) * mB
        end
        m += mB
    end
    return C
end
A = rand(3,3); B = rand(2,2); C = zeros(6,6);
kronsum!(C, A, B);
C == kron(A, oneunit(B)) + kron(oneunit(A), B) # true
I'm having trouble doing something similar with SparseMatrixCSC, since parts of A and B only overlap on the diagonal of C.
Using the diagonal format (from DIA.jl) could work, but I think converting to CSC would take too much time (especially if either A or B is sparse), and there's the risk of storing too many zeroes in the diagonal storage Vectors.
Currently, I'm inspired by the sparse kron! in https://github.com/JuliaLang/julia/pull/31069. It works for either kron(A,oneunit(B)) or kron(oneunit(A), B), but not the sum.
- Could I use two loops like I did for the dense arrays? (One nested loop for A, the other for B)
- Can I somehow efficiently check if an index points to a diagonal term of C, where C isa SparseMatrixCSC?
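For the second question, a stored entry's position in a SparseMatrixCSC already encodes its column, so the diagonal test only needs the row index. A minimal sketch (diagonal_positions is my own helper, not part of any package):

```julia
using SparseArrays

# Sketch: walk the stored entries of a SparseMatrixCSC and collect the
# positions (into nonzeros(C)) that lie on the diagonal. In CSC, column j's
# stored entries sit at nzrange(C, j), so the test is just rowvals(C)[p] == j.
function diagonal_positions(C::SparseMatrixCSC)
    positions = Int[]
    rows = rowvals(C)
    for j in 1:size(C, 2), p in nzrange(C, j)
        rows[p] == j && push!(positions, p)
    end
    return positions
end
```

For example, `diagonal_positions(sparse([1.0 2.0; 0.0 3.0]))` gives `[1, 3]`: the stored entries in column-major order are C[1,1], C[1,2], C[2,2], and the first and third are diagonal.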
have you tried to use kron + IMatrix from LuxurySparse? this is how we calculate our own Hamiltonian matrices in Yao at the moment. I'm curious whether kronsum! will be faster than it and how much speedup it could give (let's say in the dense matrix case).
To get a feeling for which part to optimize:
julia> a = sprand(500, 500, 0.02);
julia> m = kron(a, IMatrix{500}());
julia> n = kron(IMatrix{500}(), a);
julia> @benchmark $m + $n
BenchmarkTools.Trial:
memory estimate: 78.57 MiB
allocs estimate: 8
--------------
minimum time: 43.870 ms (0.69% GC)
median time: 69.541 ms (7.15% GC)
mean time: 72.327 ms (10.12% GC)
maximum time: 160.216 ms (61.05% GC)
--------------
samples: 70
evals/sample: 1
julia> @benchmark kron($a, IMatrix{500}())
BenchmarkTools.Trial:
memory estimate: 40.24 MiB
allocs estimate: 8
--------------
minimum time: 7.091 ms (0.00% GC)
median time: 8.242 ms (0.00% GC)
mean time: 8.540 ms (2.21% GC)
maximum time: 14.998 ms (4.26% GC)
--------------
samples: 585
evals/sample: 1
julia> @benchmark kron(IMatrix{500}(), $a)
BenchmarkTools.Trial:
memory estimate: 59.40 MiB
allocs estimate: 10
--------------
minimum time: 7.096 ms (0.00% GC)
median time: 9.283 ms (4.30% GC)
mean time: 11.296 ms (6.95% GC)
maximum time: 25.206 ms (3.04% GC)
--------------
samples: 443
evals/sample: 1
julia> res = m + n;
julia> @benchmark copy($res)
BenchmarkTools.Trial:
memory estimate: 77.62 MiB
allocs estimate: 8
--------------
minimum time: 37.340 ms (0.00% GC)
median time: 55.627 ms (0.00% GC)
mean time: 58.335 ms (7.91% GC)
maximum time: 72.644 ms (20.63% GC)
--------------
samples: 86
evals/sample: 1
So, the question is whether kronsum! can be faster than the SparseMatrixCSC "+" operation in Base. It should be possible if you can preallocate the output matrix, because most of the time is spent in memory allocation.
In Kronecker.jl, the (out-of-place) equivalent of kronsum uses SparseMatrixCSC +:
"""
collect(K::AbstractKroneckerSum)
Collects a lazy instance of the `AbstractKroneckerSum` type into a full,
native matrix. Returns the result as a sparse matrix.
"""
function Base.collect(K::AbstractKroneckerSum)
    A, B = getmatrices(sparse(K))
    IA, IB = oneunit(A), oneunit(B)
    return kron(A, IB) + kron(IA, B)
end
"""
SparseArrays.sparse(K::KroneckerSum)
Creates a lazy instance of a `KroneckerSum` type with sparse
matrices. If the matrices are already sparse, `K` is returned.
"""
function SparseArrays.sparse(K::KroneckerSum{T, TA, TB}) where {T, TA <: AbstractSparseMatrix, TB <: AbstractSparseMatrix}
    return K
end
function SparseArrays.sparse(K::KroneckerSum{T, TA, TB}) where {T, TA <: AbstractMatrix, TB <: AbstractMatrix}
    return sparse(K.A) ⊕ sparse(K.B)
end
For my use case, I would preallocate the output matrix C with kron once, then update the non-zero values of A and B (assuming they have a fixed sparsity structure).
julia> A, B = (sprand(32, 32, 0.1) for _ in 1:2);
julia> IA, IB = (IMatrix(X) for X in (A,B));
julia> C = kron(A, IB) + kron(IA, B); C.nzval .= 0;
...
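One way to make that update step cheap is to resolve, once, where every stored entry of A and B lands inside C.nzval, so the refresh never touches the structure. The helpers below (plan_kronsum and kronsum_update! are hypothetical names of my own) sketch this for square sparse A and B, assuming C was preallocated with the union pattern kron(A, I) + kron(I, B):

```julia
using SparseArrays

# Sketch: precompute, for the fixed structure C = kron(A, I) + kron(I, B),
# the slot in nonzeros(C) that each (stored entry, replica index) maps to.
function plan_kronsum(C::SparseMatrixCSC, A::SparseMatrixCSC, B::SparseMatrixCSC)
    nA, nB = size(A, 1), size(B, 1)
    rvC = rowvals(C)
    # position of the stored slot C[i, j] inside nonzeros(C); the structure is
    # fixed, so this binary search is paid only once, at planning time
    slot(i, j) = (r = nzrange(C, j); r[searchsortedfirst(view(rvC, r), i)])
    planA = Int[]   # kron(A, I): A[i,j] lands at C[(i-1)*nB + l, (j-1)*nB + l]
    for j in 1:nA, p in nzrange(A, j), l in 1:nB
        push!(planA, slot((rowvals(A)[p] - 1) * nB + l, (j - 1) * nB + l))
    end
    planB = Int[]   # kron(I, B): B[r,j] lands at C[(k-1)*nB + r, (k-1)*nB + j]
    for j in 1:nB, p in nzrange(B, j), k in 1:nA
        push!(planB, slot((k - 1) * nB + rowvals(B)[p], (k - 1) * nB + j))
    end
    return planA, planB
end

# Refresh C's values in place; no allocation, structure untouched.
function kronsum_update!(C, A, B, planA, planB)
    nzC = nonzeros(C)
    fill!(nzC, 0)
    q = 0
    for v in nonzeros(A), _ in 1:size(B, 1)   # same (column, entry) order as planA
        nzC[planA[q += 1]] += v
    end
    q = 0
    for v in nonzeros(B), _ in 1:size(A, 1)   # same (column, entry) order as planB
        nzC[planB[q += 1]] += v
    end
    return C
end
```

After the one-time planning pass, each update is a pair of straight-line loops over precomputed slots, which is the kind of work that should beat re-running sparse kron and + every time.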
When the sparsity structure is fixed, a batched_kron! implementation for batches of sparse matrices would be very nice. One would probably need a new type:
struct BatchSparseMatrixCSC{Tv,Ti<:Integer} <: AbstractSparseArray{Tv,Ti,3} end
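A sketch of what that type might hold, assuming every matrix in the batch shares one sparsity structure (the field names here are my own guess, not an existing API):

```julia
using SparseArrays

# Hypothetical layout: one shared CSC structure, a batch dimension only on values.
struct BatchSparseMatrixCSC{Tv,Ti<:Integer} <: AbstractSparseArray{Tv,Ti,3}
    m::Int              # number of rows of each matrix in the batch
    n::Int              # number of columns of each matrix in the batch
    colptr::Vector{Ti}  # shared column pointers
    rowval::Vector{Ti}  # shared row indices
    nzval::Matrix{Tv}   # nnz × batchsize values, one column per matrix
end

Base.size(S::BatchSparseMatrixCSC) = (S.m, S.n, size(S.nzval, 2))
```

A batched_kron! could then write each output batch column from the corresponding columns of the inputs' nzval, reusing one structural pass for the whole batch.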
have you tried to use kron + IMatrix from LuxurySparse? this is how we calculate our own Hamiltonian matrices in Yao at the moment. I'm curious whether kronsum! will be faster than it and how much speedup it could give (let's say in the dense matrix case).
julia> A, B = (sprand(100, 100, 0.1) for _ in 1:2);
julia> IA, IB = (IMatrix(X) for X in (A,B));
julia> C = kron(A, IB) + kron(IA, B); C.nzval .= 0;
julia> @benchmark kronsum!($C, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 65.690 ms (0.00% GC)
median time: 66.072 ms (0.00% GC)
mean time: 66.199 ms (0.00% GC)
maximum time: 67.885 ms (0.00% GC)
--------------
samples: 76
evals/sample: 1
julia> @benchmark kron($A, $IB) + kron($IA, $B)
BenchmarkTools.Trial:
memory estimate: 6.88 MiB
allocs estimate: 26
--------------
minimum time: 1.381 ms (0.00% GC)
median time: 1.427 ms (0.00% GC)
mean time: 1.479 ms (2.43% GC)
maximum time: 2.693 ms (31.06% GC)
--------------
samples: 3364
evals/sample: 1
julia> C == kron(A,IB) + kron(IA, B)
true
IMatrix appears to be slightly faster than oneunit, but I have yet to take the construction of the two identity matrices into account.
julia> unA, unB = (oneunit(X) for X in (A,B));
julia> @benchmark kron($A, $unB) + kron($unA, $B)
BenchmarkTools.Trial:
memory estimate: 6.15 MiB
allocs estimate: 24
--------------
minimum time: 1.922 ms (0.00% GC)
median time: 1.968 ms (0.00% GC)
mean time: 2.012 ms (1.59% GC)
maximum time: 2.960 ms (26.67% GC)
--------------
samples: 2475
evals/sample: 1
If the matrices are dense: using IMatrix and + is much faster, but kronsum! is still allocation-free.
julia> A, B = (rand(40, 40) for _ in 1:2);
julia> IA, IB = (IMatrix(X) for X in (A,B));
julia> C = kron(A, IB) + kron(IA, B); C.nzval .= 0;
julia> @benchmark kronsum!($C, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.664 ms (0.00% GC)
median time: 3.728 ms (0.00% GC)
mean time: 3.746 ms (0.00% GC)
maximum time: 5.585 ms (0.00% GC)
--------------
samples: 1334
evals/sample: 1
julia> @benchmark kron($A, $IB) + kron($IA, $B)
BenchmarkTools.Trial:
memory estimate: 3.94 MiB
allocs estimate: 21
--------------
minimum time: 710.607 μs (0.00% GC)
median time: 733.885 μs (0.00% GC)
mean time: 768.422 μs (2.74% GC)
maximum time: 1.837 ms (44.32% GC)
--------------
samples: 6452
evals/sample: 1
julia> unA, unB = (oneunit(X) for X in (A,B));
julia> @benchmark kron($A, $unB) + kron($unA, $B)
BenchmarkTools.Trial:
memory estimate: 58.59 MiB
allocs estimate: 6
--------------
minimum time: 9.340 ms (0.00% GC)
median time: 9.791 ms (0.00% GC)
mean time: 9.924 ms (2.67% GC)
maximum time: 10.843 ms (7.24% GC)
--------------
samples: 504
evals/sample: 1
IMatrix actually does not have any allocation, and kron creates a sparse matrix when there is an IMatrix, so I believe the allocations could be fine. We could have some kron! interface support for these matrices, so you could preallocate those as well.
NOTE: oneunit allocates an entire dense array and the result of kron will be an Array as well, while IMatrix allocates nothing and produces a SparseMatrixCSC.
Could I use two loops like I did for the dense arrays? (One nested loop for A, the other for B)
No, setindex! for the CSC format is not efficient. You would have to generate the colptr etc. directly, which would be the same as what's in Base.
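To illustrate what "generate the colptr directly" involves for the kronsum pattern: column c = (j-1)*nB + l of C draws its rows from column j of A and column l of B, and the two row sets collide only at the diagonal slot. A counting-pass sketch (kronsum_colptr is a hypothetical helper of mine, assuming square sparse A and B; insorted needs Julia 1.6+):

```julia
using SparseArrays

# Sketch: build only the colptr of C = kron(A, I) + kron(I, B) directly.
# Column c = (j-1)*nB + l of C gets rows (i-1)*nB + l from stored A[i,j]
# and rows (j-1)*nB + r from stored B[r,l]; these overlap only at the
# diagonal slot C[c, c], and only when both A[j,j] and B[l,l] are stored.
function kronsum_colptr(A::SparseMatrixCSC, B::SparseMatrixCSC)
    nA, nB = size(A, 2), size(B, 2)
    rvA, rvB = rowvals(A), rowvals(B)
    colptr = Vector{Int}(undef, nA * nB + 1)
    colptr[1] = 1
    c = 1
    for j in 1:nA, l in 1:nB
        # subtract 1 when the diagonal slot would otherwise be counted twice
        overlap = insorted(j, view(rvA, nzrange(A, j))) &&
                  insorted(l, view(rvB, nzrange(B, l)))
        colptr[c + 1] = colptr[c] + length(nzrange(A, j)) + length(nzrange(B, l)) - overlap
        c += 1
    end
    return colptr
end
```

A full implementation would follow with a second pass that merges the two sorted row lists per column into rowval and nzval, which is essentially the structure of the kron! code in Base that the PR above touches.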