SciMLOperators.jl
make tensor products faster
With the current @views-based implementation:
https://github.com/SciML/SciMLOperators.jl/blob/746c0d50a75067fae4121547a3c00bd4bbe27ce9/src/sciml.jl#L584-L618
using SciMLOperators, LinearAlgebra
using BenchmarkTools
A = TensorProductOperator(rand(12,12), rand(12,12), rand(12,12))
u = rand(12^3, 100)
v = rand(12^3, 100)
A = cache_operator(A, u)
mul!(v, A, u) # dummy
@btime mul!($v, $A, $u); # 16.901 ms (348801 allocations: 45.51 MiB)
B = convert(AbstractMatrix, A)
mul!(v, B, u) # dummy
@btime mul!($v, $B, $u); # 10.337 ms (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.8.0-rc1
Commit 6368fdc6565 (2022-05-27 18:33 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin21.4.0)
CPU: 4 × Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
Threads: 4 on 4 virtual cores
Environment:
JULIA_NUM_PRECOMPILE_TASKS = 4
JULIA_DEPOT_PATH = /Users/vp/.julia
JULIA_NUM_THREADS = 4
using SciMLOperators, LinearAlgebra
using BenchmarkTools
A = TensorProductOperator(rand(12,12), rand(12,12), rand(12,12))
u = [rand(12^3) for i=1:100]
v = [rand(12^3) for i=1:100]
A = cache_operator(A, u[1])
mul!(v[1], A, u[1]) # dummy
@btime for i=1:length(u); mul!($v[i], $A, $u[i]); end; # 7.333 ms (3301 allocations: 27.05 MiB)
B = convert(AbstractMatrix, A)
mul!(v[1], B, u[1]) # dummy
@btime for i=1:length(u); mul!($v[i], $B, $u[i]); end; # 129.053 ms (101 allocations: 3.16 KiB)
So applying the operator to individual vectors is indeed faster, likely because we are avoiding the view calls.
@chriselrod @ChrisRackauckas can you take a look please?
I think the 'slow' part is that we are applying the operator to the column vectors of u one at a time:
https://github.com/SciML/SciMLOperators.jl/blob/746c0d50a75067fae4121547a3c00bd4bbe27ce9/src/sciml.jl#L608-L610
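For context, here is a rough, self-contained paraphrase of the slice-wise pattern at those lines, using plain arrays as stand-ins for the package internals (Lo plays the role of L.outer; all names and shapes are illustrative, not the exact source):
using LinearAlgebra

mo, no, mi, k = 12, 12, 12, 100   # sizes matching the benchmark above
Lo = rand(mo, no)                 # stand-in for L.outer
C  = rand(no, mi, k)              # stand-in for the cached intermediate
V  = zeros(mo, mi, k)             # stand-in for the reshaped output

# the outer factor is applied one batch slice at a time, so every mul! call
# receives SubArray (view) arguments
for i in 1:k
    @views mul!(V[:, :, i], Lo, C[:, :, i])
end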
There's definitely something I'm doing wrong, because we are slower than LinearMaps, and LinearMaps has allocations in their mul!:
julia> using LinearMaps, LinearAlgebra, BenchmarkTools; A = ⊗(rand(12,12), rand(12,12), rand(12,12)); u=rand(12^3,100); v=copy(u); @btime mul!($v, $A, $u);
5.142 ms (8500 allocations: 27.11 MiB)
https://github.com/JuliaLinearAlgebra/LinearMaps.jl/blob/master/src/kronecker.jl#L117-L131
@danielwe
I think the 'slow' part is that we are applying the operator to the column vectors of u one at a time:
https://github.com/SciML/SciMLOperators.jl/blob/746c0d50a75067fae4121547a3c00bd4bbe27ce9/src/sciml.jl#L608-L610
You could try replacing that with an @turbo or @tturbo loop.
@turbo didn't recognize mul! when I checked. And L.outer is not necessarily an AbstractMatrix (e.g. it is an AbstractSciMLOperator, since we nest ops: TensorProductOperator(A, B, C) === TensorProductOperator(A, TPPOp(B, C))), so it can't necessarily be indexed. So we can't really get rid of the mul! unless we come up with a general implementation for TensorProductOperator with an arbitrary number of arguments.
I meant delete the mul!, replacing it with loops.
You can't assume operators are indexable; it's not part of the interface (for a good reason: the operator may be matrix-free). If you need to do that, then we can have a trait for fast indexing (extending the ArrayInterface one), but I'm not sure that's useful, since in most cases it should be matrix-free.
No indexing, but views are okay?
We could have a special overload for certain types.
@chriselrod I'm not sure how view without indexing would work.
https://github.com/SciML/SciMLOperators.jl/blob/746c0d50a75067fae4121547a3c00bd4bbe27ce9/src/sciml.jl#L608-L610
In this case, the slices C[:,:,i], V[:,:,i] are passed into mul! as subarrays. But L.outer isn't guaranteed to be indexable.
In https://github.com/SciML/SciMLOperators.jl/pull/59 I've written an implementation using permutedims! that is faster than LinearMaps.jl and doesn't allocate. Could you take a look and suggest improvements?
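For anyone skimming, a hedged, self-contained sketch of the permutedims!-based idea on plain arrays (names and shapes are illustrative, not the PR's actual internals): permute the batch dimension so the outer factor can be applied in one big mul!, then permute back; no views and no per-slice mul! calls.
using LinearAlgebra

mi, no, mo, k = 12, 12, 12, 100
Lo = rand(mo, no)                 # stand-in for L.outer
C1 = rand(mi, no, k)              # intermediate after applying the inner factor
C2 = zeros(no, mi, k)             # permutation buffer
C3 = zeros(mo, mi * k)            # result of the single outer mul!
V  = zeros(mi, mo, k)             # output buffer

permutedims!(C2, C1, (2, 1, 3))                      # bring the outer index to the front
mul!(C3, Lo, reshape(C2, no, mi * k))                # one large matrix-matrix product
permutedims!(V, reshape(C3, mo, mi, k), (2, 1, 3))   # restore the expected layout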
No indexing, but views are okay?
Views are not okay. Views and indexing are not allowed on the operator. They're fine on the thing being acted on, though. It's A*v where A is a linear operator which may not necessarily have an easy matrix representation. A[i,:] may only be defined by matrix-vector multiplications (i.e. very slow), so direct calculation of A*v is preferred. While there is MatrixOperator, that's probably one of the only cases where we can assume it (the operator) has fast indexing.
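To make that cost concrete, a small illustration with an arbitrary matrix-free linear action (cumsum here is just a stand-in): recovering a single row of the underlying matrix already takes n operator applications, versus one application for A*v.
n = 8
apply(v) = cumsum(v)   # matrix-free linear action; its matrix is never formed

# entry (i, j) of the underlying matrix is e_i' * A * e_j, so extracting one
# row costs n operator applications
e(j) = (x = zeros(n); x[j] = 1.0; x)
row(i) = [apply(e(j))[i] for j in 1:n]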
While there is MatrixOperator, that's probably one of the only cases where we can assume it (the operator) has fast indexing.
What about GPU matrices? I don't think you can assume fast indexing for MatrixOperator either.
(I don't have anything to contribute with regards to the issue at hand, sorry.)
Yeah, fast_indexing on MatrixOperator would have to check the ArrayInterface fast_indexing function.
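A hedged sketch of what that trait check could look like (the trait name has_fast_indexing is hypothetical, and this assumes MatrixOperator stores its matrix in the field A; ArrayInterface's actual query is called fast_scalar_indexing):
using SciMLOperators, ArrayInterface

# true only when scalar indexing of the wrapped array is cheap (e.g. false for
# GPU arrays); conservative fallback for every other operator
has_fast_indexing(L::MatrixOperator) = ArrayInterface.fast_scalar_indexing(L.A)
has_fast_indexing(L::AbstractSciMLOperator) = false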
With the current implementation (https://github.com/SciML/SciMLOperators.jl/blob/c9488881763bf838898f212f3aa071e4e987bc53/src/sciml.jl#L684-L724):
using SciMLOperators, LinearAlgebra, BenchmarkTools
u = rand(12^2, 100)
v = rand(12^2, 100)
A = rand(12, 12) # outer
B = rand(12, 12) # inner
T = TensorProductOperator(A, B)
T = cache_operator(T, u)
@btime mul!($v, $T, $u); # 80.416 μs (7 allocations: 304 bytes)
# reference computation - a single matvec
u = reshape(u, (12, 12*100))
v = reshape(v, (12, 12*100))
@btime mul!($v, $A, $u); # 24.005 μs (0 allocations: 0 bytes)
Gotta track those allocations.
The two permutedims! calls together take about as much time as a single mul!(). So we should add more cases to the L.outer matvec computation to avoid the permutedims! wherever possible.
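That claim is easy to sanity-check outside the package by timing a comparable permutedims! against a comparable mul! (the shapes below are assumed to mirror the 12x12 ⊗ 12x12 example; exact numbers will vary by machine):
using LinearAlgebra, BenchmarkTools

C1 = rand(12, 12, 100)
C2 = similar(C1)
@btime permutedims!($C2, $C1, (2, 1, 3));   # cost of one permutation pass

W = rand(12, 12)
U = rand(12, 12 * 100)
V = similar(U)
@btime mul!($V, $W, $U);                    # cost of the single outer mul!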
FYI currently we are at:
using SciMLOperators, LinearAlgebra, BenchmarkTools
A = TensorProductOperator(rand(12,12), rand(12,12), rand(12,12))
u = rand(12^3, 100)
v = rand(12^3, 100)
A = cache_operator(A, u)
mul!(v, A, u) # dummy
@btime mul!($v, $A, $u); # 4.916 ms (17 allocations: 31.36 KiB)
using SciMLOperators, LinearAlgebra, BenchmarkTools
u = rand(12^2, 100)
v = rand(12^2, 100)
A = rand(12, 12) # outer
B = rand(12, 12) # inner
T = TensorProductOperator(A, B)
T = cache_operator(T, u)
@btime mul!($v, $T, $u); # 80.081 μs (7 allocations: 304 bytes)
nothing
Locally, I tried using @turbo in one of the mul! functions in sciml.jl.
Master:
julia> @benchmark mul!($v, $T, $u)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 44.266 μs … 123.030 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 46.153 μs ┊ GC (median): 0.00%
Time (mean ± σ): 46.537 μs ± 1.868 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▇█▇▅▃▁
▁▁▁▂▂▃▆███████▇▅▄▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
44.3 μs Histogram: frequency by time 54.8 μs <
Memory estimate: 304 bytes, allocs estimate: 7.
With @turbo:
julia> @benchmark mul!($v, $T, $u)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 18.194 μs … 84.525 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 19.054 μs ┊ GC (median): 0.00%
Time (mean ± σ): 19.176 μs ± 1.133 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▄▇██▇▇▄▃
▂▂▃▄▆███████████▇▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▂ ▃
18.2 μs Histogram: frequency by time 22.9 μs <
Memory estimate: 96 bytes, allocs estimate: 3.
This was with:
using SciMLOperators, LinearAlgebra, BenchmarkTools
u = rand(12^2, 100)
v = rand(12^2, 100)
A = rand(12, 12) # outer
B = rand(12, 12) # inner
T = TensorProductOperator(A, B)
T = cache_operator(T, u)
Using @tturbo instead:
julia> @benchmark mul!($v, $T, $u)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 8.935 μs … 101.421 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 9.533 μs ┊ GC (median): 0.00%
Time (mean ± σ): 9.667 μs ± 1.218 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▇██▅▃▃▂▃▄▂
▂▂▃▅█████████████▆▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
8.94 μs Histogram: frequency by time 12.5 μs <
Memory estimate: 96 bytes, allocs estimate: 3.
About 5x faster. Note that LinearAlgebra.mul! is also multithreaded.
The code would need the right dispatches in place so that it behaves correctly as a function of the input types.
diff --git a/src/sciml.jl b/src/sciml.jl
index 5291b1d..f482517 100644
--- a/src/sciml.jl
+++ b/src/sciml.jl
@@ -593,7 +593,7 @@ function cache_internals(L::TensorProductOperator, u::AbstractVecOrMat) where{D}
@set! L.outer = cache_operator(L.outer, uouter)
L
end
-
+using LoopVectorization
function LinearAlgebra.mul!(v::AbstractVecOrMat, L::TensorProductOperator, u::AbstractVecOrMat)
@assert L.isset "cache needs to be set up for operator of type $(typeof(L)).
set up cache by calling cache_operator(L::AbstractSciMLOperator, u::AbstractArray)"
@@ -605,7 +605,6 @@ function LinearAlgebra.mul!(v::AbstractVecOrMat, L::TensorProductOperator, u::Ab
perm = (2, 1, 3)
C1, C2, C3, _ = L.cache
U = _reshape(u, (ni, no*k))
-
"""
v .= kron(B, A) * u
V .= A * U * B'
@@ -619,13 +618,16 @@ function LinearAlgebra.mul!(v::AbstractVecOrMat, L::TensorProductOperator, u::Ab
if L.outer isa IdentityOperator
copyto!(v, C1)
else
- C1 = _reshape(C1, (mi, no, k))
- permutedims!(C2, C1, perm)
- C2 = _reshape(C2, (no, mi*k))
- mul!(C3, L.outer, C2)
- C3 = _reshape(C3, (mo, mi, k))
- V = _reshape(v , (mi, mo, k))
- permutedims!(V, C3, perm)
+ C1 = _reshape(C1, (mi, no, k))
+ Lo = L.outer
+ V = _reshape(v , (mi, mo, k))
+ @turbo for i = 1:mi, j = 1:k, m = 1:mo
+ a = zero(Base.promote_eltype(V,Lo,C2))
+ for o = 1:no
+ a += Lo[m,o] * C1[i, o, j]
+ end
+ V[i,m,j] = a
+ end
With just this diff,
using SciMLOperators, LinearAlgebra, BenchmarkTools
A = TensorProductOperator(rand(12,12), rand(12,12), rand(12,12))
u = rand(12^3, 100)
v = rand(12^3, 100)
A = cache_operator(A, u)
mul!(v, A, u) # dummy
fails, for example.
I could prepare a PR, or someone more familiar with all the dispatches could set them up so that only valid types get funneled to the @turbo methods.
The approach is simply to replace reshape + permutedims! + mul! with a single @turbo loop that does it all.
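A self-contained version of that fused kernel, gated by dispatch so that only plain strided matrices ever reach @turbo (the function name outer_apply! is illustrative, not the package's API; matrix-free or nested outer operators would keep the existing permutedims!/mul! path):
using LinearAlgebra, LoopVectorization

# computes V[i, m, j] = sum_o Lo[m, o] * C1[i, o, j], i.e. the
# permute -> mul! -> permute sequence in a single pass with no temporaries
function outer_apply!(V, Lo::StridedMatrix, C1)
    mi, no, k = size(C1)
    mo = size(Lo, 1)
    @turbo for i in 1:mi, j in 1:k, m in 1:mo
        a = zero(Base.promote_eltype(V, Lo, C1))
        for o in 1:no
            a += Lo[m, o] * C1[i, o, j]
        end
        V[i, m, j] = a
    end
    V
end

# quick correctness check against a reshape/permutedims reference
Lo = rand(12, 12); C1 = rand(12, 12, 100); V = zeros(12, 12, 100)
outer_apply!(V, Lo, C1)
Vref = permutedims(reshape(Lo * reshape(permutedims(C1, (2, 1, 3)), 12, :), 12, 12, 100), (2, 1, 3))
V ≈ Vref   # expected to be true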
New idea: try using PermutedDimsArray instead of permutedims / permutedims!.
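A hedged sketch of that idea: PermutedDimsArray gives a lazy, copy-free permutation, so the open question is whether mul! on the resulting non-contiguous view beats paying for the explicit permutedims! copies (sizes are just the running example; names are illustrative):
using LinearAlgebra

mi, no, mo, k = 12, 12, 12, 100
Lo = rand(mo, no)                        # stand-in for L.outer
C1 = rand(mi, no, k)                     # intermediate after the inner factor
C3 = zeros(mo, mi * k)

# lazy permutation: no copy is made, but mul! now sees a non-BLAS-friendly
# layout, so it may fall back to slower generic kernels
C1p = PermutedDimsArray(C1, (2, 1, 3))   # logical shape (no, mi, k)
mul!(C3, Lo, reshape(C1p, no, mi * k))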