SciMLOperators.jl

make tensor products faster

Open vpuri3 opened this issue 2 years ago • 22 comments

With the current @views-based implementation:

https://github.com/SciML/SciMLOperators.jl/blob/746c0d50a75067fae4121547a3c00bd4bbe27ce9/src/sciml.jl#L584-L618

using SciMLOperators, LinearAlgebra
using BenchmarkTools

A = TensorProductOperator(rand(12,12), rand(12,12), rand(12,12))

u = rand(12^3, 100)
v = rand(12^3, 100)

A = cache_operator(A, u)

mul!(v, A, u) # dummy
@btime mul!($v, $A, $u); # 16.901 ms (348801 allocations: 45.51 MiB)

B = convert(AbstractMatrix, A)
mul!(v, B, u) # dummy
@btime mul!($v, $B, $u); # 10.337 ms (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.8.0-rc1
Commit 6368fdc6565 (2022-05-27 18:33 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.4.0)
  CPU: 4 × Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
  Threads: 4 on 4 virtual cores
Environment:
  JULIA_NUM_PRECOMPILE_TASKS = 4
  JULIA_DEPOT_PATH = /Users/vp/.julia
  JULIA_NUM_THREADS = 4

vpuri3 avatar Jun 17 '22 12:06 vpuri3

using SciMLOperators, LinearAlgebra
using BenchmarkTools

A = TensorProductOperator(rand(12,12), rand(12,12), rand(12,12))

u = [rand(12^3) for i=1:100]
v = [rand(12^3) for i=1:100]

A = cache_operator(A, u[1])

mul!(v[1], A, u[1]) # dummy
@btime for i=1:length(u); mul!($v[i], $A, $u[i]); end; # 7.333 ms (3301 allocations: 27.05 MiB)

B = convert(AbstractMatrix, A)
mul!(v[1], B, u[1]) # dummy
@btime for i=1:length(u); mul!($v[i], $B, $u[i]); end; # 129.053 ms (101 allocations: 3.16 KiB)

So applying the operator to individual vectors is indeed faster, likely because we avoid the views call.

vpuri3 avatar Jun 17 '22 12:06 vpuri3
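
A minimal side-by-side comparison (with a plain matrix standing in for the operator, and hypothetical variable names) that isolates the cost of the per-column view pattern versus standalone vectors; timings are machine-dependent and not claimed here:

using LinearAlgebra, BenchmarkTools

M  = rand(12, 12)
U  = rand(12, 1200)                  # batched storage, sliced by views
V  = similar(U)
us = [U[:, i] for i in 1:1200]       # the same data as standalone vectors
vs = [similar(u) for u in us]

# per-column views into the batched array
@btime for i in 1:size($U, 2)
    mul!(view($V, :, i), $M, view($U, :, i))
end

# standalone vectors, no view construction
@btime for i in eachindex($us)
    mul!($vs[i], $M, $us[i])
end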

@chriselrod @ChrisRackauckas can you take a look please?

vpuri3 avatar Jun 17 '22 13:06 vpuri3

I think the 'slow' part is that we are applying the operator to the column vectors of u one at a time: https://github.com/SciML/SciMLOperators.jl/blob/746c0d50a75067fae4121547a3c00bd4bbe27ce9/src/sciml.jl#L608-L610

vpuri3 avatar Jun 17 '22 14:06 vpuri3
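
For reference, the batched formulation rests on the standard vec/Kronecker identity, which can be checked directly (generic names here, not tied to SciMLOperators' inner/outer convention):

using LinearAlgebra

A, B = rand(3, 3), rand(4, 4)
X    = rand(4, 3)                    # size(B, 2) × size(A, 2)

# (A ⊗ B) * vec(X) == vec(B * X * Aᵀ), which is what lets the per-column
# matvecs be batched into two dense matrix products
@assert kron(A, B) * vec(X) ≈ vec(B * X * transpose(A))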

There's definitely something I'm doing wrong, because we are slower than LinearMaps even though LinearMaps allocates in its mul!.

julia> using LinearMaps, LinearAlgebra, BenchmarkTools; A = ⊗(rand(12,12), rand(12,12), rand(12,12)); u=rand(12^3,100); v=copy(u); @btime mul!($v, $A, $u);
  5.142 ms (8500 allocations: 27.11 MiB)

https://github.com/JuliaLinearAlgebra/LinearMaps.jl/blob/master/src/kronecker.jl#L117-L131

vpuri3 avatar Jun 17 '22 14:06 vpuri3

@danielwe

vpuri3 avatar Jun 17 '22 14:06 vpuri3

I think the 'slow' part is that we are applying the operator to the column vectors of u one at a time

https://github.com/SciML/SciMLOperators.jl/blob/746c0d50a75067fae4121547a3c00bd4bbe27ce9/src/sciml.jl#L608-L610

You could try replacing that with an @turbo or @tturbo loop.

chriselrod avatar Jun 17 '22 19:06 chriselrod

@turbo didn't recognize mul! when I checked. And L.outer is not necessarily an AbstractMatrix (e.g. it can be an AbstractSciMLOperator, since we nest operators: TensorProductOperator(A, B, C) === TensorProductOperator(A, TPPOp(B, C))), so it can't necessarily be indexed. So we can't really get rid of the mul! unless we come up with a general implementation of TensorProductOperator for an arbitrary number of arguments.

vpuri3 avatar Jun 17 '22 20:06 vpuri3
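
The nesting works because the Kronecker product is associative; a quick check with plain matrices (illustrative only):

using LinearAlgebra

A, B, C = rand(2, 2), rand(3, 3), rand(4, 4)

# TensorProductOperator(A, B, C) can be stored as the nested pair
# (A, TensorProductOperator(B, C)) because grouping does not change the product
@assert kron(A, kron(B, C)) ≈ kron(kron(A, B), C)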

I meant deleting mul! and replacing it with loops.

chriselrod avatar Jun 18 '22 00:06 chriselrod

You can't assume operators are indexable; it's not part of the interface (for a good reason: the operator may be matrix-free). If you need to do that, we can add a trait for fast indexing (extending the ArrayInterface one), but I'm not sure that's useful, since in most cases the operator should be matrix-free.

ChrisRackauckas avatar Jun 18 '22 00:06 ChrisRackauckas

No indexing, but views are okay?

We could have a special overload for certain types.

chriselrod avatar Jun 18 '22 00:06 chriselrod

@chriselrod I'm not sure how views without indexing would work.

https://github.com/SciML/SciMLOperators.jl/blob/746c0d50a75067fae4121547a3c00bd4bbe27ce9/src/sciml.jl#L608-L610

In this case, the slices C[:, :, i] and V[:, :, i] are passed into mul! as SubArrays, but L.outer isn't guaranteed to be indexable.

vpuri3 avatar Jun 18 '22 00:06 vpuri3

In https://github.com/SciML/SciMLOperators.jl/pull/59 I've written an implementation using permutedims! that is faster than LinearMaps.jl and doesn't allocate. Could you take a look and suggest improvements?

vpuri3 avatar Jun 18 '22 00:06 vpuri3
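
A standalone sketch of the permutedims!-based algorithm, with plain matrices standing in for the inner/outer operators (tensor_mul!, Ai, Ao, and the cache arrays are illustrative names, not the actual internals of that PR):

using LinearAlgebra

# v = kron(Ao, Ai) * u, applied to every column of u in batched form:
# apply the inner factor, transpose the middle axes, apply the outer factor.
function tensor_mul!(v, Ao, Ai, u, C1, C2, C3)
    mi, ni = size(Ai)
    mo, no = size(Ao)
    k = size(u, 2)

    U = reshape(u, (ni, no * k))
    mul!(C1, Ai, U)                                          # inner op on all slices at once

    permutedims!(C2, reshape(C1, (mi, no, k)), (2, 1, 3))    # bring the outer index forward
    mul!(C3, Ao, reshape(C2, (no, mi * k)))                  # outer op on all slices at once

    V = reshape(v, (mi, mo, k))
    permutedims!(V, reshape(C3, (mo, mi, k)), (2, 1, 3))     # back to (inner, outer, batch)
    return v
end

# checked against a dense Kronecker product
mi = ni = mo = no = 12; k = 100
Ai, Ao = rand(mi, ni), rand(mo, no)
u  = rand(ni * no, k); v = zeros(mi * mo, k)
C1 = zeros(mi, no * k); C2 = zeros(no, mi, k); C3 = zeros(mo, mi * k)
tensor_mul!(v, Ao, Ai, u, C1, C2, C3)
@assert v ≈ kron(Ao, Ai) * u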

No indexing, but views are okay?

Views are not okay. Views and indexing are not allowed on the operator. They're fine on the vector being acted on, though. It's A*v where A is a linear operator which may not necessarily have an easy matrix representation. A[i,:] may only be defined by matrix-vector multiplications (i.e. very slow), so direct calculation of A*v is preferred. While there is MatrixOperator, that's probably one of the only cases where we can assume it (the operator) has fast indexing.

ChrisRackauckas avatar Jun 18 '22 02:06 ChrisRackauckas
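
A small illustration of that point, using a hypothetical matrix-free operator (just a stencil-applying function, nothing from the package): recovering a single column already costs a full operator application.

using LinearAlgebra

# matrix-free action v = L*u for the 1-D Laplacian stencil
#   (L*u)[i] = u[i-1] - 2u[i] + u[i+1]   (zero boundary)
function laplacian_mul!(v::AbstractVector, u::AbstractVector)
    n = length(u)
    for i in 1:n
        left  = i > 1 ? u[i-1] : zero(eltype(u))
        right = i < n ? u[i+1] : zero(eltype(u))
        v[i]  = left - 2u[i] + right
    end
    return v
end

n  = 8
e3 = zeros(n); e3[3] = 1.0
col3 = laplacian_mul!(similar(e3), e3)   # "L[:, 3]" requires an entire matvec with e₃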

While there is MatrixOperator, that's probably one of the only cases where we can assume it (the operator) has fast indexing.

What about GPU matrices? I don't think you can assume fast indexing for MatrixOperator either.

(I don't have anything to contribute with regards to the issue at hand, sorry.)

danielwe avatar Jun 18 '22 04:06 danielwe

Yeah, fast_indexing on MatrixOperator would have to check the ArrayInterface fast_indexing function.

ChrisRackauckas avatar Jun 18 '22 04:06 ChrisRackauckas
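
A sketch of what such a check might look like, assuming the trait ArrayInterface currently exposes is fast_scalar_indexing (the helper name and its placement are guesses, not the package's API):

using ArrayInterface

# hypothetical helper: true when scalar indexing of the wrapped matrix is cheap
has_fast_indexing(A::AbstractMatrix) = ArrayInterface.fast_scalar_indexing(typeof(A))

has_fast_indexing(rand(4, 4))   # true for a dense CPU Array
# for a GPU array (e.g. a CuArray) this would be expected to return false,
# so indexing-based fast paths should not be taken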

With the current implementation (https://github.com/SciML/SciMLOperators.jl/blob/c9488881763bf838898f212f3aa071e4e987bc53/src/sciml.jl#L684-L724):

using SciMLOperators, LinearAlgebra, BenchmarkTools

u = rand(12^2, 100)
v = rand(12^2, 100)

A = rand(12, 12) # outer 
B = rand(12, 12) # inner

T = TensorProductOperator(A, B)
T = cache_operator(T, u)

@btime mul!($v, $T, $u); # 80.416 μs (7 allocations: 304 bytes)


# reference computation - a single matvec
u = reshape(u, (12, 12*100))
v = reshape(v, (12, 12*100))
@btime mul!($v, $A, $u); # 24.005 μs (0 allocations: 0 bytes)

Gotta track those allocations.

vpuri3 avatar Jun 18 '22 14:06 vpuri3
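
One way to chase those allocations down, reusing T, u, v from the snippet above (the measurement approach is a suggestion, not something from the package):

using BenchmarkTools

@ballocated mul!($v, $T, $u)   # bytes allocated per call, minimum over samples

# for line-level attribution, one option is rerunning the script under
#   julia --track-allocation=user
# and inspecting the generated *.mem files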

The two permutedims! calls together take about as much time as a single mul!(). So we should add more cases to the L.outer matvec computation to avoid the permutedims! wherever possible.

vpuri3 avatar Jun 20 '22 13:06 vpuri3
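
A standalone micro-benchmark of that claim, comparing one permutedims! against one mul! on arrays of the sizes used in this thread (illustrative; no numbers claimed):

using LinearAlgebra, BenchmarkTools

ni, no, k = 12, 12, 100
C  = rand(ni, no, k)
Cp = similar(C, no, ni, k)
A  = rand(no, no)
U  = rand(no, ni * k)
V  = similar(U)

@btime permutedims!($Cp, $C, (2, 1, 3));   # one transposition of the cache
@btime mul!($V, $A, $U);                   # one batched matvec over the same data volume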

FYI currently we are at:

using SciMLOperators, LinearAlgebra, BenchmarkTools

A = TensorProductOperator(rand(12,12), rand(12,12), rand(12,12))

u = rand(12^3, 100)
v = rand(12^3, 100)

A = cache_operator(A, u)

mul!(v, A, u) # dummy
@btime mul!($v, $A, $u); # 4.916 ms (17 allocations: 31.36 KiB)

using SciMLOperators, LinearAlgebra, BenchmarkTools

u = rand(12^2, 100)
v = rand(12^2, 100)

A = rand(12, 12) # outer 
B = rand(12, 12) # inner

T = TensorProductOperator(A, B)
T = cache_operator(T, u)

@btime mul!($v, $T, $u); # 80.081 μs (7 allocations: 304 bytes)

nothing

vpuri3 avatar Jun 21 '22 15:06 vpuri3

Locally, I tried using @turbo in one of the mul! functions in sciml.jl. Master:

julia> @benchmark mul!($v, $T, $u)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  44.266 μs … 123.030 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     46.153 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   46.537 μs ±   1.868 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▄▇█▇▅▃▁
  ▁▁▁▂▂▃▆███████▇▅▄▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  44.3 μs         Histogram: frequency by time         54.8 μs <

 Memory estimate: 304 bytes, allocs estimate: 7.

With @turbo:

julia> @benchmark mul!($v, $T, $u)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.194 μs … 84.525 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.054 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.176 μs ±  1.133 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▃▄▇██▇▇▄▃
  ▂▂▃▄▆███████████▇▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▂ ▃
  18.2 μs         Histogram: frequency by time        22.9 μs <

 Memory estimate: 96 bytes, allocs estimate: 3.

This was with

using SciMLOperators, LinearAlgebra, BenchmarkTools

u = rand(12^2, 100)
v = rand(12^2, 100)

A = rand(12, 12) # outer 
B = rand(12, 12) # inner

T = TensorProductOperator(A, B)
T = cache_operator(T, u)

Using @tturbo instead:

julia> @benchmark mul!($v, $T, $u)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  8.935 μs … 101.421 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.533 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.667 μs ±   1.218 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▄▇██▅▃▃▂▃▄▂
  ▂▂▃▅█████████████▆▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  8.94 μs         Histogram: frequency by time        12.5 μs <

 Memory estimate: 96 bytes, allocs estimate: 3.

About 5x faster. Note that LinearAlgebra.mul! is also multithreaded.

chriselrod avatar Jun 21 '22 18:06 chriselrod

The code would need the dispatches set up correctly so it behaves properly as a function of input types.

diff --git a/src/sciml.jl b/src/sciml.jl
index 5291b1d..f482517 100644
--- a/src/sciml.jl
+++ b/src/sciml.jl
@@ -593,7 +593,7 @@ function cache_internals(L::TensorProductOperator, u::AbstractVecOrMat) where{D}
     @set! L.outer = cache_operator(L.outer, uouter)
     L
 end
-
+using LoopVectorization
 function LinearAlgebra.mul!(v::AbstractVecOrMat, L::TensorProductOperator, u::AbstractVecOrMat)
     @assert L.isset "cache needs to be set up for operator of type $(typeof(L)).
     set up cache by calling cache_operator(L::AbstractSciMLOperator, u::AbstractArray)"
@@ -605,7 +605,6 @@ function LinearAlgebra.mul!(v::AbstractVecOrMat, L::TensorProductOperator, u::Ab
     perm = (2, 1, 3)
     C1, C2, C3, _ = L.cache
     U = _reshape(u, (ni, no*k))
-
     """
         v .= kron(B, A) * u
         V .= A * U * B'
@@ -619,13 +618,16 @@ function LinearAlgebra.mul!(v::AbstractVecOrMat, L::TensorProductOperator, u::Ab
         if L.outer isa IdentityOperator
             copyto!(v, C1)
         else
-            C1 = _reshape(C1, (mi, no, k))
-            permutedims!(C2, C1, perm)
-            C2 = _reshape(C2, (no, mi*k))
-            mul!(C3, L.outer, C2)
-            C3 = _reshape(C3, (mo, mi, k))
-            V  = _reshape(v , (mi, mo, k))
-            permutedims!(V, C3, perm)
+          C1 = _reshape(C1, (mi, no, k))
+          Lo = L.outer
+          V  = _reshape(v , (mi, mo, k))
+          @turbo for i = 1:mi, j = 1:k, m = 1:mo
+            a = zero(Base.promote_eltype(V,Lo,C2))
+            for o = 1:no
+              a += Lo[m,o] * C1[i, o, j]
+            end
+            V[i,m,j] = a
+          end

With just this diff,

using SciMLOperators, LinearAlgebra, BenchmarkTools

A = TensorProductOperator(rand(12,12), rand(12,12), rand(12,12))

u = rand(12^3, 100)
v = rand(12^3, 100)

A = cache_operator(A, u)

mul!(v, A, u) # dummy

fails, for example.

chriselrod avatar Jun 21 '22 18:06 chriselrod

I could prepare a PR, or someone more familiar with the dispatches could set them up so that only valid types are funneled to the @turbo methods.

The approach is simply to replace reshape + permutedims! + mul! with a single @turbo loop that does it all.

chriselrod avatar Jun 21 '22 18:06 chriselrod
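
A self-contained sketch of that idea with plain arrays (not the actual SciMLOperators internals; outer_apply_turbo! and its arguments are illustrative names): the permute-multiply-permute step V[i, m, j] = Σ_o Ao[m, o] * C1[i, o, j] is written as one @turbo loop and checked against the permutedims!-based formulation.

using LinearAlgebra, LoopVectorization

function outer_apply_turbo!(V, Ao, C1)
    mi, no, k = size(C1)
    mo = size(Ao, 1)
    @turbo for i in 1:mi, j in 1:k, m in 1:mo
        acc = zero(eltype(V))
        for o in 1:no
            acc += Ao[m, o] * C1[i, o, j]   # contract over the outer operator's columns
        end
        V[i, m, j] = acc
    end
    return V
end

# reference check against reshape + permutedims + mul!
mi, no, mo, k = 12, 12, 12, 100
Ao, C1 = rand(mo, no), rand(mi, no, k)
V = zeros(mi, mo, k)
outer_apply_turbo!(V, Ao, C1)
Vref = permutedims(reshape(Ao * reshape(permutedims(C1, (2, 1, 3)), (no, mi * k)), (mo, mi, k)), (2, 1, 3))
@assert V ≈ Vref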

New idea: try using PermutedDimsArray instead of permutedims/permutedims!.

vpuri3 avatar Jun 05 '23 14:06 vpuri3
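
For reference, PermutedDimsArray (from Base) is a lazy, allocation-free permuted view, so it could in principle replace the explicit permutedims! copies; whether mul! on the resulting non-contiguous view is actually faster than copying would need benchmarking. A quick illustration:

C1 = rand(12, 12, 100)
P  = PermutedDimsArray(C1, (2, 1, 3))   # lazy view, no copy made

@assert P[2, 1, 3] == C1[1, 2, 3]       # indices are permuted lazily
C1[1, 2, 3] = 42.0
@assert P[2, 1, 3] == 42.0              # the view shares memory with its parent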