
StackOverflowError in BitMatrix * vector multiplication even for n=10

Open biona001 opened this issue 4 years ago • 8 comments

I was hoping that this package would work for BitMatrix-vector multiplication, which is not supported by BLAS. However, my code throws a StackOverflowError:

using LoopVectorization
using Random
function gemv_avx!(c, A, b)
    @avx for j in 1:size(A, 2), i in 1:size(A, 1)
        c[i] += A[i, j] * b[j]
    end
end
Random.seed!(2020)
n = 10
A = bitrand(n, n)
b = rand(n)
c_avx = zeros(n)

julia> gemv_avx!(c_avx, A, b)
ERROR: StackOverflowError:
Stacktrace:
 [1] vfmadd_fast(::VectorizationBase.Mask{8,UInt8}, ::VectorizationBase.SVec{8,Float64}, ::VectorizationBase.SVec{8,Float64}) at /Users/biona001/.julia/packages/SIMDPirates/ig1yt/src/SIMDPirates.jl:103 (repeats 80000 times)

Issue 61 seems to be related. Did I do something dumb here?

biona001 avatar Mar 31 '20 18:03 biona001

> Issue 61 seems to be related. Did I do something dumb here?

You didn't do anything dumb. It's the same sort of issue: some methods weren't defined, causing the promotion code to get into an infinite loop and eventually overflow the stack. I'll have to come up with a smarter way of doing the promotions.
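
To illustrate the failure mode, here is a minimal standalone example (not LoopVectorization's actual code): a generic fallback that promotes its arguments and retries will recurse forever when no more specific method exists for the promoted types.

add_promoting(x::Int, y::Int) = x + y
# Fallback: promote the arguments and try again. If no method exists for the
# promoted types, the fallback keeps calling itself with identical arguments.
add_promoting(x, y) = add_promoting(promote(x, y)...)

add_promoting(1.0, 2.0)  # ERROR: StackOverflowError (the call repeats until the stack is exhausted)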

But for now, I defined a few missing methods in some of LoopVectorization's dependencies, and also added bit operations to the gemv tests and to the gemm tests.

I'll test with AVX2 later, but it seems fast with AVX512 at least:

julia> using LoopVectorization, Random

julia> B = rand(10,15); C = zeros(10,15);

julia> n = 10
10

julia> A = bitrand(n, n);

julia> function gemm_kernel!(C, A, B)
           @avx for n ∈ 1:size(A, 1), m ∈ 1:size(B, 2)
               Cₙₘ = zero(eltype(C))
               for k ∈ 1:size(A, 2)
                   Cₙₘ += A[n,k] * B[k,m]
               end
               C[n,m] = Cₙₘ
           end
       end
gemm_kernel! (generic function with 1 method)

julia> using BenchmarkTools, LinearAlgebra

julia> @benchmark gemm_kernel!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     90.435 ns (0.00% GC)
  median time:      91.108 ns (0.00% GC)
  mean time:        91.220 ns (0.00% GC)
  maximum time:     132.584 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     956

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  224 bytes
  allocs estimate:  4
  --------------
  minimum time:     1.263 μs (0.00% GC)
  median time:      1.272 μs (0.00% GC)
  mean time:        1.296 μs (0.00% GC)
  maximum time:     2.305 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> function gemm_kernel_noavx!(C, A, B)
           @inbounds @fastmath for n ∈ 1:size(A, 1), m ∈ 1:size(B, 2)
               Cₙₘ = zero(eltype(C))
               @simd for k ∈ 1:size(A, 2)
                   Cₙₘ += A[n,k] * B[k,m]
               end
               C[n,m] = Cₙₘ
           end
       end
gemm_kernel_noavx! (generic function with 1 method)

julia> @benchmark gemm_kernel_noavx!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.030 μs (0.00% GC)
  median time:      2.031 μs (0.00% GC)
  mean time:        2.034 μs (0.00% GC)
  maximum time:     4.757 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> function gemm_kernel_noavx!(C, A, B)
           @inbounds @fastmath for n ∈ 1:size(A, 1), m ∈ 1:size(B, 2)
               Cₙₘ = zero(eltype(C))
               for k ∈ 1:size(A, 2)
                   Cₙₘ += A[n,k] * B[k,m]
               end
               C[n,m] = Cₙₘ
           end
       end
gemm_kernel_noavx! (generic function with 1 method)

julia> @benchmark gemm_kernel_noavx!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.493 μs (0.00% GC)
  median time:      1.496 μs (0.00% GC)
  mean time:        1.499 μs (0.00% GC)
  maximum time:     3.876 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

Codegen could be garbage with AVX2 for all I know, so I'll take a look at it and run benchmarks later.

If you want it to be fast, even for large matrices (several hundred rows and columns), you'd have to follow an approach like this one, where you add extra layers of blocking for better cache efficiency.

I'll clean up the code, add tests, and register that package eventually. I should also get around to adding multithreading support.

Please file new issues if you run into any more problems.

chriselrod avatar Mar 31 '20 21:03 chriselrod

Thank you so much for the lightning response time! I benchmarked a few (large) BitMatrix times dense Float64 vector products:

function gemv_naive!(c, A, b)
    @inbounds for j in 1:size(A, 2)
        @simd for i in 1:size(A, 1)
            c[i] += A[i, j] * b[j]
        end
    end
end

function gemv_avx!(c, A, b)
    @avx for j in 1:size(A, 2), i in 1:size(A, 1)
        c[i] += A[i, j] * b[j]
    end
end

function gemv_avx2!(c, A, b)
    @avx for i in 1:size(A, 1)
        ci = zero(eltype(c))
        for j in 1:size(A, 2)
            ci += A[i, j] * b[j]
        end
        c[i] = ci
    end
end

Random.seed!(2020)
n = 1000
A = bitrand(n, n)
b = rand(n)
c = zeros(n)
c_avx = zeros(n)
c_avx2 = zeros(n)
c_true = zeros(n)

# check correctness
mul!(c_true, A, b)
gemv_avx!(c_avx, A, b)
gemv_avx2!(c_avx2, A, b)
gemv_naive!(c, A, b)
[c c_avx c_avx2 c_true]

# efficiency (A = bitmatrix, 1000x1000)
@benchmark mul!(c_true, A, b)      # 3.411 ms     (Julia's default)
@benchmark gemv_naive!(c, A, b)    # 4.230 ms     (Looping using @simd and @inbounds)
@benchmark gemv_avx!(c_avx, A, b)  # 566.309 μs   (using LoopVectorization)
@benchmark gemv_avx2!(c_avx, A, b) # 572.952 μs   (using LoopVectorization with accumulator)

# efficiency (A = bitmatrix, 10000x10000)
@benchmark mul!(c_true, A, b)      # 341.411 ms  (Julia's default)
@benchmark gemv_naive!(c, A, b)    # 424.198 ms  (Looping using @simd and @inbounds)
@benchmark gemv_avx!(c_avx, A, b)  # 77.201 ms   (using LoopVectorization)
@benchmark gemv_avx2!(c_avx, A, b) # 73.227 ms   (using LoopVectorization with accumulator)

# efficiency (A = bitmatrix, 50000x50000)
@benchmark mul!(c_true, A, b)      # 8.999 s   (Julia's default)
@benchmark gemv_naive!(c, A, b)    # 10.207 s  (Looping using @simd and @inbounds)
@benchmark gemv_avx!(c_avx, A, b)  # 2.685 s   (using LoopVectorization)
@benchmark gemv_avx2!(c_avx, A, b) # 2.197 s   (using LoopVectorization with accumulator)

# efficiency (A = bitmatrix, 100000x100000)
@time mul!(c_true, A, b)      # 37.167032 s   (Julia's default)
@time gemv_naive!(c, A, b)    # 42.665357 s   (Looping using @simd and @inbounds)
@time gemv_avx!(c_avx, A, b)  # 17.452804 s   (using LoopVectorization)
@time gemv_avx2!(c_avx, A, b) # 17.881693 s   (using LoopVectorization with accumulator)

LoopVectorization appears to win in every scenario! If cache efficiency is optimized as you suggested, how much more do you think this would improve? It is wonderful as it is already!

biona001 avatar Mar 31 '20 22:03 biona001

Thanks for the benchmarks!

Notice that gemv requires O(N^2) calculations (i.e., size(A, 1) dot products of length size(A, 2)). That means going from 1_000 x 1_000 to 10_000 x 10_000 should take 100x longer, but it is instead around 125x slower, and the 100_000 x 100_000 case "should" have taken less than 6 seconds rather than roughly 17.
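
As a rough back-of-the-envelope check against the timings in the previous comment (taking ~573 μs as the n = 1_000 baseline):

573e-6 * 100      # ≈ 0.057 s: ideal O(N^2) scaling for n = 10_000;  measured ≈ 0.073-0.077 s
573e-6 * 10_000   # ≈ 5.7 s:   ideal O(N^2) scaling for n = 100_000; measured ≈ 17.5-17.9 s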

Memory bandwidth is a much bigger problem for gemv than it is for gemm (which has the advantage of requiring O(N^3) computation but only O(N^2) memory), but a BitMatrix taking up so much less memory probably helps here.
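
A BitMatrix stores 1 bit per element versus 8 bytes for a Float64, so roughly 64x less data has to stream from memory. A quick way to see the difference (sizes in the comments are approximate):

using Random
A_bit   = bitrand(1_000, 1_000)   # ~125 KB of packed chunks
A_dense = rand(1_000, 1_000)      # ~8 MB
Base.summarysize(A_bit), Base.summarysize(A_dense)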

Here's a bare-bones implementation that doesn't "pack" (copy blocks of A into another array so that their memory is closer together) and doesn't handle remainders beyond the 512 x 512 block size.

using VectorizationBase: gesp, stridedpointer
function gemv_avx_512by512!(c, A, b)
    @avx for j in 1:512, i in 1:512
        c[i] += A[i, j] * b[j]
    end
end
function gemv_tile!(c, A, b)
    M, N = size(A)
    Miter = M >>> 9 # fast div(M, 512)
    Mrem = M & 511 # fast rem(M, 512)
    Niter = N >>> 9
    Nrem = N & 511
    GC.@preserve c A b for n in 0:Niter-1
        for m in 0:Miter-1
            gemv_avx_512by512!(
                gesp(stridedpointer(c), (512m,)),
                gesp(stridedpointer(A), (8m, 8n)),
                gesp(stridedpointer(b), (512n,))
            )
        end
        # TODO: handle mrem
    end
    # TODO: handle nrem
end

The reason for the specific 512 x 512 function is that the stridedpointers don't keep track of size; they were my lazy way of implementing a non-allocating view so that I could multiply in blocks.
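
In other words, the gesp'd pointers above act roughly like non-allocating views of one block each (a conceptual mapping only, not code to run; the trip counts live in the hard-coded 512s of the kernel):

# For the Float64 vectors c and b the offsets are in elements; for the BitMatrix A,
# one gesp step is 64 elements, so (8m, 8n) corresponds to a 512m/512n element offset:
#   gesp(stridedpointer(c), (512m,))   ~  c[512m+1 : 512m+512]
#   gesp(stridedpointer(A), (8m, 8n))  ~  A[512m+1 : 512m+512, 512n+1 : 512n+512]
#   gesp(stridedpointer(b), (512n,))   ~  b[512n+1 : 512n+512]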

If you pursue this, odds are you'll run into more missing methods or corner cases not handled correctly by existing methods. Please don't hesitate to file more issues or open PRs. I need to improve the BitArray support.

Anyway, performance benefit:

julia> n = 1 << 14 # multiple of 512, because we're not handling remainders
16384

julia> A = bitrand(n, n);

julia> b = rand(n); c1 = zero(b); c2 = zero(b);

julia> gemv_tile!(c1, A, b);

julia> gemv_avx!(c2, A, b);

julia> c1 ≈ c2
true

julia> @benchmark gemv_tile!($c1, $A, $b)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     22.663 ms (0.00% GC)
  median time:      22.809 ms (0.00% GC)
  mean time:        22.837 ms (0.00% GC)
  maximum time:     24.563 ms (0.00% GC)
  --------------
  samples:          219
  evals/sample:     1

julia> @benchmark gemv_avx!($c2, $A, $b)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     75.392 ms (0.00% GC)
  median time:      76.570 ms (0.00% GC)
  mean time:        77.944 ms (0.00% GC)
  maximum time:     95.516 ms (0.00% GC)
  --------------
  samples:          65
  evals/sample:     1

julia> @benchmark mul!($c1, $A, $b)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     235.108 ms (0.00% GC)
  median time:      235.994 ms (0.00% GC)
  mean time:        236.042 ms (0.00% GC)
  maximum time:     237.031 ms (0.00% GC)
  --------------
  samples:          22
  evals/sample:     1

More than 3x faster than just the @avx, and over 10x faster than mul!.

Maybe I should have LoopVectorization add extra loops like this automatically.

You can try different block sizes, but I'd keep them a multiple of 64 for now, because gesp on stridedpointers of BitArrays currently has to take steps in increments of 64 elements (hence the (8m, 8n) offsets above: 8 * 64 = 512). I could decrease that to a minimum of 8.
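
For example, a 256 x 256 variant might look like the following (a sketch under the same gesp convention, i.e. one step = 64 elements for the BitMatrix pointer, so 4 * 64 = 256; untested, and the same remainder TODOs apply):

using VectorizationBase: gesp, stridedpointer
function gemv_avx_256by256!(c, A, b)
    @avx for j in 1:256, i in 1:256
        c[i] += A[i, j] * b[j]
    end
end
function gemv_tile256!(c, A, b)
    M, N = size(A)
    Miter = M >>> 8   # fast div(M, 256)
    Niter = N >>> 8
    GC.@preserve c A b for n in 0:Niter-1
        for m in 0:Miter-1
            gemv_avx_256by256!(
                gesp(stridedpointer(c), (256m,)),
                gesp(stridedpointer(A), (4m, 4n)),   # 4 steps of 64 elements = 256 rows/columns
                gesp(stridedpointer(b), (256n,))
            )
        end
        # remainders in M and N still need to be handled, as above
    end
end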

> (using LoopVectorization with accumulator)

In terms of the code that gets generated, it'll use an accumulator for gemv_avx! as well. It'll just load from c and add that to the accumulator before storing.
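
In scalar terms, that amounts to roughly the following (a sketch of the semantics only, not the actual vectorized lowering):

# gemv_avx! effectively seeds the accumulator with the existing value of c[i]
# and stores once at the end, so it adds to c rather than overwriting it:
function gemv_avx_semantics!(c, A, b)
    for i in 1:size(A, 1)
        ci = c[i]                     # load c[i] into the accumulator
        for j in 1:size(A, 2)
            ci += A[i, j] * b[j]
        end
        c[i] = ci                     # single store per row
    end
end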

chriselrod avatar Apr 01 '20 11:04 chriselrod

You are amazing! You have given me enough to work with. I will try to tackle the remainder terms. Thank you for the awesome package and your lightning response times.

biona001 avatar Apr 01 '20 17:04 biona001

The function gemv_tile!(c, A, b) doesn't seem to work with Julia 1.4.1 and these package versions:

[bdcacae8] LoopVectorization v0.8.19
[3d5dd08c] VectorizationBase v0.12.25

With the setup above, I get a StackOverflowError:

StackOverflowError:

Stacktrace:
 [1] check_args(::VectorizationBase.PackedStridedPointer{Float64,0}) at /home/kose/.julia/packages/LoopVectorization/ppQgc/src/condense_loopset.jl:278 (repeats 79984 times)

kose-y avatar Jul 23 '20 04:07 kose-y

I fixed the error, but there seems to have been a severe performance regression here:

julia> @benchmark gemv_tile!($c1, $A, $b)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     42.795 ms (0.00% GC)
  median time:      43.031 ms (0.00% GC)
  mean time:        43.050 ms (0.00% GC)
  maximum time:     43.517 ms (0.00% GC)
  --------------
  samples:          117
  evals/sample:     1

EDIT: Also getting wrong answers.

The wrong answers are because I fixed the gesping for stridedpointers of BitMatrix; we now have to pass 512 directly instead of 8:

function gemv_tile!(c, A, b)
    M, N = size(A)
    Miter = M >>> 9 # fast div(M, 512)
    Mrem = M & 511 # fast rem(M, 512)
    Niter = N >>> 9
    Nrem = N & 511
    GC.@preserve c A b for n in 0:Niter-1
        for m in 0:Miter-1
            gemv_avx_512by512!(
                gesp(stridedpointer(c), (512m,)),
                gesp(stridedpointer(A), (512m, 512n)),
                gesp(stridedpointer(b), (512n,))
            )
        end
        # TODO: handle mrem
    end
    # TODO: handle nrem
end

With this, I get the right answer. Still slow, though (although still faster than the other two versions).

chriselrod avatar Jul 24 '20 09:07 chriselrod

The easiest way to improve performance may be to define a custom BitMatrix/BitArray type whose leading axis is padded to a multiple of 8 elements, so that each column is byte-aligned and we don't need all the checks and shifts we currently have to get the correct loads into a vector.
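
A minimal sketch of what such a padded type could look like (the name, layout, and constructor here are purely illustrative, not an existing package API):

# Illustrative only: a bit matrix whose rows are padded up to a multiple of 64,
# so every column starts at the beginning of a UInt64 chunk (byte- and
# chunk-aligned column loads, no cross-chunk shifts).
struct ColumnPaddedBitMatrix <: AbstractMatrix{Bool}
    data::BitMatrix   # padded storage of size (Mpad, N), Mpad a multiple of 64
    M::Int            # logical number of rows
end

function ColumnPaddedBitMatrix(A::AbstractMatrix{Bool})
    M, N = size(A)
    Mpad = 64 * cld(M, 64)      # round the row count up to a multiple of 64
    data = falses(Mpad, N)
    data[1:M, :] .= A
    ColumnPaddedBitMatrix(data, M)
end

Base.size(A::ColumnPaddedBitMatrix) = (A.M, size(A.data, 2))
Base.@propagate_inbounds Base.getindex(A::ColumnPaddedBitMatrix, i::Int, j::Int) = A.data[i, j]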

Another option: I did add code to LoopVectorization for aligning columns. It didn't seem to help in the cases I tried, but it could help here if we can rely on it and then emit loads without all the checks. This would preclude unrolling the j loop, and thus lead to much worse performance than the former (custom type) option, but with the benefit of possibly improving multi-dimensional BitArray performance in general.

chriselrod avatar Jul 24 '20 10:07 chriselrod

There already is a padded data structure. How do I get rid of all the checks?

kose-y avatar Jul 28 '20 10:07 kose-y