LoopVectorization.jl
BitArray support?
In some cases, @turbo is far slower on BitArrays than on anything else, and in other cases check_args fails and the vectorization doesn't happen. I can't replicate the check_args failure, but will return with an MWE when I figure that out.
using LoopVectorization, BenchmarkTools
D = rand(10000, 20000); D_bit = D .> .5; D_bool = Array{Bool}(D_bit)
function test_sum_turbo(data::D) where {T, D <: AbstractArray{T,2}}
    contribution = 0
    @turbo for n ∈ axes(data, 1), t ∈ axes(data, 2)
        contribution += data[n,t]
    end
    return contribution
end
julia> @btime test_sum_turbo($D)
45.742 ms (0 allocations: 0 bytes)
9.999992255027126e7
julia> @btime test_sum_turbo($D_bit)
69.028 ms (0 allocations: 0 bytes)
99998730
julia> @btime test_sum_turbo($D_bool)
15.711 ms (0 allocations: 0 bytes)
99998730
Hmm, I get
julia> @btime test_sum_turbo($D)
94.021 ms (0 allocations: 0 bytes)
9.999797263890678e7
julia> @btime test_sum_turbo($D_bit)
14.198 ms (0 allocations: 0 bytes)
99995026
julia> @btime test_sum_turbo($D_bool)
15.336 ms (0 allocations: 0 bytes)
99995026
julia> versioninfo()
Julia Version 1.9.0-DEV.79
Commit df81bf9a96 (2022-02-24 07:30 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 28 × Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
What CPU are you using?
For me, the Float64 case is totally dominated by memory bandwidth.
I can't replicate the check_args failing
check_args fails when (ndims(D_bit) > 1) && (size(D_bit,1) % 8 != 0).
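For reference, a minimal sketch of that condition as a standalone predicate (this just restates the observation above; bitarray_falls_back is a hypothetical helper, not part of LoopVectorization):
# Hedged sketch: the observed condition under which @turbo falls back to a
# plain loop for a BitArray argument (hypothetical helper name).
bitarray_falls_back(A::BitArray) = ndims(A) > 1 && size(A, 1) % 8 != 0

B_ok  = falses(16, 100)   # size(B_ok, 1) % 8 == 0  -> vectorized path
B_bad = falses(10, 100)   # size(B_bad, 1) % 8 != 0 -> check_args fails
bitarray_falls_back(B_ok), bitarray_falls_back(B_bad)   # (false, true)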
Oh nooo is this because I switched CPUs?
edit: Yup, Ryzen 9 5000 series doesn't have AVX512 support. That'll account for the entire discrepancy? Is this going to keep slowing me down by factors of 5-10x? I'm still in the return window...
Edit edit: Ah, actually it looks like our times are comparable except in the single case of D_bit. Is there anywhere else this would pop up as a problem or do I just need to avoid BitArrays with LoopVec?
Julia Version 1.6.5
Commit 9058264a69 (2021-12-19 12:30 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: AMD Ryzen 9 5900HS with Radeon Graphics
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS =
edit: Yup, Ryzen 9 5000 series doesn't have AVX512 support. That'll account for the entire discrepancy? Is this going to keep slowing me down by factors of 5-10x? I'm still in the return window...
Depends on your workload.
I'm surprised that the Ryzen seems to have 2x the memory bandwidth on a single core, given it was twice as fast for Float64.
Odds are the AVX512 CPU is faster when we make the arrays far smaller:
julia> D = rand(1000, 20); D_bit = D .> .5; D_bool = Array{Bool}(D_bit);
julia> @btime test_sum_turbo($D)
974.062 ns (0 allocations: 0 bytes)
10002.523135950083
julia> @btime test_sum_turbo($D_bit)
1.397 μs (0 allocations: 0 bytes)
10039
julia> @btime test_sum_turbo($D_bool)
1.006 μs (0 allocations: 0 bytes)
10039
But of course, you should care most about the size range and specific problems you're working on.
If you don't need BitArrays, seems like it may be best to avoid them.
But I'm a little surprised, too, that I'm getting worse performance for the BitArray than for the Array{Bool} here.
Looking at the assembly...it is awful. Embarrassingly bad.
.LBB0_7: # %L220
# Parent Loop BB0_4 Depth=1
# => This Inner Loop Header: Depth=2
mov rcx, rdi
sar rcx, 3
lea r12, [rax + rcx]
kmovb k2, byte ptr [rax + rcx]
kmovb k3, byte ptr [rdx + r12]
kmovb k4, byte ptr [rsi + r12]
kmovb k5, byte ptr [rbx + r12]
vpsubq zmm9 {k2}, zmm16, zmm0
vpsubq zmm10 {k3}, zmm15, zmm0
vpsubq zmm11 {k4}, zmm14, zmm0
vpsubq zmm12 {k5}, zmm13, zmm0
add rdi, 8
vmovdqa64 zmm13, zmm12
vmovdqa64 zmm14, zmm11
vmovdqa64 zmm15, zmm10
vmovdqa64 zmm16, zmm9
vmovdqa64 zmm3, zmm9
vmovdqa64 zmm2, zmm10
vmovdqa64 zmm4, zmm11
vmovdqa64 zmm1, zmm12
vmovdqa64 zmm8, zmm9
vmovdqa64 zmm7, zmm10
vmovdqa64 zmm6, zmm11
vmovdqa64 zmm5, zmm12
cmp r11, rdi
jge .LBB0_7
Sure, let's do a little bit of arithmetic, and then do 3x the work by copying the memory all over the place for no reason.
Yup, Intel has the edge on the smaller size by ~20% with floats and ~40% with bools (and ~700% with BitArrays).
julia> D = rand(1000, 20); D_bit = D .> .5; D_bool = Array{Bool}(D_bit);
julia> @btime test_sum_turbo($D)
1.150 μs (0 allocations: 0 bytes)
10026.229647114582
julia> @btime test_sum_turbo($D_bit)
8.800 μs (0 allocations: 0 bytes)
9985
julia> @btime test_sum_turbo($D_bool)
1.390 μs (0 allocations: 0 bytes)
9985
BitArrays aren't necessary for my computation, they just fell out naturally from A .> .5; I can convert to Array{Bool} easily. Unfortunately, the sizes I work with include both tiny 20x100 and bigger 20x100000000, though more of the smaller.
I'm satisfied enough with the explanation that it's my processor, so I'll close this, unless you want to track something with that inefficient BitArray assembly. Also, I'd be happy to put a note about the BitArray size(A,1) % 8 restriction somewhere in the docs, if it belongs anywhere. It's niche enough it might just belong in a searchable issue, i.e. here.
Sure, a PR would be welcome.
The assembly isn't great for the other methods either. You could leave this issue open to track that.
I have an idea of something that might coax LLVM into doing the right thing that I'd be happy to explain if anyone wants to give it a try. Otherwise, I won't address it myself. I'm spending all the free time that I can dedicate to SIMD to rewriting this library, rather than working on the old. I wouldn't expect the rewrite to be ready any time soon.
Ok, I'll do the PR this weekend.
If you want to explain it, I'd be happy to give it a shot. This library has been more than helpful enough that I'd invest some time to figure out how it works. No promises that I'll understand it well enough, though.
If you want to explain it, I'd be happy to give it a shot.
Basically, I think this copying is related to the duplications we get in the phi nodes:
L216: ; preds = %L216, %L170
%29 = phi i64 [ %35, %L216 ], [ %28, %L170 ]
%value_phi122817 = phi i64 [ %res.i645, %L216 ], [ 0, %L170 ]
%value_phi105816 = phi <8 x i64> [ %res.i647, %L216 ], [ %value_phi81852, %L170 ]
%value_phi104815 = phi <8 x i64> [ %res.i649, %L216 ], [ %value_phi80851, %L170 ]
%value_phi103814 = phi <8 x i64> [ %res.i651, %L216 ], [ %value_phi79850, %L170 ]
%value_phi102813 = phi <8 x i64> [ %res.i653, %L216 ], [ %value_phi78849, %L170 ]
%value_phi101812 = phi <8 x i64> [ %res.i647, %L216 ], [ %value_phi77848, %L170 ]
%value_phi100811 = phi <8 x i64> [ %res.i649, %L216 ], [ %value_phi76847, %L170 ]
%value_phi99810 = phi <8 x i64> [ %res.i651, %L216 ], [ %value_phi75846, %L170 ]
%value_phi98809 = phi <8 x i64> [ %res.i653, %L216 ], [ %value_phi74845, %L170 ]
%shiftedind.i678 = ashr i64 %29, 3
%ptr.1.i679707 = getelementptr inbounds i8, i8* %18, i64 %shiftedind.i678
%30 = bitcast i8* %ptr.1.i679707 to <8 x i1>*
%res1.i676708 = load <8 x i1>, <8 x i1>* %30, align 1
%ptr.1.i670709 = getelementptr inbounds i8, i8* %ptr.1.i679707, i64 %24
%31 = bitcast i8* %ptr.1.i670709 to <8 x i1>*
%res1.i671710 = load <8 x i1>, <8 x i1>* %31, align 1
%ptr.1.i665711 = getelementptr inbounds i8, i8* %ptr.1.i679707, i64 %25
%32 = bitcast i8* %ptr.1.i665711 to <8 x i1>*
%res1.i666712 = load <8 x i1>, <8 x i1>* %32, align 1
%ptr.1.i660713 = getelementptr inbounds i8, i8* %ptr.1.i679707, i64 %26
%33 = bitcast i8* %ptr.1.i660713 to <8 x i1>*
%res1.i661714 = load <8 x i1>, <8 x i1>* %33, align 1
%res.i657 = add <8 x i64> %value_phi98809, <i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1>
%res.i656 = add <8 x i64> %value_phi99810, <i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1>
%res.i655 = add <8 x i64> %value_phi100811, <i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1>
%res.i654 = add <8 x i64> %value_phi101812, <i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1>
%res.i653 = select <8 x i1> %res1.i676708, <8 x i64> %res.i657, <8 x i64> %value_phi102813
%res.i651 = select <8 x i1> %res1.i671710, <8 x i64> %res.i656, <8 x i64> %value_phi103814
%res.i649 = select <8 x i1> %res1.i666712, <8 x i64> %res.i655, <8 x i64> %value_phi104815
%res.i647 = select <8 x i1> %res1.i661714, <8 x i64> %res.i654, <8 x i64> %value_phi105816
%res.i645 = add nsw i64 %value_phi122817, -8
%34 = add i64 %27, %res.i645
%35 = sub i64 0, %34
%.not706 = icmp slt i64 %.neg, %35
br i1 %.not706, label %L247, label %L216
If you look at the @code_typed, you'll see messes of φ nodes already at the Julia level.
So, the basic idea is to avoid this by not reassigning, and instead updating memory.
Then, at the LLVM level, we should be able to trust mem2reg to convert these stack memory operations back into phi nodes, but (hopefully) doing a much better job than Julia does.
In other words, what we want to do is instead of generating code like
accumulator = 0
for i in ...
    accumulator = accumulator + ...
end
return accumulator
we want to generate code like
accumulator = Ref(0)
for i in ...
    accumulator[] = accumulator[] + ...
end
return accumulator[]
So, instead of reassignments (requiring phi nodes) we just modify memory.
Ref should be stack allocated here (it doesn't escape), and then LLVM can convert the latter back into the former, but hopefully do a better job than Julia.
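As a standalone illustration of the pattern (not the code LoopVectorization actually generates), the two shapes look like this; since the Ref never escapes, it should be stack-allocated and both versions should compile to the same loop with zero allocations:
# Reassignment form: each iteration rebinds acc, which shows up as phi nodes
# in the IR.
function sum_reassign(x)
    acc = zero(eltype(x))
    for xi in x
        acc = acc + xi
    end
    return acc
end

# Ref form: the accumulator lives in (stack-allocatable) memory and is mutated
# in place; LLVM's mem2reg can turn it back into registers/phi nodes.
function sum_ref(x)
    acc = Ref(zero(eltype(x)))
    for xi in x
        acc[] = acc[] + xi
    end
    return acc[]
end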
So, we need to change how these variables are initialized, used in computations, assigned to, and finally returned.
Here is the function that initializes outer reductions: https://github.com/JuliaSIMD/LoopVectorization.jl/blob/4fecc5748494ed438f243e16849589c8312e60dc/src/codegen/lowering.jl#L516-L542
For both using them in computations and for assigning to them, you'd need to edit lower_compute.
This would probably be the most annoying one, but you can already see a check for whether a variable is an outer reduction here:
https://github.com/JuliaSIMD/LoopVectorization.jl/blob/4fecc5748494ed438f243e16849589c8312e60dc/src/codegen/lower_compute.jl#L508
If so, you can then do the appropriate [] on reference and assignment.
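Roughly, that means a read of acc becomes acc[] and an assignment acc = rhs becomes acc[] = rhs in the generated Expr. A hedged sketch at the Expr level (deref and ref_assign are hypothetical helper names, not existing functions in the package):
# Hypothetical helpers for rewriting an outer-reduction variable's reads and
# writes to go through a Ref.
deref(var::Symbol) = Expr(:ref, var)                        # acc     -> acc[]
ref_assign(var::Symbol, rhs) = Expr(:(=), deref(var), rhs)  # acc = r -> acc[] = r

ref_assign(:acc, :(vadd(acc[], x)))   # returns :(acc[] = vadd(acc[], x))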
Finally, you'd need to handle the returns, which hopefully doesn't require much more than these two locations: https://github.com/JuliaSIMD/LoopVectorization.jl/blob/4fecc5748494ed438f243e16849589c8312e60dc/src/reconstruct_loopset.jl#L588 This handles the values just before returning. You could define a function like
@inline unwrapref(x::Base.RefValue) = x[]
@inline unwrapref(x) = x
and then call that function like
extractt = Expr(:call, GlobalRef(Core,:getfield), Expr(:call, unwrapref, Symbol("#vargs#")), (offset+=1), false)
This will handle the case where it isn't a Ref, which could be because of this:
https://github.com/JuliaSIMD/LoopVectorization.jl/blob/4fecc5748494ed438f243e16849589c8312e60dc/src/codegen/lowering.jl#L614
This code reduces multiple vecs (when unrolled) to a single one. You don't have to Ref-wrap what gets returned here, because it should be handled by unwrapref above.
You'll also be able to delete this code, as it was another attempt at working around it: https://github.com/JuliaSIMD/LoopVectorization.jl/blob/4fecc5748494ed438f243e16849589c8312e60dc/src/codegen/lowering.jl#L405-L408
Something that may be annoying is that sometimes outer reduction variables can change their size/name.
E.g., it'll 4x unroll for part of it, naming the variables something like gensymedname_4, and then 2x unroll, renaming them to gensymedname_2.
That'd be partly covered by the init expr, but would also need this:
https://github.com/JuliaSIMD/LoopVectorization.jl/blob/4fecc5748494ed438f243e16849589c8312e60dc/src/codegen/lowering.jl#L567