LoopVectorization.jl
BitArray support?
In some cases, @turbo is far slower on BitArrays than on anything else, and in other cases check_args fails and the vectorization doesn't happen. I can't replicate the check_args failure, but will return with an MWE when I figure that out.
using LoopVectorization, BenchmarkTools
D = rand(10000, 20000); D_bit = D .> .5; D_bool = Array{Bool}(D_bit)
function test_sum_turbo(data::D) where {T, D <: AbstractArray{T,2}}
    contribution = 0
    @turbo for n ∈ axes(data, 1), t ∈ axes(data, 2)
        contribution += data[n,t]
    end
    return contribution
end
julia> @btime test_sum_turbo($D)
45.742 ms (0 allocations: 0 bytes)
9.999992255027126e7
julia> @btime test_sum_turbo($D_bit)
69.028 ms (0 allocations: 0 bytes)
99998730
julia> @btime test_sum_turbo($D_bool)
15.711 ms (0 allocations: 0 bytes)
99998730
Hmm, I get
julia> @btime test_sum_turbo($D)
94.021 ms (0 allocations: 0 bytes)
9.999797263890678e7
julia> @btime test_sum_turbo($D_bit)
14.198 ms (0 allocations: 0 bytes)
99995026
julia> @btime test_sum_turbo($D_bool)
15.336 ms (0 allocations: 0 bytes)
99995026
julia> versioninfo()
Julia Version 1.9.0-DEV.79
Commit df81bf9a96 (2022-02-24 07:30 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 28 × Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
What CPU are you using?
For me, the Float64 case is totally dominated by memory bandwidth.
I can't replicate the check_args failing
check_args fails when (ndims(D_bit) > 1) && (size(D_bit,1) % 8 != 0).
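For reference, a minimal sketch of that condition as a standalone predicate (this just restates the observation above; bitarray_falls_back is a hypothetical helper, not part of LoopVectorization):
# Hedged sketch: the observed condition under which @turbo falls back to a
# plain loop for a BitArray argument (hypothetical helper name).
bitarray_falls_back(A::BitArray) = ndims(A) > 1 && size(A, 1) % 8 != 0

B_ok  = falses(16, 100)   # size(B_ok, 1) % 8 == 0  -> vectorized path
B_bad = falses(10, 100)   # size(B_bad, 1) % 8 != 0 -> check_args fails
bitarray_falls_back(B_ok), bitarray_falls_back(B_bad)   # (false, true)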
Oh nooo is this because I switched CPUs?
edit: Yup, Ryzen 9 5000 series doesn't have AVX512 support. That'll account for the entire discrepancy? Is this going to keep slowing me down by factors of 5-10x? I'm still in the return window...
Edit edit: Ah, actually it looks like our times are comparable except in the single case of D_bit. Is there anywhere else this would pop up as a problem or do I just need to avoid BitArrays with LoopVec?
Julia Version 1.6.5
Commit 9058264a69 (2021-12-19 12:30 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: AMD Ryzen 9 5900HS with Radeon Graphics
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS =
edit: Yup, Ryzen 9 5000 series doesn't have AVX512 support. That'll account for the entire discrepancy? Is this going to keep slowing me down by factors of 5-10x? I'm still in the return window...
Depends on your workload.
I'm surprised that the Ryzen seems to have 2x the memory bandwidth on a single core, given it was twice as fast for Float64.
Odds are the AVX512 CPU is faster when we make the arrays far smaller:
julia> D = rand(1000, 20); D_bit = D .> .5; D_bool = Array{Bool}(D_bit);
julia> @btime test_sum_turbo($D)
974.062 ns (0 allocations: 0 bytes)
10002.523135950083
julia> @btime test_sum_turbo($D_bit)
1.397 μs (0 allocations: 0 bytes)
10039
julia> @btime test_sum_turbo($D_bool)
1.006 μs (0 allocations: 0 bytes)
10039
But of course, you should care most about the size range and specific problems you're working on.
If you don't need BitArrays, seems like it may be best to avoid them.
But I'm a little surprised, too, that I'm getting worse performance for the BitArray than for the Array{Bool} here.
Looking at the assembly...it is awful. Embarrassingly bad.
.LBB0_7: # %L220
# Parent Loop BB0_4 Depth=1
# => This Inner Loop Header: Depth=2
mov rcx, rdi
sar rcx, 3
lea r12, [rax + rcx]
kmovb k2, byte ptr [rax + rcx]
kmovb k3, byte ptr [rdx + r12]
kmovb k4, byte ptr [rsi + r12]
kmovb k5, byte ptr [rbx + r12]
vpsubq zmm9 {k2}, zmm16, zmm0
vpsubq zmm10 {k3}, zmm15, zmm0
vpsubq zmm11 {k4}, zmm14, zmm0
vpsubq zmm12 {k5}, zmm13, zmm0
add rdi, 8
vmovdqa64 zmm13, zmm12
vmovdqa64 zmm14, zmm11
vmovdqa64 zmm15, zmm10
vmovdqa64 zmm16, zmm9
vmovdqa64 zmm3, zmm9
vmovdqa64 zmm2, zmm10
vmovdqa64 zmm4, zmm11
vmovdqa64 zmm1, zmm12
vmovdqa64 zmm8, zmm9
vmovdqa64 zmm7, zmm10
vmovdqa64 zmm6, zmm11
vmovdqa64 zmm5, zmm12
cmp r11, rdi
jge .LBB0_7
Sure, let's do a little bit of arithmetic, and then do 3x the work by copying the memory all over the place for no reason.
Yup, Intel has the edge on the smaller size by ~20% with floats and ~40% with bools (and ~700% with BitArrays).
julia> D = rand(1000, 20); D_bit = D .> .5; D_bool = Array{Bool}(D_bit);
julia> @btime test_sum_turbo($D)
1.150 μs (0 allocations: 0 bytes)
10026.229647114582
julia> @btime test_sum_turbo($D_bit)
8.800 μs (0 allocations: 0 bytes)
9985
julia> @btime test_sum_turbo($D_bool)
1.390 μs (0 allocations: 0 bytes)
9985
BitArrays aren't necessary for my computation, they just fell out naturally from A .> .5; I can convert to Array{Bool} easily. Unfortunately, the sizes I work with include both tiny 20x100 and bigger 20x100000000, though more of the smaller.
I'm satisfied enough with the explanation that it's my processor, so I'll close this, unless you want to track something with that inefficient BitArray assembly. Also, I'd be happy to put a note about the BitArray size(A,1) % 8 restriction somewhere in the docs, if it belongs anywhere. It's niche enough it might just belong in a searchable issue, i.e. here.
Sure, a PR would be welcome.
The assembly isn't great for the other methods either. You could leave this issue open to track that.
I have an idea of something that might coax LLVM into doing the right thing that I'd be happy to explain if anyone wants to give it a try. Otherwise, I won't address it myself. I'm spending all the free time that I can dedicate to SIMD to rewriting this library, rather than working on the old. I wouldn't expect the rewrite to be ready any time soon.
Ok, I'll do the PR this weekend.
If you want to explain it, I'd be happy to give it a shot. This library has been more than helpful enough that I'd invest some time to figure out how it works. No promises that I'll understand it well enough, though.
If you want to explain it, I'd be happy to give it a shot.
Basically, I think this copying is related to the duplications we get in the phi nodes:
L216: ; preds = %L216, %L170
%29 = phi i64 [ %35, %L216 ], [ %28, %L170 ]
%value_phi122817 = phi i64 [ %res.i645, %L216 ], [ 0, %L170 ]
%value_phi105816 = phi <8 x i64> [ %res.i647, %L216 ], [ %value_phi81852, %L170 ]
%value_phi104815 = phi <8 x i64> [ %res.i649, %L216 ], [ %value_phi80851, %L170 ]
%value_phi103814 = phi <8 x i64> [ %res.i651, %L216 ], [ %value_phi79850, %L170 ]
%value_phi102813 = phi <8 x i64> [ %res.i653, %L216 ], [ %value_phi78849, %L170 ]
%value_phi101812 = phi <8 x i64> [ %res.i647, %L216 ], [ %value_phi77848, %L170 ]
%value_phi100811 = phi <8 x i64> [ %res.i649, %L216 ], [ %value_phi76847, %L170 ]
%value_phi99810 = phi <8 x i64> [ %res.i651, %L216 ], [ %value_phi75846, %L170 ]
%value_phi98809 = phi <8 x i64> [ %res.i653, %L216 ], [ %value_phi74845, %L170 ]
%shiftedind.i678 = ashr i64 %29, 3
%ptr.1.i679707 = getelementptr inbounds i8, i8* %18, i64 %shiftedind.i678
%30 = bitcast i8* %ptr.1.i679707 to <8 x i1>*
%res1.i676708 = load <8 x i1>, <8 x i1>* %30, align 1
%ptr.1.i670709 = getelementptr inbounds i8, i8* %ptr.1.i679707, i64 %24
%31 = bitcast i8* %ptr.1.i670709 to <8 x i1>*
%res1.i671710 = load <8 x i1>, <8 x i1>* %31, align 1
%ptr.1.i665711 = getelementptr inbounds i8, i8* %ptr.1.i679707, i64 %25
%32 = bitcast i8* %ptr.1.i665711 to <8 x i1>*
%res1.i666712 = load <8 x i1>, <8 x i1>* %32, align 1
%ptr.1.i660713 = getelementptr inbounds i8, i8* %ptr.1.i679707, i64 %26
%33 = bitcast i8* %ptr.1.i660713 to <8 x i1>*
%res1.i661714 = load <8 x i1>, <8 x i1>* %33, align 1
%res.i657 = add <8 x i64> %value_phi98809, <i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1>
%res.i656 = add <8 x i64> %value_phi99810, <i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1>
%res.i655 = add <8 x i64> %value_phi100811, <i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1>
%res.i654 = add <8 x i64> %value_phi101812, <i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1>
%res.i653 = select <8 x i1> %res1.i676708, <8 x i64> %res.i657, <8 x i64> %value_phi102813
%res.i651 = select <8 x i1> %res1.i671710, <8 x i64> %res.i656, <8 x i64> %value_phi103814
%res.i649 = select <8 x i1> %res1.i666712, <8 x i64> %res.i655, <8 x i64> %value_phi104815
%res.i647 = select <8 x i1> %res1.i661714, <8 x i64> %res.i654, <8 x i64> %value_phi105816
%res.i645 = add nsw i64 %value_phi122817, -8
%34 = add i64 %27, %res.i645
%35 = sub i64 0, %34
%.not706 = icmp slt i64 %.neg, %35
br i1 %.not706, label %L247, label %L216
If you look at the @code_typed, you'll see messes of φ nodes already at the Julia level.
So, the basic idea is to avoid this by not reassigning, and instead updating memory.
Then, at the LLVM level, we should be able to trust mem2reg to convert these stack memory operations back into phi nodes, but (hopefully) doing a much better job than Julia does.
In other words, what we want to do is instead of generating code like
accumulator = 0
for i in ...
    accumulator = accumulator + ...
end
return accumulator
we want to generate code like
accumulator = Ref(0)
for i in ...
    accumulator[] = accumulator[] + ...
end
return accumulator[]
So, instead of reassignments (requiring phi nodes) we just modify memory.
Ref should be stack allocated here (it doesn't escape), and then LLVM can convert the latter back into the former, but hopefully do a better job than Julia.
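As a standalone illustration of the pattern (not the code LoopVectorization actually generates), the two shapes look like this; since the Ref never escapes, it should be stack-allocated and both versions should compile to the same loop with zero allocations:
# Reassignment form: each iteration rebinds acc, which shows up as phi nodes
# in the IR.
function sum_reassign(x)
    acc = zero(eltype(x))
    for xi in x
        acc = acc + xi
    end
    return acc
end

# Ref form: the accumulator lives in (stack-allocatable) memory and is mutated
# in place; LLVM's mem2reg can turn it back into registers/phi nodes.
function sum_ref(x)
    acc = Ref(zero(eltype(x)))
    for xi in x
        acc[] = acc[] + xi
    end
    return acc[]
end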
So, we need to change how these variables are initialized, used in computations, assigned to, and finally returned.
Here is the function that initializes outer reductions: https://github.com/JuliaSIMD/LoopVectorization.jl/blob/4fecc5748494ed438f243e16849589c8312e60dc/src/codegen/lowering.jl#L516-L542
For both using them in computations and for assigning to them, you'd need to edit lower_compute.
This would probably be the most annoying one, but you can already see a check for whether a variable is an outer reduction here:
https://github.com/JuliaSIMD/LoopVectorization.jl/blob/4fecc5748494ed438f243e16849589c8312e60dc/src/codegen/lower_compute.jl#L508
If so, you can then do the appropriate [] on reference and assignment.
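Roughly, that means a read of acc becomes acc[] and an assignment acc = rhs becomes acc[] = rhs in the generated Expr. A hedged sketch at the Expr level (deref and ref_assign are hypothetical helper names, not existing functions in the package):
# Hypothetical helpers for rewriting an outer-reduction variable's reads and
# writes to go through a Ref.
deref(var::Symbol) = Expr(:ref, var)                        # acc     -> acc[]
ref_assign(var::Symbol, rhs) = Expr(:(=), deref(var), rhs)  # acc = r -> acc[] = r

ref_assign(:acc, :(vadd(acc[], x)))   # returns :(acc[] = vadd(acc[], x))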
Finally, you'd need to handle the returns, which hopefully doesn't require much more than these two locations: https://github.com/JuliaSIMD/LoopVectorization.jl/blob/4fecc5748494ed438f243e16849589c8312e60dc/src/reconstruct_loopset.jl#L588 This handles the values just before returning. You could define a function like
@inline unwrapref(x::Base.RefValue) = x[]
@inline unwrapref(x) = x
and then call that function like
extractt = Expr(:call, GlobalRef(Core,:getfield), Expr(:call, unwrapref, Symbol("#vargs#")), (offset+=1), false)
This will handle the case where it isn't a Ref, which could be because of this:
https://github.com/JuliaSIMD/LoopVectorization.jl/blob/4fecc5748494ed438f243e16849589c8312e60dc/src/codegen/lowering.jl#L614
This code reduces multiple vecs (when unrolled) to a single one. You don't have to Ref-wrap what gets returned here, because it should be handled by unwrapref above.
You'll also be able to delete this code, as it was another attempt at working around it: https://github.com/JuliaSIMD/LoopVectorization.jl/blob/4fecc5748494ed438f243e16849589c8312e60dc/src/codegen/lowering.jl#L405-L408
Something that may be annoying is that sometimes outer reduction variables can change their size/name.
E.g., it'll 4x unroll for part of it, naming the variables something like gensymedname_4, and then 2x unroll, renaming them to gensymedname_2.
That'd be partly covered by the init expr, but would also need this:
https://github.com/JuliaSIMD/LoopVectorization.jl/blob/4fecc5748494ed438f243e16849589c8312e60dc/src/codegen/lowering.jl#L567