MultiFloats.jl
Vectorization problems (and how to fix them)
Current codegen is unfortunately not great. For instance, a trivial loop like this:
julia> function trivial_sum(xs::Vector{Float64x4})
           t = zero(Float64x4)
           @simd for i in 1:length(xs)
               t += @inbounds xs[i]
           end
           t
       end
generates the following inner loop:
L48:
vmovsd -24(%rdx), %xmm5 # xmm5 = mem[0],zero
vmovsd -16(%rdx), %xmm6 # xmm6 = mem[0],zero
vmovsd -8(%rdx), %xmm7 # xmm7 = mem[0],zero
vaddsd %xmm5, %xmm4, %xmm9
vsubsd %xmm4, %xmm9, %xmm0
vsubsd %xmm0, %xmm9, %xmm3
vsubsd %xmm3, %xmm4, %xmm3
vsubsd %xmm0, %xmm5, %xmm0
vaddsd %xmm3, %xmm0, %xmm0
vaddsd %xmm6, %xmm2, %xmm3
vsubsd %xmm2, %xmm3, %xmm4
vsubsd %xmm4, %xmm3, %xmm5
vsubsd %xmm5, %xmm2, %xmm2
vsubsd %xmm4, %xmm6, %xmm4
vaddsd %xmm2, %xmm4, %xmm2
vaddsd %xmm7, %xmm1, %xmm4
vsubsd %xmm1, %xmm4, %xmm5
vsubsd %xmm5, %xmm4, %xmm6
vsubsd %xmm6, %xmm1, %xmm1
vsubsd %xmm5, %xmm7, %xmm5
vaddsd %xmm1, %xmm5, %xmm1
vaddsd (%rdx), %xmm8, %xmm5
vaddsd %xmm1, %xmm5, %xmm1
vaddsd %xmm0, %xmm3, %xmm5
vsubsd %xmm3, %xmm5, %xmm6
vsubsd %xmm6, %xmm5, %xmm7
vsubsd %xmm7, %xmm3, %xmm3
vsubsd %xmm6, %xmm0, %xmm0
vaddsd %xmm3, %xmm0, %xmm0
vaddsd %xmm2, %xmm4, %xmm3
vsubsd %xmm4, %xmm3, %xmm6
vsubsd %xmm6, %xmm3, %xmm7
vsubsd %xmm7, %xmm4, %xmm4
vsubsd %xmm6, %xmm2, %xmm2
vaddsd %xmm4, %xmm2, %xmm2
vaddsd %xmm0, %xmm3, %xmm4
vsubsd %xmm0, %xmm4, %xmm6
vsubsd %xmm6, %xmm4, %xmm7
vsubsd %xmm7, %xmm0, %xmm0
vsubsd %xmm6, %xmm3, %xmm3
vaddsd %xmm0, %xmm3, %xmm0
vaddsd %xmm0, %xmm2, %xmm0
vaddsd %xmm0, %xmm1, %xmm0
vaddsd %xmm0, %xmm4, %xmm1
vsubsd %xmm4, %xmm1, %xmm2
vsubsd %xmm2, %xmm0, %xmm0
vaddsd %xmm1, %xmm5, %xmm2
vsubsd %xmm5, %xmm2, %xmm3
vsubsd %xmm3, %xmm1, %xmm1
vaddsd %xmm2, %xmm9, %xmm4
vsubsd %xmm9, %xmm4, %xmm3
vsubsd %xmm3, %xmm2, %xmm3
vaddsd %xmm3, %xmm1, %xmm2
vsubsd %xmm3, %xmm2, %xmm3
vsubsd %xmm3, %xmm1, %xmm3
vaddsd %xmm3, %xmm0, %xmm1
vsubsd %xmm3, %xmm1, %xmm3
vsubsd %xmm3, %xmm0, %xmm8
addq $32, %rdx
decq %rcx
jne L48
By relaxing the type restrictions in MultiFloats.jl a bit, and patching VectorizationBase.jl to add another Vec type that doesn't allow for certain fast-math ops, it should be possible to get a 4x speedup on AVX2 and potentially 8x on AVX512 (well, that's the upper bound :p).
A minimal set of changes that allows using VectorizationBase is here: https://github.com/haampie/MultiFloats.jl/commit/0e83c05213e4fe3647dc405eaf62763ca46d4989
The idea is to work with a type MultiFloat{Vec{M,T},N} and run the basic *, /, +, and - operations on 2, 4, or 8 numbers simultaneously.
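To make the idea concrete, here is a small hypothetical sketch (it assumes the relaxed type restriction and the patched VectorizationBase linked above, so it will not run against the released packages): each limb of the MultiFloat holds a SIMD vector, and a single MultiFloat addition then performs M independent MultiFloat additions at once.

using MultiFloats, VectorizationBase

# four Float64x2 values (1, 2, 3, 4), stored limb-by-limb in two 4-wide vectors
a = MultiFloat((Vec(1.0, 2.0, 3.0, 4.0), Vec(0.0, 0.0, 0.0, 0.0)))
b = MultiFloat((Vec(0.5, 0.5, 0.5, 0.5), Vec(0.0, 0.0, 0.0, 0.0)))
c = a + b  # one call performs four two-word additions using vector instructions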
Hey @haampie, I've unfortunately noticed the same issue. I've been able to get very simple loops like `@inbounds c[i] = a[i] + b[i]` to vectorize for `Float64x[2, 3, 4]` on `-O3`, but it breaks down for larger `MultiFloat` types, and reduction loops have never worked. The failure of reductions to vectorize is totally expected, since Julia doesn't know that `MultiFloat` addition can be reordered -- only that the scalar additions within the `MultiFloat` addition function can be.

There are two things I would like to fix here. First, there are loops like `@inbounds c[i] = a[i] + b[i] * c[i]` for `Float64x8` which Julia should in principle be able to vectorize, but doesn't -- I assume there must be some threshold inside Julia or LLVM that says "don't bother trying to vectorize loops past a certain size." Second, in order to enable vectorization of reduction loops, I would like to be able to inform Julia that inside a `@simd` block, `MultiFloat` arithmetic operations can be freely re-associated.
It looks like the intent of the changes you're proposing here is to allow programmers to work around these issues by writing explicitly vectorized code using `MultiFloat{Vec{M,T}, N}`. If that's the case, then I would like to avoid generalizing the type restrictions all the way down to `Real` -- I don't want people thinking they can get a 256-bit integer by writing `MultiFloat{Int256, 2}`. Instead, I think we should proceed by changing `MultiFloats.AF` to a `Union` of `Float16`, `Float32`, `Float64`, `BigFloat`, and `Vec` types.

(Indeed, `AbstractFloat` was already too permissive to begin with -- `MultiFloat{Float64x2, 2}` should not be allowed either, since the `MultiFloat` arithmetic algorithms depend crucially on IEEE round-to-nearest semantics for error propagation. The only reason I used `AbstractFloat` instead of `IEEEFloat` in the first place was that I wanted to do some testing with `MultiFloat{BigFloat, N}`.)
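Concretely, the constraint I have in mind would be something along these lines (just a sketch of the idea, nothing final):

const AF = Union{Float16, Float32, Float64, BigFloat, VectorizationBase.Vec}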
Just as an example:
using MultiFloats, VectorizationBase
using VectorizationBase: extractelement, pick_vector_width
using MultiFloats: renormalize

# pack M scalar MultiFloats into one MultiFloat whose limbs are M-wide Vecs
@generated function MultiFloatOfVec(fs::NTuple{M,MultiFloat{T,N}}) where {T,M,N}
    exprs = [:(Vec($([:(fs[$j]._limbs[$i]) for j = 1:M]...))) for i = 1:N]
    return quote
        MultiFloat(tuple($(exprs...)))
    end
end

# unpack a MultiFloat of Vecs back into a tuple of M scalar MultiFloats
@generated function TupleOfMultiFloat(fs::MultiFloat{Vec{M,T},N}) where {T,M,N}
    exprs = [:(MultiFloat(tuple($([:(extractelement(fs._limbs[$j], $i)) for j = 1:N]...)))) for i = 0:M-1]
    return quote
        tuple($(exprs...))
    end
end
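# (Illustration added for clarity, not part of the original snippet.) A round
# trip through the two helpers above: pack eight Float64x4 values into a single
# MultiFloat{Vec{8,Float64},4}, then unpack them again.
#
#     xs = ntuple(_ -> Float64x4(rand()), 8)
#     v  = MultiFloatOfVec(xs)
#     TupleOfMultiFloat(v) == xs   # true -- packing only rearranges the limbs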
function trivial_sum(xs)
    t = zero(eltype(xs))
    @inbounds for x in xs
        t += x
    end
    return t
end
function vectorized_sum(xs::Vector{MultiFloat{T,N}}) where {T,N}
    M = pick_vector_width(T)
    t = zero(MultiFloat{Vec{M,T},N})
    # process M elements per iteration; assumes length(xs) is a multiple of M
    # (any remainder is silently dropped)
    @inbounds for i = 1:M:length(xs)-M+1
        t += MultiFloatOfVec(ntuple(k -> xs[i + k - 1], M))
    end
    return +(TupleOfMultiFloat(t)...)
end
function trivial_dot(xs, ys)
    t = zero(eltype(xs))
    @inbounds for i = 1:length(xs)
        t += xs[i] * ys[i]
    end
    return t
end
function vectorized_dot(xs::Vector{MultiFloat{T,N}}, ys::Vector{MultiFloat{T,N}}) where {T,N}
    M = pick_vector_width(T)
    t = zero(MultiFloat{Vec{M,T},N})
    @inbounds for i = 1:M:length(xs)-M+1
        x = MultiFloatOfVec(ntuple(k -> xs[i + k - 1], M))
        y = MultiFloatOfVec(ntuple(k -> ys[i + k - 1], M))
        t += x * y
    end
    return +(TupleOfMultiFloat(t)...)
end
using BenchmarkTools

random_vec(::Type{MultiFloat{T,N}}, k) where {T,N} =
    [renormalize(MultiFloat(ntuple(i -> rand(T) * eps(T)^(i-1), N))) for _ = 1:k]

function benchmark_dot(::Type{T}) where {T<:MultiFloat}
    xs = random_vec(T, 4096)
    ys = random_vec(T, 4096)
    @show vectorized_dot(xs, ys) - trivial_dot(xs, ys)
    vectorized = @benchmark vectorized_dot($xs, $ys)
    trivial = @benchmark trivial_dot($xs, $ys)
    return vectorized, trivial
end

function benchmark_sum(::Type{T}) where {T<:MultiFloat}
    xs = random_vec(T, 4096)
    @show vectorized_sum(xs) - trivial_sum(xs)
    vectorized = @benchmark vectorized_sum($xs)
    trivial = @benchmark trivial_sum($xs)
    return vectorized, trivial
end
Benchmark timings:
julia> benchmark_sum(Float64x8)
vectorized_sum(xs) - trivial_sum(xs) = 0.0
(Trial(59.062 μs), Trial(240.739 μs))
julia> benchmark_sum(Float64x4)
vectorized_sum(xs) - trivial_sum(xs) = 3.1115076389305709e-61
(Trial(21.876 μs), Trial(88.099 μs))
julia> benchmark_dot(Float64x8)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = -1.3849467926678604e-127
(Trial(328.814 μs), Trial(854.370 μs))
julia> benchmark_dot(Float64x4)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = -7.7787690973264271e-62
(Trial(37.992 μs), Trial(128.553 μs))
To replicate, install a patched VectorizationBase:
] add https://github.com/haampie/VectorizationBase.jl#without-fastmath-everywhere
I think this is the best way to do it in Julia.
Seems like `dot` doesn't entirely inline :(.
Some more data points for AVX-512 on an Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz with vectors of size 2^13:
julia> benchmark_sum(Float64x8)
vectorized_sum(xs) - trivial_sum(xs) = -5.9091063153828709e-126
(Trial(124.202 μs), Trial(499.626 μs))
julia> benchmark_sum(Float64x7)
vectorized_sum(xs) - trivial_sum(xs) = -4.5240823300086601e-109
(Trial(100.521 μs), Trial(422.145 μs))
julia> benchmark_sum(Float64x6)
vectorized_sum(xs) - trivial_sum(xs) = 6.7116512220867355e-93
(Trial(78.467 μs), Trial(339.790 μs))
julia> benchmark_sum(Float64x5)
vectorized_sum(xs) - trivial_sum(xs) = -4.7498927053019445e-77
(Trial(47.538 μs), Trial(257.020 μs))
julia> benchmark_sum(Float64x4)
vectorized_sum(xs) - trivial_sum(xs) = -2.6058876476043531e-60
(Trial(33.474 μs), Trial(187.179 μs))
julia> benchmark_sum(Float64x3)
vectorized_sum(xs) - trivial_sum(xs) = -3.7835058536770061e-44
(Trial(17.097 μs), Trial(121.135 μs))
julia> benchmark_sum(Float64x2)
vectorized_sum(xs) - trivial_sum(xs) = -4.543838814073028e-28
(Trial(9.002 μs), Trial(63.851 μs))
and
julia> benchmark_dot(Float64x8)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = 3.7855212332921517e-126
(Trial(555.383 μs), Trial(1.735 ms))
julia> benchmark_dot(Float64x7)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = -7.4846950312643274e-110
(Trial(275.906 μs), Trial(1.280 ms))
julia> benchmark_dot(Float64x6)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = 2.7865337663127964e-93
(Trial(200.465 μs), Trial(855.838 μs))
julia> benchmark_dot(Float64x5)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = -6.0453179885661112e-77
(Trial(131.219 μs), Trial(558.778 μs))
julia> benchmark_dot(Float64x4)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = 1.2834969010588605e-60
(Trial(89.520 μs), Trial(300.694 μs))
julia> benchmark_dot(Float64x3)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = -3.4331812375958018e-44
(Trial(36.687 μs), Trial(122.643 μs))
julia> benchmark_dot(Float64x2)
vectorized_dot(xs, ys) - trivial_dot(xs, ys) = 3.0292258760486853e-28
(Trial(15.735 μs), Trial(63.286 μs))
Summary:
Type | speedup sum | speedup dot |
---|---|---|
Float64x8 | 4.02 | 3.12 |
Float64x7 | 4.20 | 4.64 |
Float64x6 | 4.33 | 4.27 |
Float64x5 | 5.41 | 4.26 |
Float64x4 | 5.59 | 3.36 |
Float64x3 | 7.09 | 3.34 |
Float64x2 | 7.09 | 4.02 |
So, not entirely the 8x speedup I hoped for; I think the issue here is that inlining doesn't work properly.
Maybe we can inline code by hand?
There was still some issue with bounds checks. After fixing it I'm getting:
Type | speedup sum | speedup dot |
---|---|---|
Float64x8 | 4.33 | 3.19 |
Float64x7 | 4.46 | 4.76 |
Float64x6 | 5.12 | 4.39 |
Float64x5 | 5.83 | 4.48 |
Float64x4 | 5.71 | 3.50 |
Float64x3 | 7.11 | 3.58 |
Float64x2 | 7.00 | 4.32 |
Ok, seems like there are no real inlining problems after all. The issue is the codegen for the transform from `NTuple{N,MultiFloat{Float64,M}} -> MultiFloat{Vec{N,Float64},M}`. It's pretty much a matrix transpose in the registers; I've written that kernel before: https://github.com/haampie/FastTranspose.jl/blob/master/src/FastTranspose.jl#L34-L41
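For readers unfamiliar with in-register transposes, the classic 4x4 pattern looks like this (written with SIMD.jl's shufflevector purely for illustration -- it is not the kernel from FastTranspose.jl, and the real code uses VectorizationBase's equivalents):

using SIMD: Vec, shufflevector

# a..d each hold the four limbs of one Float64x4; after the transpose,
# r0..r3 each hold one limb position across all four numbers.
function transpose4x4(a::Vec{4,Float64}, b::Vec{4,Float64},
                      c::Vec{4,Float64}, d::Vec{4,Float64})
    t0 = shufflevector(a, b, Val((0, 4, 1, 5)))   # a0 b0 a1 b1
    t1 = shufflevector(a, b, Val((2, 6, 3, 7)))   # a2 b2 a3 b3
    t2 = shufflevector(c, d, Val((0, 4, 1, 5)))   # c0 d0 c1 d1
    t3 = shufflevector(c, d, Val((2, 6, 3, 7)))   # c2 d2 c3 d3
    r0 = shufflevector(t0, t2, Val((0, 1, 4, 5))) # a0 b0 c0 d0
    r1 = shufflevector(t0, t2, Val((2, 3, 6, 7))) # a1 b1 c1 d1
    r2 = shufflevector(t1, t3, Val((0, 1, 4, 5))) # a2 b2 c2 d2
    r3 = shufflevector(t1, t3, Val((2, 3, 6, 7))) # a3 b3 c3 d3
    return r0, r1, r2, r3
end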
I've figured out the problem with `dot`: it's the Julia codegen that creates a `+(a, ...)` expression with too many args.
With an improved transpose kernel I'm getting a 5.40x speedup for dot and a 5.66x speedup for sum for Float64x8 on AVX512 :)
The resulting assembly is quite pretty for dot: https://gist.github.com/haampie/e2e52718208ee7997fc4a34a4649af97
I've improved the register-transpose kernel for general input sizes (for instance Float64x7 requires a 7x8 transpose on AVX-512), and I'm getting the following results:
Type | speedup sum | speedup dot |
---|---|---|
Float64x8 | 5.65 | 5.38 |
Float64x7 | 5.81 | 5.32 |
Float64x6 | 5.79 | 5.18 |
Float64x5 | 5.79 | 5.13 |
Float64x4 | 5.77 | 5.45 |
Float64x3 | 7.17 | 3.58 |
Float64x2 | 7.04 | 5.07 |
Not sure this can be improved really.
For reference the struct-of-arrays version which doesn't require shuffling has the following speedups vs scalar:
Type | speedup sum | speedup dot |
---|---|---|
Float64x8 | 5.77 | 5.45 |
Float64x7 | 5.80 | 5.50 |
Float64x6 | 5.75 | 5.78 |
Float64x5 | 5.77 | 5.93 |
Float64x4 | 5.78 | 5.81 |
Float64x3 | 7.04 | 5.61 |
Float64x2 | 7.12 | 6.76 |
Another result on AVX-512:
julia> @benchmark LinearAlgebra.qrfactUnblocked!(mat) setup=(mat=StructArray(rand(Float64x2, 400, 400)))
BenchmarkTools.Trial:
memory estimate: 168.47 KiB
allocs estimate: 2395
--------------
minimum time: 81.934 ms (0.00% GC)
median time: 84.113 ms (0.00% GC)
mean time: 83.887 ms (0.00% GC)
maximum time: 86.996 ms (0.00% GC)
--------------
samples: 54
evals/sample: 1
julia> @benchmark LinearAlgebra.qrfactUnblocked!(mat) setup=(mat=rand(Double64, 400, 400))
BenchmarkTools.Trial:
memory estimate: 6.38 KiB
allocs estimate: 1
--------------
minimum time: 627.563 ms (0.00% GC)
median time: 628.127 ms (0.00% GC)
mean time: 628.101 ms (0.00% GC)
maximum time: 628.408 ms (0.00% GC)
--------------
samples: 8
evals/sample: 1
This still requires some hand-vectorization of `axpy!` and `dot`/`norm`, but basically I think the best strategy is to make people use StructArray for arrays of MultiFloats.
Hey @haampie, this is fantastic work! I am especially impressed with your fast `zmm` transpose kernel; I've been trying to think of a good way to do the `MultiFloat{Vec{K,Float64},N} <-> NTuple{K,MultiFloat{Float64,N}}` transformation for some time. It's amazing that the AoS versions of `sum` and `dot` you put together are nearly as fast as (and in some cases, even faster than) the SoA versions. Your performance insights are also much appreciated, as I do not have access to an AVX-512 capable CPU.
Here is what I am going to do. Starting tonight, with the imminent release of MultiFloats.jl v1.0, I am simply going to drop the `T <: AbstractFloat` type constraint everywhere it occurs. Even though this might wrongly suggest the possibility of, say, `MultiFloat{Int, N}` or `MultiFloat{Complex, N}` to new users of the library, you have put forward an overwhelming case that `MultiFloat{Vec{K, T}, N}` provides serious benefits, even for non-`Vec`-aware code. Considering that there may be other such types (I believe SIMD.jl has its own `Vec` type which is not even a subtype of `Real`), it doesn't make sense for me to insist on an impractical type hierarchy.
Moving forward, I would also like MultiFloats.jl to provide explicitly vectorized implementations for `sum(::Vector{MultiFloat})` and `dot(::Vector{MultiFloat})`. I hope you will not mind if I follow your example for these and other simple kernels.
By the way, I had a look at your patch to VectorizationBase, and I believe it should be safe to use MultiFloats.jl with `afn arcp nsz`; it's only `reassoc` that I depend on in a crucial way. It's a shame that VectorizationBase doesn't provide a mechanism for selectively disabling `reassoc` only.
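For concreteness, this is what that would look like at the LLVM level (a standalone illustration, not code from either package): an addition that carries the harmless flags but not `reassoc`, so LLVM is still forbidden from re-associating it -- which is exactly the property the error-free transformations in MultiFloats.jl rely on.

# hypothetical illustration of an fadd tagged afn/arcp/nsz but *not* reassoc
add_no_reassoc(a::Float64, b::Float64) = Base.llvmcall(
    """
    %r = fadd afn arcp nsz double %0, %1
    ret double %r
    """,
    Float64, Tuple{Float64,Float64}, a, b)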
Just a few comments: it would be interesting to look into what's necessary to make LLVM autovectorize reductions, because that would be much more generic than implementing a few vectorized routines. This is a good starting point: https://docs.julialang.org/en/v1/devdocs/llvm/#Passing-options-to-LLVM
Another idea I had is to add a simple "struct of arrays" array type to this package: basically a permuted array where the limb index is the last dimension. So, for instance, an m x n matrix of Float64xN is represented as a 3-dimensional array of dimensions (m, n, N).
A minimal working example:
struct MFArray{T,M,N,TA} <: AbstractArray{MultiFloat{T,M},N}
    A::TA
end

import Base: size, getindex, setindex!, view, IndexStyle
using Base: OneTo
using Base.Cartesian: @ntuple, @nexprs

export MFArray

# drop the trailing limb dimension of the backing array
Base.size(A::MFArray) = reverse(Base.tail(reverse(size(A.A))))

Base.IndexStyle(x::MFArray) = IndexStyle(x.A)

@generated function Base.getindex(A::MFArray{T,M,N}, i::Int) where {T,M,N}
    quote
        $(Expr(:meta, :inline, :propagate_inbounds))
        dist = length(A)
        MultiFloat{T,M}(@ntuple $M j -> A.A[i + (j - 1) * dist])
    end
end

@generated function Base.setindex!(A::MFArray{T,M,N}, v::MultiFloat{T,M}, i::Int) where {T,M,N}
    quote
        $(Expr(:meta, :inline, :propagate_inbounds))
        dist = length(A)
        @nexprs $M j -> A.A[i + (j - 1) * dist] = v._limbs[j]
        return v
    end
end

@generated function Base.getindex(A::MFArray{T,M,N}, I::Vararg{Int,N}) where {T,M,N}
    quote
        $(Expr(:meta, :inline, :propagate_inbounds))
        MultiFloat{T,M}(@ntuple $M j -> A.A[I..., j])
    end
end

@generated function Base.setindex!(A::MFArray{T,M,N}, v::MultiFloat{T,M}, I::Vararg{Int,N}) where {T,M,N}
    quote
        $(Expr(:meta, :inline, :propagate_inbounds))
        @nexprs $M j -> A.A[I..., j] = v._limbs[j]
        return v
    end
end

# move the limb dimension from first (as produced by reinterpret) to last
function MFArray(A::AbstractArray{MultiFloat{T,M},N}) where {T,M,N}
    A′ = permutedims(reshape(reinterpret(T, A), M, size(A)...), circshift(OneTo(N+1), -1))
    return MFArray{T,M,N,typeof(A′)}(A′)
end

@inline function view(A::MFArray{T,M,N}, I::Vararg{Any,N}) where {T,M,N}
    B = view(A.A, I..., :)
    return MFArray{T,M,ndims(B)-1,typeof(B)}(B)
end
This results in the following:
julia> function simple_muladd(C, A, B)
           @inbounds @simd ivdep for i ∈ eachindex(A)
               C[i] += A[i] * B[i]
           end
           return nothing
       end
simple_muladd (generic function with 1 method)
julia> @benchmark simple_muladd(C, A, B) setup=(A=MFArray(fill(Float64x4(1.0), 100)); B=MFArray(fill(Float64x4(1.0), 100)); C=MFArray(fill(Float64x4(1.0), 100)))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 654.309 ns (0.00% GC)
median time: 673.235 ns (0.00% GC)
mean time: 673.953 ns (0.00% GC)
maximum time: 811.642 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 162
julia> @benchmark simple_muladd(C, A, B) setup=(A=fill(Float64x4(1.0), 100); B=fill(Float64x4(1.0), 100); C=fill(Float64x4(1.0), 100))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 859.617 ns (0.00% GC)
median time: 878.817 ns (0.00% GC)
mean time: 880.429 ns (0.00% GC)
maximum time: 1.439 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 60
The problem however is that it requires `@simd ivdep`, probably because the compiler cannot assume the pointers don't alias.
Using an array like I suggested above does have the downside that it runs into cache thrashing sometimes.
For instance, I have some code to do reduction to band form (the first step in a parallel Schur decomposition), and it's slow at 512x512 compared to 513x513:
julia> A = MFArray(rand(Float64x4, 512, 512)); @time to_band!(A, p);
3.015606 seconds (2 allocations: 352 bytes)
julia> A = MFArray(rand(Float64x4, 513, 513)); @time to_band!(A, p);
1.789323 seconds (2 allocations: 352 bytes)
because loading `A[i, j, 1]`, `A[i, j, 2]`, `A[i, j, 3]`, `A[i, j, 4]` all lands in the same cache set: each limb plane is 512 · 512 · 8 B = 2 MiB, so the four limbs of a single element sit exactly 2 MiB apart -- a large power-of-two stride that maps them all onto the same set. Without MFArray it looks like this:
julia> A = rand(Float64x4, 512, 512); @time to_band!(A, p);
2.249070 seconds (1 allocation: 336 bytes)
julia> A = rand(Float64x4, 513, 513); @time to_band!(A, p);
2.052841 seconds (1 allocation: 336 bytes)
There's some slowdown here too (I guess when iterating over columns) but it's not as dramatic as above.
As a matter of fact, I actually have a similar struct-of-arrays construct that I started testing privately a long time ago. However, I hesitate to release an array-like construct as part of the public MultiFloats.jl API, since I have no idea how the Julia `AbstractArray` type hierarchy works. Do you have a good understanding of what needs to be done in order to make an array-like type work with `eachindex`, `iterate`, `CartesianIndices`, etc.?
In principle `size`, `getindex`, and `setindex!` are sufficient. It's slightly more efficient to also implement `IndexStyle` to get linear indexing whenever possible, like I did above (so for plain arrays you get linear indexing, and for views you get cartesian indexing).
The only remaining function to implement is `similar`, which should ensure that e.g. `A[1:3, 1:3]` also returns an MFArray instead of an Array{MultiFloat{..}} as it does right now. When that's in, it should be good to go.
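Something along these lines should do it (an untested sketch; the exact signature may need tweaking):

# allocate the backing array with the limb dimension appended last,
# mirroring the (dims..., M) layout used by MFArray
function Base.similar(A::MFArray{T,M}, ::Type{MultiFloat{T,M}}, dims::Dims{N}) where {T,M,N}
    B = similar(A.A, T, (dims..., M))
    return MFArray{T,M,N,typeof(B)}(B)
end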
Btw, I'm trying to write a microkernel for tall-and-skinny QR. What I found is that for Q * A it's useful to store Q as an Array{Float64xN} and A with dims permuted (matrix transpose + limbs in the second dimension), so I guess an MFArray like the above isn't the answer to everything.
Closing this old thread since explicit vectorization is now supported in MultiFloats.jl v2.0. I have, for the time being, decided against introducing my own `MultiFloatArray` type and instead opted for `mfvgather` and `mfvscatter` operations on standard `Array{MultiFloat}` types.