DoubleFloats.jl
approaching better vectorization
@chriselrod @ffevotte I could use some help providing better vectorization over the basic arithmetic ops for DoubleFloats. I have tried to follow the guidelines you have given without seeing much change. Beneath addition and multiplication there are three core routines; being able to vectorize well over them would go a long way.
Multiplication of Double64s is faster than addition. Both are needed to accelerate matrix ops. Please ask questions.
# Double64s are structs
struct Double64
    hi::Float64
    lo::Float64
end
# the underlying support for arithmetic
"""
two_prod(a, b)
Computes `hi = fl(a*b)` and `lo = err(a*b)`.
"""
@inline function two_prod(a::T, b::T) where {T<:Float64}
hi = a * b
lo = fma(a, b, -s)
return hi, lo
end
"""
two_sum(a, b)
Computes `hi = fl(a+b)` and `lo = err(a+b)`.
"""
@inline function two_sum(a::T, b::T) where {T<:Float64}
hi = a + b
a1 = hi - b
b1 = hi - a1
lo = (a - a1) + (b - b1)
return hi, lo
end
# there is an alternative implementation of two_sum.
@inline function two_sum(a::T, b::T) where {T<:Float64}
    hi = a + b
    v = hi - a
    lo = (a - (hi - v)) + (b - v)
    return hi, lo
end
# this is a special case of two_sum when the relative magnitudes are known
# it is used heavily to gain some speed where allowable
"""
two_hilo_sum(a, b)
*unchecked* requirement `|a| ≥ |b|`
Computes `hi = fl(a+b)` and `lo = err(a+b)`.
"""
@inline function two_hilo_sum(a::T, b::T) where {T<:Float64}
hi = a + b
lo = b - (s - a)
return hi, lo
end
# this is the support for division (called `two_dvi` in the source)
"""
two_div(a, b)
Computes `hi = fl(a/b)` and `lo = err(a/b)`.
"""
@inline function two_div(a::T, b::T) where {T<:Float64}
hi = a / b
lo = fma(-hi, b, a)
lo /= b
return hi, lo
end
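The same error-free property holds for the sum (again my illustration, not from the thread): `hi` is the correctly rounded result and `hi + lo` is exact.
julia> two_sum(1.0, 2.0^-60)    # 2^-60 vanishes from the rounded sum ...
(1.0, 8.673617379884035e-19)    # ... but is recovered exactly as lo (= 2^-60)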
# The basic arithmetic `add` and `mul` for Double64s
# relative error < 3u² (called `add_dddd_dd` in the source)
# each Tuple gives the Double64 struct fields `(hi, lo)`
#
# >>>> It is easy to pass the structs instead if that is easier to vectorize
#
@inline function add(x::Tuple{T,T}, y::Tuple{T,T}) where T<:Float64
    xhi, xlo = x
    yhi, ylo = y
    hi, lo = two_sum(xhi, yhi)
    thi, tlo = two_sum(xlo, ylo)
    c = lo + thi
    hi, lo = two_hilo_sum(hi, c)
    c = tlo + lo
    hi, lo = two_hilo_sum(hi, c)
    return hi, lo
end
# relative error <= 5u² (called `mul_dddd_dd` in the source)
@inline function mul(x::Tuple{T,T}, y::Tuple{T,T}) where T<:Float64
    xhi, xlo = x
    yhi, ylo = y
    hi, lo = two_prod(xhi, yhi)
    t = xlo * ylo
    t = fma(xhi, ylo, t)
    t = fma(xlo, yhi, t)
    t = lo + t
    hi, lo = two_hilo_sum(hi, t)
    return hi, lo
end
# called `dvi_dddd_dd` in the source
@inline function divide(x::Tuple{T,T}, y::Tuple{T,T}) where T<:Float64
    xhi, xlo = x
    yhi, ylo = y
    hi = xhi / yhi
    uh, ul = two_prod(hi, yhi)
    lo = ((((xhi - uh) - ul) + xlo) - hi * ylo) / yhi
    hi, lo = two_hilo_sum(hi, lo)
    return hi, lo
end
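For reference, a small usage sketch (mine) of the tuple form these kernels expect: `(1.0, 2.0^-60)` represents the Double64 value 1 + 2^-60, which a bare Float64 cannot hold, and `add`/`mul` preserve the error term.
julia> x = (1.0, 2.0^-60);
julia> add(x, x)             # 2 + 2^-59, error term carried in lo
(2.0, 1.734723475976807e-18)
julia> mul(x, x)             # (1 + 2^-60)^2, rounds to 1 + 2^-59
(1.0, 1.734723475976807e-18)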
Hello Jeffrey,
there is one thing that is not clear to me yet: what would be the objective of the vectorization? I can see two very different approaches:
- trying to make `+(::Double64, ::Double64)` faster by using vector instructions in the underlying EFTs, or
- trying to make `+(::Vector{Double64}, ::Vector{Double64})` faster by allowing double-double numbers to be handled in packs of 4.
Approach 1 seems hard, because I'm not sure there is enough SIMD-compatible work to do in the EFTs. On the other hand, in approach 2 it would be more straightforward to know which operations can be performed in a SIMD way; but in this case I think the difficulties would lie in the data layout: I expect that it would be better to have a structure of arrays, rather than an array of Double64 structures.
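(EFT = error-free transformation, i.e. primitives like `two_sum` and `two_prod` above.) To make the layout point concrete, here is a minimal sketch of approach 2 under a structure-of-arrays layout; the container and function names are mine, not part of DoubleFloats.jl. Every statement in the loop body is a plain Float64 op over contiguous homogeneous data, which is the shape SIMD wants.
# hypothetical structure-of-arrays container for a vector of Double64s
struct Double64SoA
    hi::Vector{Float64}
    lo::Vector{Float64}
end

# elementwise z = x + y using the `add` algorithm above, one element per lane;
# whether the compiler actually emits SIMD here depends on @simd/LLVM
function soa_add!(z::Double64SoA, x::Double64SoA, y::Double64SoA)
    @inbounds @simd for i in eachindex(x.hi)
        hi, lo = two_sum(x.hi[i], y.hi[i])
        thi, tlo = two_sum(x.lo[i], y.lo[i])
        c = lo + thi
        hi, lo = two_hilo_sum(hi, c)
        c = tlo + lo
        hi, lo = two_hilo_sum(hi, c)
        z.hi[i] = hi
        z.lo[i] = lo
    end
    return z
end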
Greetings François,
I did not know that one thing! So I could not guess about effectiveness: whether (1) or (2), with judicious use of LLVM/SIMD/AVX or LoopVectorization/AccurateArithmetic for each as appropriate, would be most beneficial, attainable, and maintainable.
I do understand your thinking. Perhaps first we see how; once I know appropriate ways to code for this, I'll push it and develop benchmarking help. It is important at the start to pursue (2) without constraining the availability of go-faster gains through (1). When reviewing the generated code, the primitives (two_prod more than two_sum) looked ok, with some room for improvement (there is opportunity for more v-prefixed opcodes). The arithmetic ops built upon the primitives generate code that appears less flexible / SIMD-ready than necessary. So, I hope, (2) and (1), handled differently, rather than (2) and not (1).
Best, Jeffrey
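As a possible starting point for that review and benchmarking (my sketch; BenchmarkTools.jl assumed, plus the tuple-based add/mul from above):
using BenchmarkTools

# inspect the generated code: scalar double-precision opcodes end in `sd`
# (e.g. vmulsd), packed/vectorized ones in `pd` (e.g. vmulpd, vfmadd...pd)
@code_native two_prod(1.0, 2.0)

# time the tuple kernels over many elements
xs = [(rand(), 1e-17 * rand()) for _ in 1:10^5];
ys = [(rand(), 1e-17 * rand()) for _ in 1:10^5];
@btime map(add, $xs, $ys);
@btime map(mul, $xs, $ys);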
I tried this (requires LoopVectorization 0.6.24 or newer):
using LoopVectorization

function gemm_accurate_kernel!(C, A, B)
    @avx for n in 1:size(C,2), m in 1:size(C,1)
        Cmn_hi = zero(eltype(C))
        Cmn_lo = zero(eltype(C))
        for k in 1:size(B,1)
            # two_prod(A[m,k], B[k,n])
            hiprod = evmul(A[m,k], B[k,n])
            loprod = vfmsub(A[m,k], B[k,n], hiprod)
            # two_sum(hiprod, Cmn_hi)
            hi_ts = evadd(hiprod, Cmn_hi)
            a1_ts = evsub(hi_ts, Cmn_hi)
            b1_ts = evsub(hi_ts, a1_ts)
            lo_ts = evadd(evsub(hiprod, a1_ts), evsub(Cmn_hi, b1_ts))
            # two_sum(loprod, Cmn_lo)
            thi = evadd(loprod, Cmn_lo)
            a1_t = evsub(thi, Cmn_lo)
            b1_t = evsub(thi, a1_t)
            tlo = evadd(evsub(loprod, a1_t), evsub(Cmn_lo, b1_t))
            # two_hilo_sum(hi_ts, lo_ts + thi)
            c1 = evadd(lo_ts, thi)
            hi_ths = evadd(hi_ts, c1)
            lo_ths = evsub(c1, evsub(hi_ths, hi_ts))
            # two_hilo_sum(hi_ths, tlo + lo_ths)
            c2 = evadd(tlo, lo_ths)
            Cmn_hi = evadd(hi_ths, c2)
            Cmn_lo = evsub(c2, evsub(Cmn_hi, hi_ths))
        end
        C[m,n] = Cmn_hi  # store only the high part; C is a Float64 matrix
    end
    C
end
But it doesn't seem to work. I may have made a mistake following the double-float product and sum functions. Maybe I should follow the AccurateArithmetic dot product algorithm.
Example of it not working:
using LinearAlgebra
r = 1e-14:99+1e-14;
qrb = qr(rand(length(r), length(r)));
Q = Matrix{BigFloat}(qrb.Q); # I wanted a random orthogonal matrix more interesting than I
Ab = Q * Diagonal(r) * Q';
Abi = Q * Diagonal(1 ./ r) * Q';
Ab64 = Float64.(Ab); Abi64 = Float64.(Abi);
cond(Ab64), cond(Abi64)
# (6.659871254102312e15, 2.063196098117369e16)
last(r) / first(r) # Should be equal to the above
# 9.900000000000002e15
C = gemm_accurate_kernel!(similar(Ab64), Ab64, Abi64);
sum(abs, C - I), sum(abs, Ab64 * Abi64 - I) # Should be 0
# (241.8293068059044, 251.69357370578118)
Improvement is pretty negligible.
EDIT: Actually, maybe it's doing quite well
julia> sum(abs, Ab * Abi - I)
241.0932487642165916894920436948567838194347574286817003393540001216488095375455
I think the problem is that my Q matrix isn't actually perfectly orthogonal.
Just calculating Abi using inv:
julia> Abiv2 = inv(Ab);
julia> sum(abs, Ab * Abiv2 - I)
6.952146822190708017887559626198144268623694722432354117472718405806550794109339495669806009628687039895162178322961830680289099636242996590885749743681880768558283646634882983369521213016458761115018330939573505303406194198078809992514064880960725997856230716739219579947204304410369227438083786821415686911706e-291
julia> Abi64 = Float64.(Abiv2);
julia> gemm_accurate_kernel!(C, Ab64, Abi64);
julia> sum(abs, C - I), sum(abs, Ab64 * Abi64 - I)
(22.301547859434912, 63.88009463823736)
Not great.
I'm not sure what an EFT is, but from context I assume it means the individual operations in the functions you described. In that case I agree with ffevotte: I don't think much gain is possible there.
When it comes to vectorization, the simplest approach is generally to do multiple operations in parallel. So if you have a loop, calculate multiple iterations of the loop in parallel using SIMD instructions.
Structs of arrays would definitely make things easier/faster, but if you want to support arrays of structs, you could load two vectors of the interleaved hi/lo elements and then use vector shuffles to get homogeneous vectors of hi and lo elements. It is of course faster not to need the permutes, which is why structs of arrays are faster; but since you only need two shuffles on loading (and two on storing), the approach should still give a nice performance boost if you can't change the data layout.
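A scalar illustration (mine, not from the thread) of that deinterleave: an array of Double64 structs is laid out hi, lo, hi, lo, ... in memory, and splitting the odd and even positions yields the homogeneous hi and lo streams. Actual vectorized code would do the same with two vector loads plus two shuffles per pack of elements.
# reinterpret exposes the interleaved Float64 layout of the struct array
function deinterleave(v::Vector{Double64})
    flat = reinterpret(Float64, v)   # hi1, lo1, hi2, lo2, ...
    his = flat[1:2:end]              # positions 1, 3, 5, ...: hi parts
    los = flat[2:2:end]              # positions 2, 4, 6, ...: lo parts
    return his, los
end

julia> deinterleave([Double64(1.0, 1e-17), Double64(2.0, 2e-17)])
([1.0, 2.0], [1.0e-17, 2.0e-17])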