LoopVectorization.jl
Parallelization of predefined functions (e.g. IntelVectorMath functions)
It would be nice if we provided a macro that replaces functions with their vectorized versions.
For example, @ivm @. sin(x) would replace the call with the IntelVectorMath function, and @applacc @. sin(x) would call AppleAccelerate.
We could provide such macros from IntelVectorMath.jl, or else keep all of them in one place, e.g. inside LoopVectorization.jl.
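For illustration, the intended usage would look something like this (both macros are hypothetical at this point; the expansions in the comments use the packages' existing array functions):

x = rand(10^4)
@ivm @. sin(x)       # intended expansion: IntelVectorMath.sin(x)
@applacc @. sin(x)   # intended expansion: AppleAccelerate.sin(x)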
Quoting @chriselrod:
The major improvement these provide is that they're vectorized. If x is a scalar, then there isn't much benefit, if there is any at all.
An earlier version of LoopVectorization provided an @vectorize macro (since removed) that naively swapped calls and changed the loop increment (i.e., instead of iterating over 1:N, it would iterate over 1:W:N, plus some code to handle the remainder). @avx does this better.
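Roughly, that strip-mining transformation looked like the following hand-written illustration (not the macro's actual output; W stands for the hardware SIMD width):

function strip_mined_sum(x, W = 4)   # W = 4 Float64s per register with AVX2
    N = length(x)
    s = zero(eltype(x))
    i = 1
    while i + W - 1 <= N             # main loop: W elements per step,
        s += sum(@view x[i:i+W-1])   # standing in for a single SIMD operation
        i += W
    end
    while i <= N                     # scalar loop handling the remainder
        s += x[i]
        i += 1
    end
    s
end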
If x is a vector, calling @avx sin.(x) or IntelVectorMath.sin(x) works (although a macro could search a whole block of code and swap calls to use IntelVectorMath).
I've been planning on adding "loop splitting" support in LoopVectorization for a little while now (splitting one loop into several). It would be possible to extend this to moving special functions into their own "loop" (a single vectorized call) and using VML (or some other library).
I would prefer "short vector" functions in general. They wouldn't require any changes to the library to support, nor any special casing. E.g., this works well with AVX2:
julia> using LinearAlgebra, LoopVectorization, IntelVectorMath, BenchmarkTools
julia> U = randn(200, 220) |> x -> cholesky(Symmetric(x * x')).U;
julia> function triangle_logdet(A::Union{LowerTriangular,UpperTriangular})
           ld = zero(eltype(A))
           @avx for i in 1:size(A,1)
               ld += log(A[i,i])
           end
           ld
       end
triangle_logdet (generic function with 1 method)
julia> @btime logdet($U)
2.131 μs (0 allocations: 0 bytes)
462.0132368439299
julia> @btime triangle_logdet($U)
1.076 μs (0 allocations: 0 bytes)
462.0132368439296
julia> Float64(sum(log ∘ big, diag(U)))
462.0132368439296
Presumably, VML does not handle vectors with a stride other than 1, which would force me to copy the elements, log them, and then sum them if I wanted to use it there. Assuming it's able to use some pre-allocated buffer...
julia> y3 = similar(diag(U));
julia> function triangle_logdet_vml!(y, A::Union{LowerTriangular, UpperTriangular})
           @avx for i ∈ 1:size(A,1)
               y[i] = A[i,i]
           end
           IntelVectorMath.log!(y, y)
           ld = zero(eltype(y))
           @avx for i ∈ eachindex(y)
               ld += y[i]
           end
           ld
       end
triangle_logdet_vml! (generic function with 1 method)
julia> @btime triangle_logdet_vml!($y3, $U)
697.691 ns (0 allocations: 0 bytes)
462.0132368439296
It looks like all that effort would pay off, so I'm open to it. Long term I would still be in favor of implementing more of these special functions in Julia or LLVM, but this may be the better short term move. I also don't see many people jumping at the opportunity to implement SIMD versions of special functions (myself included).
Too bad VML isn't more expansive. Adding it wouldn't do much to increase the number of special functions currently supported by SLEEFPirates/LoopVectorization. I've been wanting a digamma function, for example. I'll probably try the approach suggested by Wikipedia.
How well does VML perform on AMD? Is that something I'd have to worry about?
EDIT: With AVX512:
julia> using LinearAlgebra, LoopVectorization, IntelVectorMath, BenchmarkTools
julia> U = randn(200, 220) |> x -> cholesky(Symmetric(x * x')).U;
julia> function triangle_logdet(A::Union{LowerTriangular,UpperTriangular})
           ld = zero(eltype(A))
           @avx for i in 1:size(A,1)
               ld += log(A[i,i])
           end
           ld
       end
triangle_logdet (generic function with 1 method)
julia> @btime logdet($U)
1.426 μs (0 allocations: 0 bytes)
463.5193875385334
julia> @btime triangle_logdet($U)
234.677 ns (0 allocations: 0 bytes)
463.5193875385336
julia> Float64(sum(log ∘ big, diag(U)))
463.51938753853364
julia> y3 = similar(diag(U));
julia> function triangle_logdet_vml!(y, A::Union{LowerTriangular, UpperTriangular})
           @avx for i ∈ 1:size(A,1)
               y[i] = A[i,i]
           end
           IntelVectorMath.log!(y, y)
           ld = zero(eltype(y))
           @avx for i ∈ eachindex(y)
               ld += y[i]
           end
           ld
       end
triangle_logdet_vml! (generic function with 1 method)
julia> @btime triangle_logdet_vml!($y3, $U)
411.110 ns (0 allocations: 0 bytes)
463.51938753853364
With AVX512, it uses this log definition. I'd be more inclined to add something similar for AVX2. For this benchmark, the Intel compilers produce faster code.
My response:
I just wanted to clarify what I mean in this issue, so everyone is on the same page.
We can consider three kinds of syntax for the macro (I use @ivm to avoid confusion):
- A simple macro that only searches the given Expr for the functions that IntelVectorMath provides and prepends IVM. to their names (a sketch of such a macro follows this list):
a = rand(100)
@ivm sin.(a) .* cos.(a) .* sum.(a)
should be translated to:
IVM.sin(a) .* IVM.cos(a) .* sum.(a)
- A macro that converts broadcasts into IVM calls (which I think is more in line with your example):
a = rand(100)
@ivm sin.(a) .* cos.(a)
which, as in option 1, is translated to:
IVM.sin(a) .* IVM.cos(a)
But in this case, the remaining functions can be handled by a for loop with @avx:
a = rand(100)
@ivm sin.(a) .* cos.(a) .* sum.(a)
should be translated to:
out = Vector{eltype(a)}(undef, length(a))
temp = IVM.sin(a) .* IVM.cos(a)
@avx for i = 1:length(a)
    out[i] = temp[i] * sum(a[i])
end
out
- Or, similar to option 2 but probably more efficient: we can fuse the loops (the internal IntelVectorMath loop and the for loop) together, calling IntelVectorMath on only one element at a time (each scalar is wrapped in a one-element vector, since IntelVectorMath operates on arrays):
out = Vector{eltype(a)}(undef, length(a))
@avx for i = 1:length(a)
    out[i] = IVM.sin([a[i]])[1] * IVM.cos([a[i]])[1] * sum(a[i])
end
out
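As a concrete reference for option 1, here is a minimal sketch of an expression-walking macro. The list in IVM_FUNCS is an assumption (a real implementation would enumerate what IntelVectorMath actually provides), and the rewrite targets the IVM.sin(a)-style array API shown above:

using IntelVectorMath
const IVM = IntelVectorMath

# Functions assumed (for this sketch) to have IntelVectorMath versions.
const IVM_FUNCS = (:sin, :cos, :tan, :exp, :log)

swap_ivm(x) = x  # leave non-Expr leaves (symbols, literals) untouched
function swap_ivm(ex::Expr)
    # `sin.(a)` parses as Expr(:., :sin, Expr(:tuple, :a))
    if ex.head === :. && length(ex.args) == 2 && ex.args[1] isa Symbol &&
            ex.args[1] in IVM_FUNCS && ex.args[2] isa Expr && ex.args[2].head === :tuple
        return Expr(:call, :(IVM.$(ex.args[1])), map(swap_ivm, ex.args[2].args)...)
    end
    return Expr(ex.head, map(swap_ivm, ex.args)...)
end

macro ivm(ex)
    esc(swap_ivm(ex))
end

With this, @ivm sin.(a) .* cos.(a) .* sum.(a) expands to IVM.sin(a) .* IVM.cos(a) .* sum.(a), leaving functions IntelVectorMath doesn't provide (like sum) untouched.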
When someone uses @ivm, it means they want sin transformed into IVM.sin.
Mixing libraries would still be possible:
(@ivm sin.(a) .* sin.(b)) .* Base.sin.(a)
So which syntax do we want to go with?
Related:
- Scalar methods: https://github.com/JuliaMath/IntelVectorMath.jl/issues/44
Came up in: https://github.com/JuliaMath/IntelVectorMath.jl/issues/22#issuecomment-582059753, https://github.com/JuliaMath/IntelVectorMath.jl/issues/43, https://github.com/JuliaMath/IntelVectorMath.jl/pull/42
If all we want to do is add code that doesn't otherwise mix with LoopVectorization, I think it should go into its own standalone library.
Currently, on some architecture & OS combinations, LoopVectorization is already using faster special function implementations than IntelVectorMath, so it would take work to integrate it well, and even then it wouldn't always be an upside.
Option 3 would be much less efficient than both option 2 and what LoopVectorization does now. On a Linux machine with AVX512 and a recent version of GLIBC:
julia> using IntelVectorMath, LoopVectorization, BenchmarkTools
julia> x = rand(1000); y1 = similar(x); y2 = similar(x); y3 = similar(x);
julia> @benchmark $y1 .= sin.($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.937 μs (0.00% GC)
  median time:      5.100 μs (0.00% GC)
  mean time:        5.106 μs (0.00% GC)
  maximum time:     7.883 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7
julia> @benchmark IntelVectorMath.sin!($y2, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.720 μs (0.00% GC)
  median time:      1.731 μs (0.00% GC)
  mean time:        1.735 μs (0.00% GC)
  maximum time:     2.960 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10
julia> @benchmark @avx $y3 .= sin.($x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     637.274 ns (0.00% GC)
  median time:      642.571 ns (0.00% GC)
  mean time:        643.363 ns (0.00% GC)
  maximum time:     894.708 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     168
Adding support for special functions that aren't already supported would be a good thing, and I'm a fan of performance improvements in general. This example, as well as the logdet example, can both be improved with IntelVectorMath on certain setups (like a Haswell CPU with an old GLIBC that doesn't provide libmvec) if you use an approach like option 2. I'd be in favor of improving performance on those machines, so long as it doesn't hurt performance on others (meaning it should decide what to do based on whether or not building SLEEFPirates successfully found a libmvec, and possibly other criteria as well).
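A hedged sketch of that decision logic (found_libmvec is a hypothetical check, not actual SLEEFPirates or LoopVectorization API; the two implementations use calls demonstrated above):

using LoopVectorization, IntelVectorMath

found_libmvec() = Sys.islinux()  # placeholder criterion, for this sketch only

if found_libmvec()
    # A fast vector libm is available: @avx broadcasting already wins here.
    vsin!(y, x) = @avx y .= sin.(x)
else
    # No fast vector libm found at build time: fall back to VML.
    vsin!(y, x) = IntelVectorMath.sin!(y, x)
end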
I do think a better solution for improving special function performance is to work on improving or replacing the implementations in SLEEFPirates.jl.
We should also use the functions that https://github.com/tkf/ThreadsX.jl provides.
> If all we want to do is add code that doesn't otherwise mix with LoopVectorization, I think it should go into its own standalone library. Currently, on some architecture & OS combinations, LoopVectorization is already using faster special function implementations than IntelVectorMath, so it would take work to integrate it well, and even then it wouldn't always be an upside.
I am not sure which way we should take.
On one side, @KristofferC says we should keep libraries dumb; on the other, I totally get your point about loop fusion and low-level parallelization (what @avx does).
I can think of a few main issues with manual low-level parallelization of everything:
- We don't always want to fuse the loops. In many applications we calculate things sequentially; we just want to throw @avx at a block to parallelize as many functions as possible:
# An arbitrary example:
@avx begin
    # x, y, z have the same size
    y = sin(x)
    z = cos(y)
    # a, b have the same size
    a = log(b)
    w = (z .* y) * a
end
- For Intel, the source is closed, so we can only use the APIs (oneAPI functions).
- For some functions, like those in https://github.com/tkf/ThreadsX.jl, reimplementing them to fuse the loops may be hard.
I think when you have such an efficient looping machine, there is no need to use VML, which only adds overhead.
It would be great to be able to use elementwise (SIMD-element) libraries in a loop (Intel SVML, SLEEF, and, as you showed, some functions in GLIBC).
Do you think it will be part of this package or should it be a dedicated package?
This package uses SLEEFPirates, which includes a mix of functions ported from SLEEF 2, llvmcalls of LLVM bitcode produced by compiling SLEEF 3 with the option to emit bitcode, and sometimes GLIBC.
More "elementwise" (short-vector, i.e., SIMD-vector-width) functions, and better implementations, are always welcome. I didn't check how Intel's compilers compare in accuracy, but they're faster on basically any code I tried that uses special functions.
This is nice. I wasn't aware of that.
I thought the package only provided something like the Intel compiler's vector pragma.
Can you also compare it to the latest SLEEF? Regarding accuracy, when I compared them, Intel was as accurate, if not more so, while being faster.
Anyhow, my point was that if one generates an efficient loop, there is no reason to use VML (when talking single-threaded). Though probably one day multithreading + SIMD will be merged (then VML will be just overhead).
By the way, it would be nice to annotate @avx with more information (I would also use the name @vectorize instead of @avx, so it isn't tied to a specific ISA).