
Almost always fastest?

Open PallHaraldsson opened this issue 4 years ago • 2 comments

Hi, on "which falls behind MKL's gemm beyond 70x70 or so" is that mostly outdated text? It wasn't obviously true from the graph, and I noticed you reran benchmarks last month (before ArrayInterface upgrade, would 3.0 improve speed?), and couldn't zoom unless going to:

https://github.com/chriselrod/LoopVectorization.jl/blob/5ba0d186bcd2d6f4fed09fd6ca9f7817e8dd29e2/docs/src/assets/bench_AmulB_v2.png

Yes, around there and sometimes for bigger sizes MKL is only slightly faster (from memory, MKL used to have a much bigger edge), so you might want to change to more positive language. I have been pointing people to these graphs and your awesome work, and want to keep doing so.

I just recently noticed: https://github.com/JuliaLinearAlgebra/Octavian.jl

Is it fair to say OpenBLAS will soon be replaced? Or could be (already)? I know you target Intel with AVX512, but the concepts transfer to ARM and AMD, and there's even some code already for AMD?

As with: https://github.com/JuliaGPU/GemmKernels.jl

you need no assembly? I mean, there is machine code at some level of course, but none hand-written for the high-level (multiply) functions.

I didn't see (or expect) any common code there with yours. I did notice GPUifyLoops.jl, which is archived; should it be using KernelAbstractions.jl instead?

PallHaraldsson avatar Feb 01 '21 14:02 PallHaraldsson

I may have the wrong graph... maybe the text didn't apply to the graph above, and I do see MKL is (still) sometimes ahead in the one below.

Anyway, there's a typo "tranposed", that can at least be fixed.

and one other thing, these stats are off:

Julia 95.9%, Fortran 2.6%, C 1.1%, C++ 0.4%

Right? My guess is they come from some benchmark code that could be a separate package. I'm not sure if you can force an update, or whether it happens automatically eventually.

Would it help (for matmul) to have a different memory organization, neither row- nor column-major, but some block structure (like tiling for framebuffers, which even Nvidia eventually switched to) for locality? It would make indexing harder otherwise (code size, icache use), but maybe not slower. Are you aware of any such Julia packages or research on this?

PallHaraldsson avatar Feb 01 '21 14:02 PallHaraldsson

Hi, on "which falls behind MKL's gemm beyond 70x70 or so" is that mostly outdated text?

Yes, see how far things have come since the ANN: LoopVectorization post on Discourse in January. Notice how it fell behind MKL around 60x60 in that plot, versus now, where MKL actually dips before LoopVectorization does. Note that LoopVectorization (without the aid of cache tiling) will dip very hard at some point; how hard/fast and at what sizes depends on the architecture. That's why we need libraries like Octavian that implement cache-level optimizations (possibly including reshaping and permuting the underlying data like you describe).
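If you want to check the current crossover yourself, a sketch like the following works (sizes are arbitrary illustrative choices; it assumes Octavian and BenchmarkTools are installed):

using LinearAlgebra, Octavian, BenchmarkTools

# Compare whatever BLAS Julia is linked against with Octavian's
# pure-Julia matmul at sizes straddling the old ~70x70 crossover.
for sz in (64, 72, 144, 288)
    A, B = rand(sz, sz), rand(sz, sz); C = similar(A)
    t_blas = @belapsed mul!($C, $A, $B)
    t_oct  = @belapsed Octavian.matmul!($C, $A, $B)
    println(sz, ": BLAS ", t_blas, " s, Octavian ", t_oct, " s")
end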

Is it fair to say OpenBLAS will soon be replaced? Or could be (already)?

Maybe, but by libblastrampoline, which will make it easy to switch BLAS libraries at runtime.
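For a sketch of what that will look like once libblastrampoline is the default BLAS interface (the MKL path below is hypothetical):

using LinearAlgebra

# Redirect all BLAS/LAPACK calls through libblastrampoline to a
# different backing library at runtime:
BLAS.lbt_forward("/opt/intel/mkl/lib/intel64/libmkl_rt.so")
BLAS.get_config()   # reports which backing library is now active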

Octavian is a ways off, in that it just has gemm at the moment. My most immediate plans for extension are to cover more machine learning use cases, e.g. supporting dense layers with activation functions fused into the same multithreaded multiplication.
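To illustrate the kind of fusion I mean (a sketch of the idea, not Octavian's actual API), the activation can be applied while the accumulator is still in registers, instead of in a second sweep over C:

using LoopVectorization

# Sketch: a relu dense layer, C = max.(A*B .+ b, 0), in one fused pass.
function dense_relu!(C, A, B, b)
    @avx for m in axes(A, 1), n in axes(B, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = max(Cmn + b[m], zero(eltype(C)))
    end
    return C
end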

Longer term, I have plans and some early work toward LoopVectorization directly supporting the things needed by BLAS3 routines like trsm or LAPACK potrf, etc. These are of course a hugely important part of a full BLAS library.

I know you target Intel with AVX512, but the concepts transfer to ARM and AMD, and there's even some code already for AMD?

I do, but my goal is to be fast everywhere. Octavian's benchmarks look very good on AMD; see the benchmarks in the following comment, and also in Mason's comment after it: https://github.com/JuliaLinearAlgebra/Octavian.jl/issues/24#issuecomment-766185684

you need no assembly?

LoopVectorization generates the kernels using LLVM intrinsics. The code in Octavian is mostly high level, and the kernels themselves are just @avx for... loops.
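For reference, such a kernel is just the plain version of the fused sketch above (essentially the gemm example from LoopVectorization's documentation):

using LoopVectorization

function gemm_avx!(C, A, B)
    # LoopVectorization analyzes these loops and emits an unrolled,
    # vectorized microkernel via LLVM intrinsics; no hand-written assembly.
    @avx for m in axes(A, 1), n in axes(B, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end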

I didn't see (or expect) any common code there with yours. I did notice GPUifyLoops.jl, which is archived; should it be using KernelAbstractions.jl instead?

Maybe that's worth an issue over there. There are plans to eventually merge GemmKernels.jl into CUDA.jl or GPUArrays.jl. It'd be great to have a cross platform Julia GPU matrix multiplication implementation/library.

Anyway, there's a typo "tranposed", that can at least be fixed.

Good catch. As you note, the text could probably use a pretty big overhaul.

and one other thing, these stats are off:

Julia 95.9%, Fortran 2.6%, C 1.1%, C++ 0.4%

Right? My guess is they come from some benchmark code that could be a separate package. I'm not sure if you can force an update, or whether it happens automatically eventually.

Currently, it is in the benchmark directory, which includes Fortran, C, and C++ source files for benchmarking against. But they are not a part of LoopVectorization, so GitHub reporting LoopVectorization as including Fortran, C, or C++ is misleading. Maybe I should move the benchmark code to a separate package.
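As an aside, if the benchmarks stay in the repository, marking them as vendored should drop them from GitHub's language statistics. A sketch of the relevant .gitattributes line, assuming the sources live under benchmark/:

# .gitattributes
benchmark/** linguist-vendored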

Would it help (for matmul) to have a different memory organization, neither row- nor column-major, but some block structure (like tiling for framebuffers, which even Nvidia eventually switched to) for locality? It would make indexing harder otherwise (code size, icache use), but maybe not slower. Are you aware of any such Julia packages or research on this?

Yes, "tile major" memory layouts have an advantage in locality.

julia> using LoopVectorization, VectorizationBase

julia> Mᵣ, Nᵣ = LoopVectorization.matmul_params()
(static(3), static(9))

julia> Mᵣ * VectorizationBase.pick_vector_width(Float64), Nᵣ
(static(24), static(9))

As an aside, matmul_params doesn't do anything special other than run LoopVectorization's analysis on some gemm loops. It's there so that libraries like Octavian can use it. It should perhaps be moved there, but MaBLAS.jl was also interested, so the code lives in LoopVectorization to avoid duplicating it across every library that wants it.

Back to the point on memory layout: the numbers Mᵣ and Nᵣ give how much LoopVectorization wants to unroll the M and N loops, given an M×N matrix C where C = A * B. Multiplying Mᵣ by the vector width gives the total unrolling of the M loop, as LoopVectorization will generally want to vectorize that loop as well.

So this means that the microkernel on this system is 24x9.

If you pack blocks of A like this:

MᵣW = Mᵣ * VectorizationBase.pick_vector_width(Float64)
mcf = 4; Kc = 72;  # hypothetical blocking parameters: mcf microkernel rows per M-block, Kc per K-block
m = 0; k = 0;      # block offsets; they increase as we iterate over blocks
Aview = view(A, (1:(MᵣW*mcf)) .+ m, (1:Kc) .+ k);
Apack = permutedims(reshape(Aview, (MᵣW, mcf, Kc)), (1, 3, 2))

Now you will access all elements of Apack in the order they're stored in memory.

Similarly, for B:

ncf = 8;  # hypothetical N blocking factor, analogous to mcf
n = 0;    # same k (and Kc) as above
Bview = view(B, (1:Kc) .+ k, (1:(Nᵣ*ncf)) .+ n);
Bpack = permutedims(reshape(Bview, (Kc, Nᵣ, ncf)), (2, 1, 3))

Now the elements of Bpack will be accessed in the order they are stored in memory. This is done by libraries like BLIS and OpenBLAS, but not currently by Octavian.
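For intuition, here is a small self-contained check (with made-up sizes standing in for MᵣW, mcf, and Kc) that a linear traversal of the packed array really does visit A's elements panel by panel:

julia> MᵣW, mcf, Kc = 24, 2, 8;  # made-up sizes for illustration

julia> A = reshape(1:(MᵣW*mcf*Kc), MᵣW*mcf, Kc);

julia> Apack = permutedims(reshape(A, (MᵣW, mcf, Kc)), (1, 3, 2));

julia> vec(Apack)[1:MᵣW] == A[1:MᵣW, 1]  # first MᵣW packed elements = first panel column of A
true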

I've added some preliminary support for representing these layouts more conveniently, and plan to make LoopVectorization aware of them for easier experimentation. We can of course work with the 3D arrays directly, but this doesn't work so well when Aview's or Bview's dimension isn't actually an integer multiple of MᵣW or Nᵣ. Another issue to be addressed is that LoopVectorization isn't that good at transposing loads right now. I finally added support for doing this efficiently to VectorizationBase yesterday, but it uses a new API I haven't switched LoopVectorization over to at large yet. vmap! is already using it, and many special functions are faster with vmap! than with @avx as a result; however, vmap! doesn't support transposed arrays and is super simple, which is why it was easy to move over first.
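For example (assuming a recent LoopVectorization, which exports vmap!):

julia> using LoopVectorization

julia> x = rand(10^4); y = similar(x);

julia> vmap!(exp, y, x);  # SIMD-mapped exp, via the new VectorizationBase API

julia> y ≈ exp.(x)
true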

chriselrod avatar Feb 01 '21 15:02 chriselrod