
Use Octavian.jl for large mixed-mode CPU calculations.

maleadt opened this issue 2 years ago • 7 comments

LinearAlgebra is hilariously slow for large mixed-mode multiplications (i.e., element-type combinations not supported by BLAS):

# non mixed-mode
julia> @benchmark mul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float32, 2048, 2048); B=rand(Float32, 2048, 2048))
BenchmarkTools.Trial: 289 samples with 1 evaluation.
 Range (min … max):  12.774 ms …  15.729 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.110 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.218 ms ± 469.316 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▁ ▁█▄▄▃▄
  ▃▆██▇██████▅▇▅▃▃▁▁▁▁▂▁▁▁▂▂▃▁▂▂▁▁▃▁▁▃▁▂▁▁▁▂▂▂▁▁▁▁▁▁▂▁▁▁▂▁▁▃▂▂ ▃
  12.8 ms         Histogram: frequency by time         15.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

# mixed-mode
julia> @benchmark mul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float16, 2048, 2048); B=rand(Float16, 2048, 2048))
BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  8.342 s …   8.429 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.361 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.375 s ± 28.960 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁        █ ▁ ▁                    ▁▁                    ▁
  █▁▁▁▁▁▁▁▁█▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  8.34 s         Histogram: frequency by time        8.43 s <

 Memory estimate: 20.81 KiB, allocs estimate: 3.
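The gap comes from BLAS only providing gemm kernels for homogeneous element types, so the mixed-eltype mul! above falls back to Julia's generic matmul. A rough sketch of the eligibility rule (an approximation for illustration, not LinearAlgebra's exact dispatch logic):

```julia
using LinearAlgebra

# Rough predicate for whether C = A*B can go straight to BLAS:
# all three eltypes must match and be one of the four BLAS scalar types.
# This approximates LinearAlgebra's dispatch; it is not the exact rule.
blas_eligible(C, A, B) =
    eltype(C) == eltype(A) == eltype(B) &&
    eltype(C) in (Float32, Float64, ComplexF32, ComplexF64)

blas_eligible(zeros(Float32, 2, 2), rand(Float32, 2, 2), rand(Float32, 2, 2))  # true
blas_eligible(zeros(Float32, 2, 2), rand(Float16, 2, 2), rand(Float16, 2, 2))  # false
```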

Octavian.jl fares quite a bit better:

julia> @benchmark matmul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float32, 2048, 2048); B=rand(Float32, 2048, 2048))
BenchmarkTools.Trial: 452 samples with 1 evaluation.
 Range (min … max):  128.814 ms … 132.015 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     129.092 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   129.234 ms ± 412.416 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▂▂ ▂█▂▂
  ▄▄▆██████████▇▅▆▅▃▅▃▄▃▃▄▄▂▂▃▃▃▃▄▂▃▃▃▃▁▃▂▃▃▁▁▁▃▂▂▂▁▂▃▂▃▃▂▁▃▃▂▃ ▃
  129 ms           Histogram: frequency by time          130 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

However, replacing all of our LinearAlgebra.mul! uses with Octavian.matmul! regresses test time. @chriselrod is that expected? I guess there's a significant compilation-time overhead for invoking Octavian.jl with many differently typed and sized inputs?

For now, only use Octavian.jl for large mixed-mode cases, which restores test times to what they were before https://github.com/JuliaGPU/GemmKernels.jl/pull/124.
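Such conditional routing could look roughly like the following; this is a minimal sketch, where smart_mul!, the 512-element threshold, and the eltype check are all illustrative rather than the PR's actual code:

```julia
using LinearAlgebra
using Octavian

# Hypothetical helper: route only large mixed-eltype products to
# Octavian.matmul!, and everything else to LinearAlgebra.mul!, which
# hits BLAS for homogeneous eltypes. The threshold is illustrative.
function smart_mul!(C, A, B)
    mixed = !(eltype(C) == eltype(A) == eltype(B))
    large = minimum(size(C)) >= 512
    if mixed && large
        Octavian.matmul!(C, A, B)
    else
        LinearAlgebra.mul!(C, A, B)
    end
    return C
end
```

In the PR itself the routing presumably happens at the existing call sites rather than through a standalone helper like this.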

maleadt avatar Jul 02 '23 08:07 maleadt

Benchmark results for commit 4dac74324ab2260be1d7efdf9ab3612a9c0347b1 (comparing to 51bf8ee904b1624e1202b838c9e922ae0cd26e64): No regressions or improvements detected.

maleadt avatar Jul 02 '23 08:07 maleadt

Interestingly, this only speeds up 1.9. I can't imagine Octavian.jl being that much slower on <1.9?

maleadt avatar Jul 02 '23 09:07 maleadt

Codecov Report

Patch and project coverage have no change.

Comparison is base (781f1de) 30.27% compared to head (4dac743) 30.27%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #125   +/-   ##
=======================================
  Coverage   30.27%   30.27%           
=======================================
  Files          11       11           
  Lines         786      786           
=======================================
  Hits          238      238           
  Misses        548      548           

View full report in Codecov by Sentry.

codecov[bot] avatar Jul 02 '23 09:07 codecov[bot]

For timings, I get

julia> @time using Octavian
  0.217284 seconds (396.12 k allocations: 21.375 MiB, 6.10% gc time, 6.09% compilation time)

julia> @benchmark matmul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float32, 2048, 2048); B=rand(Float32, 2048, 2048))
BenchmarkTools.Trial: 13 samples with 1 evaluation.
 Range (min … max):  43.139 ms …  44.684 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     43.791 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   43.750 ms ± 447.341 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁  ▁   ▁   ▁ ▁   ▁       ▁▁█              ▁ ▁              ▁  
  █▁▁█▁▁▁█▁▁▁█▁█▁▁▁█▁▁▁▁▁▁▁███▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  43.1 ms         Histogram: frequency by time         44.7 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark matmul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float16, 2048, 2048); B=rand(Float16, 2048, 2048))
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  42.711 ms …  43.548 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     43.004 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   43.067 ms ± 267.509 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁      ▁▁  █ █              ▁▁ ▁   ▁         ▁            ▁▁  
  █▁▁▁▁▁▁██▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁█▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁██ ▁
  42.7 ms         Histogram: frequency by time         43.5 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float32, 2048, 2048); B=rand(Float32, 2048, 2048))
BenchmarkTools.Trial: 19 samples with 1 evaluation.
 Range (min … max):  44.262 ms … 54.795 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     45.080 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   47.153 ms ±  3.564 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

     █▂                                                        
  ▅▅▁██▅▁▅▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▅▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▅ ▁
  44.3 ms         Histogram: frequency by time        54.8 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.10.0-DEV.1608
Commit 0e8af1c162 (2023-06-30 04:06 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
  Threads: 11 on 8 virtual cores
Environment:
  JULIA_PATH = @.
  LD_LIBRARY_PATH = /usr/local/lib/
  JULIA_NUM_THREADS = 8

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries: 
├ [ILP64] libmkl_rt.so
└ [ LP64] libmkl_rt.so

These timings, aside from mul!, are much better than what you report here, and my laptop isn't a particularly powerful machine. Perhaps you started Julia with only a single thread?

Then again, GitHub Actions CI is generally restricted to a single core, so single-threaded is probably representative. I don't know about Buildkite.

chriselrod avatar Jul 02 '23 09:07 chriselrod

Interestingly, this only speeds up 1.9. I can't imagine Octavian.jl being that much slower on <1.9?

I'm surprised the cutoff isn't 1.8, as 1.8 added --code-coverage=user, which made a tremendous difference versus --code-coverage=all for Octavian.

However, replacing all of our LinearAlgebra.mul! uses with Octavian.matmul! regresses test time. @chriselrod is that expected? I guess there's a significant compilation-time overhead for invoking Octavian.jl with many differently typed and sized inputs?

It should not be compiling for differently sized inputs, only for different types. That said, latency is significant. Without code coverage:

julia> C=zeros(Float32, 2048, 2048); A=rand(Float16, 2048, 2048); B=rand(Float16, 2048, 2048);   

julia> @time using Octavian
  0.205357 seconds (396.14 k allocations: 21.375 MiB, 2.34% gc time, 6.29% compilation time)

julia> @time @eval matmul!(C,A,B);
 10.354272 seconds (25.52 M allocations: 1.312 GiB, 2.72% gc time, 99.67% compilation time)

With code coverage:

julia> @time @eval matmul!(C,A,B);
202.818763 seconds (82.94 M allocations: 3.568 GiB, 0.28% gc time, 34.71% compilation time)

But hopefully only GemmKernels' coverage gets collected with --code-coverage=user?

chriselrod avatar Jul 02 '23 09:07 chriselrod

Thanks for the input!

Yes, we're only using a single thread, as we use multiple processes to run tests in parallel. However, I had not started with OPENBLAS_NUM_THREADS=1, so the OpenBLAS comparison above was unfair; with that set, the numbers are much closer to what you report. Still, running the entire GemmKernels.jl test suite with Octavian.jl is much slower than with OpenBLAS. I'll have to look into this more closely.

But hopefully only GemmKernels' coverage gets collected with --code-coverage=user?

We're just passing coverage=true to Pkg.test. It doesn't look like that uses --code-coverage=user (https://github.com/JuliaLang/Pkg.jl/blob/e8197dd0ed8132d4a7619f3657363c8415249c47/src/Operations.jl#L1672-L1681), but I don't think it's doing the equivalent of =all either?
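For reference, the coverage granularity can also be controlled by launching the test process directly; a hedged sketch, since the exact semantics of each mode depend on the Julia version (1.8+ additionally accepts a --code-coverage=@path form to restrict tracking to one directory):

```julia
# Run the test suite in a subprocess with line coverage limited to
# non-Base code (--code-coverage=user) instead of instrumenting
# everything (--code-coverage=all). The project path is illustrative.
run(`julia --project=. --code-coverage=user -e 'using Pkg; Pkg.test()'`)
```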

maleadt avatar Jul 02 '23 19:07 maleadt

But hopefully only GemmKernels' coverage gets collected with --code-coverage=user?

We're just passing coverage=true to Pkg.test. It doesn't look like that uses --code-coverage=user (https://github.com/JuliaLang/Pkg.jl/blob/e8197dd0ed8132d4a7619f3657363c8415249c47/src/Operations.jl#L1672-L1681), but I don't think it's doing the equivalent of =all either?

Disabling coverage on 1.6-1.8 didn't help, so this seems like a different issue.

maleadt avatar Jul 03 '23 08:07 maleadt