GemmKernels.jl
Use Octavian.jl for large mixed-mode CPU calculations.
LinearAlgebra is hilariously slow for large mixed-mode multiplications (i.e. element-type combinations not supported by BLAS):
# non-mixed-mode
julia> @benchmark mul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float32, 2048, 2048); B=rand(Float32, 2048, 2048))
BenchmarkTools.Trial: 289 samples with 1 evaluation.
Range (min … max): 12.774 ms … 15.729 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.110 ms ┊ GC (median): 0.00%
Time (mean ± σ): 13.218 ms ± 469.316 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁ ▁█▄▄▃▄
▃▆██▇██████▅▇▅▃▃▁▁▁▁▂▁▁▁▂▂▃▁▂▂▁▁▃▁▁▃▁▂▁▁▁▂▂▂▁▁▁▁▁▁▂▁▁▁▂▁▁▃▂▂ ▃
12.8 ms Histogram: frequency by time 15.4 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
# mixed-mode
julia> @benchmark mul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float16, 2048, 2048); B=rand(Float16, 2048, 2048))
BenchmarkTools.Trial: 8 samples with 1 evaluation.
Range (min … max): 8.342 s … 8.429 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.361 s ┊ GC (median): 0.00%
Time (mean ± σ): 8.375 s ± 28.960 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ █ ▁ ▁ ▁▁ ▁
█▁▁▁▁▁▁▁▁█▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
8.34 s Histogram: frequency by time 8.43 s <
Memory estimate: 20.81 KiB, allocs estimate: 3.
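The slow path here is LinearAlgebra's generic fallback: BLAS only provides uniform-eltype gemm, so the Float16 × Float16 → Float32 case never reaches it. As a sanity check, @which shows the call dispatching to a generic method rather than a BLAS wrapper (small matrices used purely for illustration):
julia> using LinearAlgebra
julia> C = zeros(Float32, 4, 4); A = rand(Float16, 4, 4); B = rand(Float16, 4, 4);
julia> @which mul!(C, A, B)   # resolves to the generic (non-BLAS) mul! method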
Octavian.jl fares quite a bit better:
julia> @benchmark matmul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float32, 2048, 2048); B=rand(Float32, 2048, 2048))
BenchmarkTools.Trial: 452 samples with 1 evaluation.
Range (min … max): 128.814 ms … 132.015 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 129.092 ms ┊ GC (median): 0.00%
Time (mean ± σ): 129.234 ms ± 412.416 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▂ ▂█▂▂
▄▄▆██████████▇▅▆▅▃▅▃▄▃▃▄▄▂▂▃▃▃▃▄▂▃▃▃▃▁▃▂▃▃▁▁▁▃▂▂▂▁▂▃▂▃▃▂▁▃▃▂▃ ▃
129 ms Histogram: frequency by time 130 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
However, replacing all of our LinearAlgebra.mul! uses with Octavian.matmul! regresses test time. @chriselrod is that expected? I guess there's a significant compilation-time overhead for invoking Octavian.jl with many differently typed and sized inputs?
For now, only use Octavian for large mixed-mode cases, which gets test times back to what they were before https://github.com/JuliaGPU/GemmKernels.jl/pull/124.
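Roughly, the dispatch looks like the following sketch (the helper name and the 512 cutoff are hypothetical placeholders, not the exact values used in this PR):
using LinearAlgebra: mul!
using Octavian: matmul!

# Route large mixed-eltype products to Octavian; everything else keeps
# using LinearAlgebra's (BLAS-backed) mul!.
function maybe_octavian_mul!(C, A, B)
    mixed = !(eltype(C) == eltype(A) == eltype(B))
    large = min(size(C, 1), size(C, 2), size(A, 2)) >= 512  # hypothetical cutoff
    return mixed && large ? matmul!(C, A, B) : mul!(C, A, B)
end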
Benchmark results for commit 4dac74324ab2260be1d7efdf9ab3612a9c0347b1 (comparing to 51bf8ee904b1624e1202b838c9e922ae0cd26e64): No regressions or improvements detected.
Interestingly, this only speeds up the tests on Julia 1.9. I can't imagine Octavian.jl being that much slower on <1.9?
Codecov Report
Patch and project coverage have no change.
Comparison is base (781f1de) 30.27% compared to head (4dac743) 30.27%.
Additional details and impacted files
@@ Coverage Diff @@
## master #125 +/- ##
=======================================
Coverage 30.27% 30.27%
=======================================
Files 11 11
Lines 786 786
=======================================
Hits 238 238
Misses 548 548
For timings, I get:
julia> @time using Octavian
0.217284 seconds (396.12 k allocations: 21.375 MiB, 6.10% gc time, 6.09% compilation time)
julia> @benchmark matmul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float32, 2048, 2048); B=rand(Float32, 2048, 2048))
BenchmarkTools.Trial: 13 samples with 1 evaluation.
Range (min … max): 43.139 ms … 44.684 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 43.791 ms ┊ GC (median): 0.00%
Time (mean ± σ): 43.750 ms ± 447.341 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁ ▁ ▁ ▁ ▁ ▁▁█ ▁ ▁ ▁
█▁▁█▁▁▁█▁▁▁█▁█▁▁▁█▁▁▁▁▁▁▁███▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
43.1 ms Histogram: frequency by time 44.7 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark matmul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float16, 2048, 2048); B=rand(Float16, 2048, 2048))
BenchmarkTools.Trial: 14 samples with 1 evaluation.
Range (min … max): 42.711 ms … 43.548 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 43.004 ms ┊ GC (median): 0.00%
Time (mean ± σ): 43.067 ms ± 267.509 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁▁ █ █ ▁▁ ▁ ▁ ▁ ▁▁
█▁▁▁▁▁▁██▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁█▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁██ ▁
42.7 ms Histogram: frequency by time 43.5 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark mul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float32, 2048, 2048); B=rand(Float32, 2048, 2048))
BenchmarkTools.Trial: 19 samples with 1 evaluation.
Range (min … max): 44.262 ms … 54.795 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 45.080 ms ┊ GC (median): 0.00%
Time (mean ± σ): 47.153 ms ± 3.564 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█▂
▅▅▁██▅▁▅▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▅▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▅ ▁
44.3 ms Histogram: frequency by time 54.8 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> versioninfo()
Julia Version 1.10.0-DEV.1608
Commit 0e8af1c162 (2023-06-30 04:06 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
Threads: 11 on 8 virtual cores
Environment:
JULIA_PATH = @.
LD_LIBRARY_PATH = /usr/local/lib/
JULIA_NUM_THREADS = 8
julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
├ [ILP64] libmkl_rt.so
└ [ LP64] libmkl_rt.so
Aside from mul!, those are much better timings than the ones you report here.
My laptop isn't a particularly powerful machine.
Perhaps you started Julia with only a single thread?
That said, GitHub Actions CI is generally restricted to a single core, so single-threaded is probably representative. I don't know about Buildkite.
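For what it's worth, both thread pools can be checked from the REPL:
julia> Threads.nthreads()       # Julia threads available to Octavian
julia> using LinearAlgebra
julia> BLAS.get_num_threads()   # threads the loaded BLAS uses for mul!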
Interestingly, this only speeds up the tests on Julia 1.9. I can't imagine Octavian.jl being that much slower on <1.9?
I'm surprised it isn't <1.8, as 1.8 added --code-coverage=user, which made a tremendous difference vs --code-coverage=all for Octavian.
However, replacing all of our LinearAlgebra.mul! uses with Octavian.matmul! regresses test time. @chriselrod is that expected? I guess there's a significant compilation-time overhead for invoking Octavian.jl with many differently typed and sized inputs?
It should not be compiling for differently sized inputs, only for different types. That said, latency is significant. Without code coverage:
julia> C=zeros(Float32, 2048, 2048); A=rand(Float16, 2048, 2048); B=rand(Float16, 2048, 2048);
julia> @time using Octavian
0.205357 seconds (396.14 k allocations: 21.375 MiB, 2.34% gc time, 6.29% compilation time)
julia> @time @eval matmul!(C,A,B);
10.354272 seconds (25.52 M allocations: 1.312 GiB, 2.72% gc time, 99.67% compilation time)
With code coverage:
julia> @time @eval matmul!(C,A,B);
202.818763 seconds (82.94 M allocations: 3.568 GiB, 0.28% gc time, 34.71% compilation time)
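Back to the sizes-vs-types point, here is a sketch of the expected behavior (the comments describe what should happen, not measured output):
julia> @time matmul!(zeros(Float32, 64, 64), rand(Float16, 64, 64), rand(Float16, 64, 64));        # compiles
julia> @time matmul!(zeros(Float32, 256, 256), rand(Float16, 256, 256), rand(Float16, 256, 256));  # same eltypes, no recompilation
julia> @time matmul!(zeros(Float64, 64, 64), rand(Float32, 64, 64), rand(Float32, 64, 64));        # new eltypes, compiles again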
But hopefully only GemmKernels' coverage gets taken with --coverage=user?
Thanks for the input!
Yes, we're only using a single thread, as we use multiple processes to run multiple tests in parallel.
However, I had not started with OPENBLAS_NUM_THREADS=1, so the comparison to OpenBLAS above was unfair; it is actually much closer to what you report. Still, running the entire GemmKernels.jl test suite with Octavian.jl is much slower than with OpenBLAS. I'll have to look into this more closely.
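For the record, the fair single-threaded comparison amounts to pinning BLAS as well, either via OPENBLAS_NUM_THREADS=1 in the environment or at runtime:
julia> using LinearAlgebra
julia> BLAS.set_num_threads(1)   # match the single Julia thread each test process gets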
But hopefully only GemmKernels' coverage gets taken with --coverage=user?
We're just setting coverage=true with Pkg.test. Judging from https://github.com/JuliaLang/Pkg.jl/blob/e8197dd0ed8132d4a7619f3657363c8415249c47/src/Operations.jl#L1672-L1681, that doesn't use --coverage=user, but I don't think it's doing the equivalent of =all either?
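Concretely, that's just:
julia> using Pkg
julia> Pkg.test("GemmKernels"; coverage=true)
which, per the Operations.jl lines linked above, launches the sandboxed test process with a --code-coverage flag computed by Pkg.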
Disabling coverage on 1.6-1.8 didn't help, so this seems like a different issue.