ArrayLayouts.jl
ArrayLayouts.jl copied to clipboard
Extremely slow BLAS muladd when α,β are Int
Hi, perhaps I do not understand something about how ArrayLayouts
are supposed to work and be used, but it seems that BLAS matrix multiplication (both default_blasmul!
and tiled_blasmul!
) is very slow in comparison to 5-argument LinearAlgebra.mul!
.
Below is a reproducible example.
using LinearAlgebra
import ArrayLayouts: default_blasmul!, tiled_blasmul!, tile_size
A, B, C = (rand(1000, 1000) for _ in 1:3)
# mul! is fast
mul!(copy(C), A, B, 1, 1) ≈ A*B + C # true
@btime mul!(copy(C), A, B, 1, 1);
# 6.448 ms (4 allocations: 7.63 MiB)
# default_blasmul! is super slow!!
default_blasmul!(1, A, B, 1, copy(C)) ≈ A*B + C # true
@btime default_blasmul!(1, A, B, 1, copy(C));
# 852.139 ms (3 allocations: 7.63 MiB)
# tiled_blasmul! is also very slow!
ts = tile_size(eltype(A), eltype(B), eltype(C))
tiled_blasmul!(ts, 1, A, B, 1, copy(C)) ≈ A*B + C # true
@btime tiled_blasmul!(ts, 1, A, B, 1, copy(C));
# 402.555 ms (12 allocations: 7.66 MiB)
And this of course affects operations using MulAdd
wrapper. What am I doing wrong? Or is it really a bug?
Version info
Julia Version 1.11.3
Commit d63adeda50d (2025-01-21 19:42 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × AMD Ryzen 9 5900HX with Radeon Graphics
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
Environment:
JULIA_PROJECT = @.
JULIA_PKG_PRESERVE_TIERED_INSTALLED = true
JULIA_REVISE = manual
PKG status
[dce04be8] ArgCheck v2.4.0
[7d9fca2a] Arpack v0.5.4
[4c555306] ArrayLayouts v1.11.0
[86223c79] Graphs v1.12.0
⌃ [0b1a1467] KrylovKit v0.9.3
⌃ [5078a376] LazyArrays v2.4.0
[2ab3a3ac] LogExpFunctions v0.3.29
[5a0628fe] Supposition v0.3.5
[37e2e46d] LinearAlgebra v1.11.0
[2f01184e] SparseArrays v1.11.0
[f489334b] StyledStrings v1.11.0
[8dfed614] Test v1.11.0