Chris Elrod
But an example of what I'm seeing:
```julia
julia> include("/home/chriselrod/.julia/dev/PaddedMatrices/benchmark/blasbench.jl")
plot (generic function with 3 methods)

julia> mkl_set_num_threads(18)

julia> openblas_set_num_threads(18)

julia> M = K = N = 80;

julia> A...
```
But to be fair, performance isn't great at larger sizes either, so I can't blame everything on overhead. lol
And trying `10_000`x`10_000`:
```julia
julia> M = K = N...
```
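The `blasbench.jl` script isn't reproduced here (`mkl_set_num_threads`/`openblas_set_num_threads` are helpers it defines), so here is a minimal sketch of the kind of small-size comparison above, using only `BenchmarkTools` and Julia's bundled OpenBLAS; the 18-thread setting and sizes are illustrative:
```julia
# Minimal sketch, not the actual blasbench.jl script: time OpenBLAS gemm
# at a small size where threading overhead dominates the work.
using LinearAlgebra, BenchmarkTools

M = K = N = 80
A = rand(M, K); B = rand(K, N); C = similar(A, M, N);

BLAS.set_num_threads(1)
@btime mul!($C, $A, $B);   # single-threaded baseline

BLAS.set_num_threads(18)   # assumes an 18-core machine, as in the post
@btime mul!($C, $A, $B);   # overhead is large relative to the work at 80x80
```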
I ran those benchmarks last night. I'm now creating a library to better organize BLAS benchmarks:
 https://github.com/chriselrod/BLASBenchmarks.jl
Multi-threaded. The plot you shared only goes up to 10^3 = 1000; Tullio only starts lagging beyond that point on my computer. It's impressive how close Tullio was in that benchmark. Note...
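For reference, the Tullio matmul being benchmarked is presumably something along these lines (a sketch, not the exact benchmark code):
```julia
# Sketch of a Tullio-based matmul; Tullio multithreads and uses
# LoopVectorization's kernels when LoopVectorization is also loaded.
using Tullio, LoopVectorization

function tmul!(C, A, B)
    @tullio C[i, j] = A[i, k] * B[k, j]
    return C
end

A = rand(1000, 1000); B = rand(1000, 1000); C = similar(A);
tmul!(C, A, B)
C ≈ A * B  # sanity check against the BLAS result
```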
Also, another experiment: replicating the Task API with channels:
```julia
using Base.Threads
nthreads() > 1 || exit()
const FCHANNEL = [Channel{Ptr{Cvoid}}(1) for _ ∈ 2:nthreads()];
function crun(chn)
    f = take!(chn)...
```
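The excerpt is cut off; the idea, roughly, is one long-lived worker task per extra thread blocking on a `Channel`, with work submitted via `put!`. A self-contained sketch of that pattern (closures instead of the post's `Ptr{Cvoid}` function pointers, for simplicity):
```julia
# Sketch of channel-fed worker tasks: each worker blocks on take! and runs
# whatever closure it receives. This illustrates the pattern, not the
# actual FCHANNEL/crun implementation from the post.
using Base.Threads
nthreads() > 1 || error("start Julia with -t2 or more")

const WORK = [Channel{Function}(1) for _ in 2:nthreads()]
const DONE = [Channel{Nothing}(1) for _ in 2:nthreads()]

# Spawn one persistent worker per channel pair.
for (work, done) in zip(WORK, DONE)
    Threads.@spawn begin
        while true
            f = take!(work)   # block until work arrives
            f()
            put!(done, nothing)
        end
    end
end

# Submit a job to worker 1 and wait for completion.
acc = Ref(0.0)
put!(WORK[1], () -> (acc[] = sum(abs2, 1:10^6)))
take!(DONE[1])
acc[]
```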
I ran a smaller range of benchmarks on my laptop (4 cores), and the results look a lot more like Tullio's. Except that my CPU is too new for OpenBLAS, so...
I'll do so when I finish tuning. BLAS does not support integer matrices, so anything based on LoopVectorization will be very fast compared to alternatives. `LinearAlgebra.mul!` will call a generic...
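As a sketch of why: for integer element types there is no BLAS kernel to dispatch to, so even a simple LoopVectorization loop compares well against the generic fallback (the macro name below assumes a recent LoopVectorization; older versions spell it `@avx`):
```julia
# Integer matmul with LoopVectorization vs. the generic fallback used by
# LinearAlgebra.mul! when no BLAS routine exists for the element type.
using LinearAlgebra, LoopVectorization, BenchmarkTools

function imul!(C, A, B)
    @turbo for j in axes(B, 2), i in axes(A, 1)
        acc = zero(eltype(C))
        for k in axes(A, 2)
            acc += A[i, k] * B[k, j]
        end
        C[i, j] = acc
    end
    return C
end

A = rand(Int32(-100):Int32(100), 200, 200);
B = rand(Int32(-100):Int32(100), 200, 200);
C = similar(A);

@btime mul!($C, $A, $B);   # generic, non-BLAS path for integers
@btime imul!($C, $A, $B);  # LoopVectorization kernel
```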
It's true in general. MKL does not either. FWIW, I also use the OpenBLAS that ships with OpenBLAS_jll for the benchmarks.
I uploaded much better-looking multithreading benchmarks [here](https://chriselrod.github.io/PaddedMatrices.jl/dev/arches/cascadelake/#Cascadelake) a few days ago, but I've still been experimenting with further improvements.
Also, RAM requirements for dense matrices of various sizes:
```julia
...
```
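The table itself is cut off, but the arithmetic behind it is straightforward: an M×N `Float64` matrix needs 8·M·N bytes, and a gemm benchmark holds three of them. A quick sketch:
```julia
# Memory needed for A, B, and C in a Float64 gemm at square size n:
# 3 matrices * n^2 elements * 8 bytes each, in GiB.
gemm_gib(n) = 3 * n^2 * sizeof(Float64) / 2^30

gemm_gib(1_000)    # ≈ 0.022 GiB
gemm_gib(10_000)   # ≈ 2.235 GiB
gemm_gib(40_000)   # ≈ 35.8 GiB
```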