Chris Elrod
But an example of what I'm seeing:
```julia
julia> include("/home/chriselrod/.julia/dev/PaddedMatrices/benchmark/blasbench.jl")
plot (generic function with 3 methods)

julia> mkl_set_num_threads(18)

julia> openblas_set_num_threads(18)

julia> M = K = N = 80;

julia> A...
```
But to be fair, performance isn't great at larger sizes either, so I can't blame everything on overhead. lol
And trying `10_000`x`10_000`:
```julia
julia> M = K = N...
```
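The `blasbench.jl` script isn't reproduced here (`mkl_set_num_threads`/`openblas_set_num_threads` are helpers it defines), so here is a minimal sketch of the kind of small-size comparison above, using only `BenchmarkTools` and Julia's bundled OpenBLAS; the 18-thread setting and sizes are illustrative:
```julia
# Minimal sketch, not the actual blasbench.jl script: time OpenBLAS gemm
# at a small size where threading overhead dominates the work.
using LinearAlgebra, BenchmarkTools

M = K = N = 80
A = rand(M, K); B = rand(K, N); C = similar(A, M, N);

BLAS.set_num_threads(1)
@btime mul!($C, $A, $B);   # single-threaded baseline

BLAS.set_num_threads(18)   # assumes an 18-core machine, as in the post
@btime mul!($C, $A, $B);   # overhead is large relative to the work at 80x80
```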
I ran those benchmarks last night. I'm now creating a library to better organize BLAS benchmarks:
 https://github.com/chriselrod/BLASBenchmarks.jl
Multi-threaded. The plot you shared only goes up to 10^3 = 1000; Tullio only starts lagging beyond that point on my computer. It's impressive how close Tullio was in that benchmark. Note...
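For reference, the Tullio matmul being benchmarked is presumably something along these lines (a sketch, not the exact benchmark code):
```julia
# Sketch of a Tullio-based matmul; Tullio multithreads and uses
# LoopVectorization's kernels when LoopVectorization is also loaded.
using Tullio, LoopVectorization

function tmul!(C, A, B)
    @tullio C[i, j] = A[i, k] * B[k, j]
    return C
end

A = rand(1000, 1000); B = rand(1000, 1000); C = similar(A);
tmul!(C, A, B)
C ≈ A * B  # sanity check against the BLAS result
```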
Also, another experiment: replicating the Task API with channels:
```julia
using Base.Threads
nthreads() > 1 || exit()
const FCHANNEL = [Channel{Ptr{Cvoid}}(1) for _ ∈ 2:nthreads()];
function crun(chn)
    f = take!(chn)...
```
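The excerpt is cut off; the idea, roughly, is one long-lived worker task per extra thread blocking on a `Channel`, with work submitted via `put!`. A self-contained sketch of that pattern (closures instead of the post's `Ptr{Cvoid}` function pointers, for simplicity):
```julia
# Sketch of channel-fed worker tasks: each worker blocks on take! and runs
# whatever closure it receives. This illustrates the pattern, not the
# actual FCHANNEL/crun implementation from the post.
using Base.Threads
nthreads() > 1 || error("start Julia with -t2 or more")

const WORK = [Channel{Function}(1) for _ in 2:nthreads()]
const DONE = [Channel{Nothing}(1) for _ in 2:nthreads()]

# Spawn one persistent worker per channel pair.
for (work, done) in zip(WORK, DONE)
    Threads.@spawn begin
        while true
            f = take!(work)   # block until work arrives
            f()
            put!(done, nothing)
        end
    end
end

# Submit a job to worker 1 and wait for completion.
acc = Ref(0.0)
put!(WORK[1], () -> (acc[] = sum(abs2, 1:10^6)))
take!(DONE[1])
acc[]
```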
I ran a smaller range of benchmarks on my laptop (4 cores), and the results look a lot more like Tullio's. Except that my CPU is too new for OpenBLAS, so...
I'll do so when I finish tuning. BLAS does not support integer matrices, so anything based on LoopVectorization will be very fast compared to alternatives. `LinearAlgebra.mul!` will call a generic...
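As a sketch of why: for integer element types there is no BLAS kernel to dispatch to, so even a simple LoopVectorization loop compares well against the generic fallback (the macro name below assumes a recent LoopVectorization; older versions spell it `@avx`):
```julia
# Integer matmul with LoopVectorization vs. the generic fallback used by
# LinearAlgebra.mul! when no BLAS routine exists for the element type.
using LinearAlgebra, LoopVectorization, BenchmarkTools

function imul!(C, A, B)
    @turbo for j in axes(B, 2), i in axes(A, 1)
        acc = zero(eltype(C))
        for k in axes(A, 2)
            acc += A[i, k] * B[k, j]
        end
        C[i, j] = acc
    end
    return C
end

A = rand(Int32(-100):Int32(100), 200, 200);
B = rand(Int32(-100):Int32(100), 200, 200);
C = similar(A);

@btime mul!($C, $A, $B);   # generic, non-BLAS path for integers
@btime imul!($C, $A, $B);  # LoopVectorization kernel
```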
It's true in general. MKL does not either. FWIW, I also use the OpenBLAS that ships with OpenBLAS_jll for the benchmarks.
I uploaded much better-looking multithreading benchmarks [here](https://chriselrod.github.io/PaddedMatrices.jl/dev/arches/cascadelake/#Cascadelake) a few days ago, but I've still been experimenting with further improvements.
Also, RAM requirements for dense matrices of various sizes:
```julia
...
```
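The table itself is cut off, but the arithmetic behind it is straightforward: an M×N `Float64` matrix needs 8·M·N bytes, and a gemm benchmark holds three of them. A quick sketch:
```julia
# Memory needed for A, B, and C in a Float64 gemm at square size n:
# 3 matrices * n^2 elements * 8 bytes each, in GiB.
gemm_gib(n) = 3 * n^2 * sizeof(Float64) / 2^30

gemm_gib(1_000)    # ≈ 0.022 GiB
gemm_gib(10_000)   # ≈ 2.235 GiB
gemm_gib(40_000)   # ≈ 35.8 GiB
```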