Chris Elrod comments

Results: 832 comments of Chris Elrod


Matrix Multiplication benchmark analysis

> The Clang generated assembly looks pretty good. But it's still slow. 0.135 is decent, that is 92% peak.
>
> I thought I got 100% of the peak on...

Matrix Multiplication benchmark analysis

Doubling the FMA units on Cascadelake (and sacrificing a lot of out of order capabilities, a bit of cache size, etc...):
```julia
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
       foreachf(AmulBplusC!, 100_000, C0, A,...
```


Matrix Multiplication benchmark analysis

Matrix Multiplication benchmark analysis

Matrix Multiplication benchmark analysis

Matrix Multiplication benchmark analysis

`@avxt` harms the performance of `.Threads`

`@avxt` harms the performance of `.Threads`

`@avxt` harms the performance of `.Threads`

`@avxt` harms the performance of `.Threads`

`@avxt` harms the performance of `.Threads`

`@avxt` harms the performance of `.Threads`