Chris Elrod

Results 832 comments of Chris Elrod
> The Clang generated assembly looks pretty good. But it's still slow. 0.135 is decent, that is 92% peak. > > I thought I got 100% of the peak on...

Doubling the FMA units on Cascade Lake (and sacrificing a lot of out-of-order capabilities, a bit of cache size, etc.): ```julia julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin foreachf(AmulBplusC!, 100_000, C0, A,...

> The Clang results hint that `gemm_nkm` is probably exactly written to be optimal and to vectorize over `m`: The inner most loop in the two assembly examples I posted...

> I thought I got 100% of the peak on Intel hardware about 5 years ago. However the matrices might have been larger. For smaller matrices (and 64 I think...

Which libraries is this using? How can I reproduce it? Searching JuliaHub for `Categorical_Cross_Entropy_Loss` shows 0 hits. EDIT: https://github.com/SkyWorld117/YisyAIFramework.jl I was going to say that it's unfortunately expected that using...

But I'll also try something that should at least limit the slowdowns.

Switching back and forth between using CheapThreads's threads (which LoopVectorization uses) and base threads is also expected to cause performance problems, not just nesting.

I've added `batch` with the option to reserve threads, but if you're not nesting threads, the ordinary `batch` method in place of `@threads` would be fine. If you do try...

If you `Ctrl+c` and it doesn't crash Julia, you could run

```julia
julia> using ThreadingUtilities

julia> ThreadingUtilities.TASKS
```

If one of them, say the third, says it failed: ```julia julia> ThreadingUtilities.TASKS[3]...

Just to be a broken record, you should avoid indexes like this in general whenever you can, especially if `x` is likely to be `1`:

```julia
(i-1)*x
```

SIMD means...