Chris Elrod

Results 832 comments of Chris Elrod

> However your instruction also reads from memory, I am not sure what the cost is there.

https://uops.info/html-instr/VFMADD231PD_ZMM_ZMM_M512.html Theoretically 0.5 "calculated from port usage", but measured at 0.55. However, this...

[uiCA is a great tool for Intel architectures](https://bit.ly/3BRoO32). It estimates that the inner loop should take 4.5 cycles on Cascade Lake, or 8 on Tiger Lake. These correspond to about...

> I also tested GFortran: [Here](https://github.com/JuliaSIMD/LoopVectorization.jl/blob/master/benchmark/looptests.f90) is the Fortran code I used for the benchmarks. Simple improvements/corrections are welcome, but the point of course is seeing how well the compilers...

Tiger Lake makes it very easy to get a high percentage of peak, because it is heavily bottlenecked by a lack of (FMA) execution resources:

```julia
julia> using LinuxPerf

julia> foreachf(f::F,...
```

> Ok, so the vectorization goes roughly like this if I understand it correctly. You start from: > And the cost is 1 fma, and 1 memory read (R). For...

I'm away from home for the weekend, so I can't promise that I'll be able to fix this until I get back. You have `t = times[ii]`, but I don't...

`@turbo` made integers static, which now seems like a bad choice given `StaticInt` is no longer an integer.

Two non-mutually exclusive options:

1. Add an `ArrayInterface.firstindex` and `ArrayInterface.lastindex` function that accepts `StaticInt` and have LoopVectorization import it and substitute calls, like it does for `axes`.
2. Use a...

> What was the reason for not making `StaticInt` an `Integer` anymore? This seems to be the basic root of the problem here.

Invalidations / improve loading time. https://github.com/SciML/Static.jl/pull/64 It...

So, the five checks are...

1. Bounds checks. Seems reasonable.
2. Already supported. Use `@turbo check_empty=true for ...`. The overhead for this should be extremely small/hard to measure.
3. Execution...
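
As a minimal sketch of the `check_empty=true` option mentioned in item 2: the function name `turbo_sum` and the summation loop are hypothetical, chosen only for illustration, and are not taken from the thread. The sketch assumes LoopVectorization.jl is installed.

```julia
using LoopVectorization

# Hypothetical helper (not from the thread): sums a vector with `@turbo`,
# asking the macro to check for empty iterators before running the loop.
function turbo_sum(x::AbstractVector{Float64})
    s = 0.0
    @turbo check_empty=true for i in eachindex(x)
        s += x[i]
    end
    return s
end

turbo_sum(Float64[])         # empty input: the loop body is skipped
turbo_sum([1.0, 2.0, 3.0])   # ordinary non-empty input
```

Without `check_empty=true`, `@turbo` does not guard against empty iterators by default, so the flag is worth setting whenever a loop may legitimately receive a zero-length array.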