Arseny Kapoulkine
To add to the list, compare for inequality takes 3 instructions, but compare for equality takes 1. I've run into this in the context of LLVM strength-reducing one to the other: it...
FWIW from my point of view, this is a good addition. In native code [I am used to], the non-determinism *between* different compilers / architectures is a given - different...
> In my tests, FMA instructions are always slower under x64. This obviously depends on the workload. In (some) floating point heavy code that I see, FMA results in performance...
Here's a motivating example from my experiments (caveats apply, ymmv, etc.): This is an implementation of a slerp optimization from https://zeux.io/2015/07/23/approximating-slerp/ + https://zeux.io/2016/05/05/optimizing-slerp/ for fitted nlerp (the middle ground...
> Beware that your baseline is in SSE while fast-math+fma is in AVX2. Thanks! Good catch, I forgot about this. I've updated the post to include fast-math avx2 (55.82 cycles)....
With fmadd having a latency of 5 on older CPUs, you still save latency, because the mul+add pair has a combined latency of 6 cycles due to the dependency. You...
(sorry, I know you're all tired of me and my "constants are awesome" comments) fabsf/fneg are a single instruction (two uops) with a memory load. You only need a single constant...
If you don't like constants, alternative lowering options: fneg: xorps (to produce zero) + subps; 2 insns if you can't find a ready-to-use zero register, 1 if you...
FWIW something along the lines of i8x16.byteshuffle is a feature I'd love to see in wasm. As mentioned in the OP, it is supported by SSSE3 on the x86 side via...
In the linked PR, @tlively and @dtig requested more precise benchmark numbers that are more specific to byte shuffles. To that end, I've implemented an SSE2 fallback for the decoder...