[mono] Add Vector128 Sum intrinsic for amd64
Add support for the following Vector128 APIs:
- Sum: It doesn't support byte and sbyte types yet. For the i64 type it generates an instruction sequence rather than a single intrinsic, but the assembly generated is still significantly smaller than without it.
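For context, this is the API being intrinsified; a minimal usage sketch (values made up for illustration):

```csharp
using System;
using System.Runtime.Intrinsics;

class SumExample
{
    static void Main()
    {
        // Vector128.Sum reduces all lanes of the vector to a single scalar.
        Vector128<float> v = Vector128.Create(1.0f, 2.0f, 3.0f, 4.0f);
        Console.WriteLine(Vector128.Sum(v)); // prints 10
    }
}
```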
I'm nitpicking here. For f32, this horizontal sum boils down to:
```asm
haddps xmm0, xmm0 ; ICL (p01 2p5) lat=6, thr=1/2 ; Zen3 lat=6 thr=1/2
haddps xmm0, xmm0 ; ICL (p01 2p5) lat=6, thr=1/2 ; Zen3 lat=6 thr=1/2
```
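Written with C# hardware intrinsics (a sketch of the equivalent managed code, not necessarily what Mono emits), that sequence is:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class HorizontalSum
{
    // Two-haddps reduction: lane 0 ends up as (a0 + a1) + (a2 + a3).
    // Assumes the caller has checked Sse3.IsSupported.
    internal static float HaddSum(Vector128<float> v)
    {
        v = Sse3.HorizontalAdd(v, v); // haddps xmm0, xmm0
        v = Sse3.HorizontalAdd(v, v); // haddps xmm0, xmm0
        return v.ToScalar();
    }
}
```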
The haddps instruction has a latency of 6 both on ICL/TGL and Zen3. This could be slightly improved by eliminating the first haddps:
```asm
xorps xmm1, xmm1 ; ICL, Zen3 - dependency-breaker (probably lat=0)
movhlps xmm1, xmm0 ; ICL (p5) lat=1, thr=1 ; Zen3 lat=1, thr=2
addps xmm0, xmm1 ; ICL (p01) lat=4, thr=2 ; Zen3 lat=3, thr=2
haddps xmm0, xmm0 ; ICL (p01 2p5) lat=6, thr=1/2 ; Zen3 lat=6 thr=1/2
```
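As a C# intrinsics sketch (again assuming Sse3.IsSupported has been checked), the alternative sequence would be:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class HorizontalSumAlt
{
    // movhlps/addps/haddps reduction: lane 0 ends up as (a0 + a2) + (a1 + a3).
    internal static float MovhlpsSum(Vector128<float> v)
    {
        // movhlps: upper two lanes of v move into the lower two lanes -> (a2, a3, 0, 0)
        Vector128<float> hi = Sse.MoveHighToLow(Vector128<float>.Zero, v);
        v = Sse.Add(v, hi);           // addps: (a0+a2, a1+a3, a2, a3)
        v = Sse3.HorizontalAdd(v, v); // haddps: lane 0 = (a0+a2) + (a1+a3)
        return v.ToScalar();
    }
}
```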
The resulting code is longer, but has a lower total latency and puts less pressure on Intel's port 5.
Still, a horizontal add probably won't be executed in an inner loop, so saving 1-2 clocks of latency is not significant. And this would probably have to be measured, too.
> The resulting code is longer, but has a lower total latency and puts less pressure on Intel's port 5.
I expect the longer code will have an overall net-negative impact in loops since it takes up 2x the space, produces a 3-instruction dependency chain, and will likewise take up additional micro-ops in the decoder.
We also have to be careful because this can be non-deterministic otherwise. For floating-point, (a + b) + c != a + (b + c), and so doing a[0] + a[1] + a[2] + a[3] for the scalar, (a[0] + a[1]) + (a[2] + a[3]) for 2x hadd, or (a[0] + a[2]) + (a[1] + a[3]) for shuffle, add, hadd may all produce different results.
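A concrete illustration, with made-up values chosen to force different roundings; here all three association orders disagree:

```csharp
using System;

class FloatAssociativity
{
    static void Main()
    {
        float a0 = 1e8f, a1 = 0.5f, a2 = -1e8f, a3 = 0.5f;

        // 0.5f vanishes whenever it is added directly to +/-1e8f,
        // because it is below half an ULP at that magnitude.
        float scalar  = a0 + a1 + a2 + a3;     // left-to-right: 0.5
        float hadd2x  = (a0 + a1) + (a2 + a3); // 2x hadd order: 0
        float shuffle = (a0 + a2) + (a1 + a3); // shuffle/add/hadd order: 1

        Console.WriteLine($"{scalar} {hadd2x} {shuffle}");
    }
}
```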
/azp run runtime-extra-platforms
Azure Pipelines successfully started running 1 pipeline(s).