[mono] Add Vector128 Sum intrinsic for amd64
Add support for the following Vector128 APIs:
- Sum: It doesn't support byte and sbyte types yet. For the i64 type it generates an instruction sequence rather than a single intrinsic, but the assembly generated is still significantly smaller than without it.
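For context, this is the API being intrinsified; a minimal usage sketch (values made up for illustration):

```csharp
using System;
using System.Runtime.Intrinsics;

class SumExample
{
    static void Main()
    {
        // Vector128.Sum reduces all lanes of the vector to a single scalar.
        Vector128<float> v = Vector128.Create(1.0f, 2.0f, 3.0f, 4.0f);
        Console.WriteLine(Vector128.Sum(v)); // prints 10
    }
}
```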
I'm nitpicking here. For f32, this horizontal sum boils down to:
```asm
haddps xmm0, xmm0 ; ICL (p01 2p5) lat=6, thr=1/2 ; Zen3 lat=6 thr=1/2
haddps xmm0, xmm0 ; ICL (p01 2p5) lat=6, thr=1/2 ; Zen3 lat=6 thr=1/2
```
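Written with C# hardware intrinsics (a sketch of the equivalent managed code, not necessarily what Mono emits), that sequence is:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class HorizontalSum
{
    // Two-haddps reduction: lane 0 ends up as (a0 + a1) + (a2 + a3).
    // Assumes the caller has checked Sse3.IsSupported.
    internal static float HaddSum(Vector128<float> v)
    {
        v = Sse3.HorizontalAdd(v, v); // haddps xmm0, xmm0
        v = Sse3.HorizontalAdd(v, v); // haddps xmm0, xmm0
        return v.ToScalar();
    }
}
```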
The haddps instruction has a latency of 6 both on ICL/TGL and Zen3. This could be slightly improved by eliminating the first haddps:
```asm
xorps xmm1, xmm1 ; ICL, Zen3 - dependency-breaker (probably lat=0)
movhlps xmm1, xmm0 ; ICL (p5) lat=1, thr=1 ; Zen3 lat=1, thr=2
addps xmm0, xmm1 ; ICL (p01) lat=4, thr=2 ; Zen3 lat=3, thr=2
haddps xmm0, xmm0 ; ICL (p01 2p5) lat=6, thr=1/2 ; Zen3 lat=6 thr=1/2
```
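As a C# intrinsics sketch (again assuming Sse3.IsSupported has been checked), the alternative sequence would be:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class HorizontalSumAlt
{
    // movhlps/addps/haddps reduction: lane 0 ends up as (a0 + a2) + (a1 + a3).
    internal static float MovhlpsSum(Vector128<float> v)
    {
        // movhlps: upper two lanes of v move into the lower two lanes -> (a2, a3, 0, 0)
        Vector128<float> hi = Sse.MoveHighToLow(Vector128<float>.Zero, v);
        v = Sse.Add(v, hi);           // addps: (a0+a2, a1+a3, a2, a3)
        v = Sse3.HorizontalAdd(v, v); // haddps: lane 0 = (a0+a2) + (a1+a3)
        return v.ToScalar();
    }
}
```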
The resulting code is longer, but has a lower total latency and puts less pressure on Intel's port 5.
Still, a horizontal add probably won't be executed in an inner loop, so saving 1-2 clocks of latency is not significant. And this would probably have to be measured, too.
> The resulting code is longer, but has a lower total latency and puts less pressure on Intel's port 5.
I expect the longer code will have an overall net-negative impact in loops since it takes up 2x the space, produces a 3-instruction dependency chain, and will likewise take up additional micro-ops in the decoder.
We also have to be careful because this can be non-deterministic otherwise. For floating-point, (a + b) + c != a + (b + c), and so doing a[0] + a[1] + a[2] + a[3] for the scalar, (a[0] + a[1]) + (a[2] + a[3]) for 2x hadd, or (a[0] + a[2]) + (a[1] + a[3]) for shuffle, add, hadd may all produce different results.
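A concrete illustration, with made-up values chosen to force different roundings; here all three association orders disagree:

```csharp
using System;

class FloatAssociativity
{
    static void Main()
    {
        float a0 = 1e8f, a1 = 0.5f, a2 = -1e8f, a3 = 0.5f;

        // 0.5f vanishes whenever it is added directly to +/-1e8f,
        // because it is below half an ULP at that magnitude.
        float scalar  = a0 + a1 + a2 + a3;     // left-to-right: 0.5
        float hadd2x  = (a0 + a1) + (a2 + a3); // 2x hadd order: 0
        float shuffle = (a0 + a2) + (a1 + a3); // shuffle/add/hadd order: 1

        Console.WriteLine($"{scalar} {hadd2x} {shuffle}");
    }
}
```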
/azp run runtime-extra-platforms
Azure Pipelines successfully started running 1 pipeline(s).