runtime
runtime copied to clipboard
Vector128.Min<int> produces sub-optimal code on x64
Repro:
[MethodImpl(MethodImplOptions.NoInlining)]
static Vector128<int> Foo(Vector128<int> a, Vector128<int> b)
{
return Vector128.Min(a, b);
}
Actual:
G_M000_IG01: ;; offset=0000H
C5F877 vzeroupper
G_M000_IG02: ;; offset=0003H
C5F91002 vmovupd xmm0, xmmword ptr [rdx]
C4C1791008 vmovupd xmm1, xmmword ptr [r8]
C5F166D0 vpcmpgtd xmm2, xmm1, xmm0
C4E3714CC020 vpblendvb xmm0, xmm1, xmm0, xmm2
C5F91101 vmovupd xmmword ptr [rcx], xmm0
488BC1 mov rax, rcx
G_M000_IG03: ;; offset=001DH
C3 ret
Expected (when SSE4.1 is available which is reasonable):
G_M000_IG01: ;; offset=0000H
C5F877 vzeroupper
G_M000_IG02: ;; offset=0003H
C5F91002 vmovupd xmm0, xmmword ptr [rdx]
C4C2793900 vpminsd xmm0, xmm0, xmmword ptr [r8]
C5F91101 vmovupd xmmword ptr [rcx], xmm0
488BC1 mov rax, rcx
G_M000_IG03: ;; offset=0013H
C3 ret
So a vpminsd
instead of a vpcmpgtd + vpblendvb
combo.
Note: in SSE2 there's no overload for int
, only for SSE4.1.
I don't know how it is in Arm64.
Same for Vector128.Max<int>
(didn't check others).
Organizational question: do you prefere that for each such findings an issue is to be opened or is there a more holistically way for this?
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Issue Details
Repro:
[MethodImpl(MethodImplOptions.NoInlining)]
static Vector128<int> Foo(Vector128<int> a, Vector128<int> b)
{
return Vector128.Min(a, b);
}
Actual:
G_M000_IG01: ;; offset=0000H
C5F877 vzeroupper
G_M000_IG02: ;; offset=0003H
C5F91002 vmovupd xmm0, xmmword ptr [rdx]
C4C1791008 vmovupd xmm1, xmmword ptr [r8]
C5F166D0 vpcmpgtd xmm2, xmm1, xmm0
C4E3714CC020 vpblendvb xmm0, xmm1, xmm0, xmm2
C5F91101 vmovupd xmmword ptr [rcx], xmm0
488BC1 mov rax, rcx
G_M000_IG03: ;; offset=001DH
C3 ret
Expected (when SSE4.1 is available which is reasonable):
G_M000_IG01: ;; offset=0000H
C5F877 vzeroupper
G_M000_IG02: ;; offset=0003H
C5F91002 vmovupd xmm0, xmmword ptr [rdx]
C4C2793900 vpminsd xmm0, xmm0, xmmword ptr [r8]
C5F91101 vmovupd xmmword ptr [rcx], xmm0
488BC1 mov rax, rcx
G_M000_IG03: ;; offset=0013H
C3 ret
So a vpminsd
instead of a vpcmpgtd + vpblendvb
combo.
Note: in SSE2 there's no overload for int
, only for SSE4.1.
I don't know how it is in Arm64.
Organizational question: do you prefere that for each such findings an issue is to be opened or is there a more holistically way for this?
Author: | gfoidl |
---|---|
Assignees: | - |
Labels: |
|
Milestone: | - |
I don't know how it is in Arm64.
Seems to be fine on ARM64 🙂
G_M39000_IG01: ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
A9BF7BFD stp fp, lr, [sp, #-0x10]!
910003FD mov fp, sp
;; size=8 bbWeight=1 PerfScore 1.50
G_M39000_IG02: ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
4EA16C00 smin v0.4s, v0.4s, v1.4s
;; size=4 bbWeight=1 PerfScore 1.00
G_M39000_IG03: ; , epilog, nogc, extend
A8C17BFD ldp fp, lr, [sp], #0x10
D65F03C0 ret lr
;; size=8 bbWeight=1 PerfScore 2.00
Organizational question: do you prefere that for each such findings an issue is to be opened or is there a more holistically way for this?
I think they're trivial to fix in just one PR so a single issue would be enough I guess? depends on how many issues you found
I fixed up some cases when porting the Vector<T>
logic to be reimplemented using HWIntrinsics, but not nearly everything.
I'm sure there are a few more cases like this where we do something slightly suboptimal.
depends on how many issues you found
At the moment no further (hopefully this remains so 😉). For the other findins, where no issues existed, I created one (e.g. last week a couple).
a few more cases like this where we do something slightly suboptimal.
Maybe then one issue where such findings can be dropped and collected? From that sub-issues could be created if needed.
On portings some code from plat-specific to xplat and checking the codegen some things fall out. I'd like to report these as I find them, so that you are aware of (or can confirm that it is known), so that this can be fixed.