runtime icon indicating copy to clipboard operation
runtime copied to clipboard

Vector128.Min<int> produces sub-optimal code on x64

Open gfoidl opened this issue 1 year ago • 5 comments

Repro:

[MethodImpl(MethodImplOptions.NoInlining)]
static Vector128<int> Foo(Vector128<int> a, Vector128<int> b)
{
    return Vector128.Min(a, b);
}

Actual:

G_M000_IG01:                ;; offset=0000H
       C5F877               vzeroupper

G_M000_IG02:                ;; offset=0003H
       C5F91002             vmovupd  xmm0, xmmword ptr [rdx]
       C4C1791008           vmovupd  xmm1, xmmword ptr [r8]
       C5F166D0             vpcmpgtd xmm2, xmm1, xmm0
       C4E3714CC020         vpblendvb xmm0, xmm1, xmm0, xmm2
       C5F91101             vmovupd  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx

G_M000_IG03:                ;; offset=001DH
       C3                   ret

Expected (when SSE4.1 is available which is reasonable):

G_M000_IG01:                ;; offset=0000H
       C5F877               vzeroupper

G_M000_IG02:                ;; offset=0003H
       C5F91002             vmovupd  xmm0, xmmword ptr [rdx]
       C4C2793900           vpminsd  xmm0, xmm0, xmmword ptr [r8]
       C5F91101             vmovupd  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx

G_M000_IG03:                ;; offset=0013H
       C3                   ret

So a vpminsd instead of a vpcmpgtd + vpblendvb combo. Note: in SSE2 there's no overload for int, only for SSE4.1.

I don't know how it is in Arm64.

Same for Vector128.Max<int> (didn't check others).


Organizational question: do you prefere that for each such findings an issue is to be opened or is there a more holistically way for this?

gfoidl avatar Sep 20 '22 13:09 gfoidl

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

Issue Details

Repro:

[MethodImpl(MethodImplOptions.NoInlining)]
static Vector128<int> Foo(Vector128<int> a, Vector128<int> b)
{
    return Vector128.Min(a, b);
}

Actual:

G_M000_IG01:                ;; offset=0000H
       C5F877               vzeroupper

G_M000_IG02:                ;; offset=0003H
       C5F91002             vmovupd  xmm0, xmmword ptr [rdx]
       C4C1791008           vmovupd  xmm1, xmmword ptr [r8]
       C5F166D0             vpcmpgtd xmm2, xmm1, xmm0
       C4E3714CC020         vpblendvb xmm0, xmm1, xmm0, xmm2
       C5F91101             vmovupd  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx

G_M000_IG03:                ;; offset=001DH
       C3                   ret

Expected (when SSE4.1 is available which is reasonable):

G_M000_IG01:                ;; offset=0000H
       C5F877               vzeroupper

G_M000_IG02:                ;; offset=0003H
       C5F91002             vmovupd  xmm0, xmmword ptr [rdx]
       C4C2793900           vpminsd  xmm0, xmm0, xmmword ptr [r8]
       C5F91101             vmovupd  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx

G_M000_IG03:                ;; offset=0013H
       C3                   ret

So a vpminsd instead of a vpcmpgtd + vpblendvb combo. Note: in SSE2 there's no overload for int, only for SSE4.1.

I don't know how it is in Arm64.


Organizational question: do you prefere that for each such findings an issue is to be opened or is there a more holistically way for this?

Author: gfoidl
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

msftbot[bot] avatar Sep 20 '22 13:09 msftbot[bot]

I don't know how it is in Arm64.

Seems to be fine on ARM64 🙂

G_M39000_IG01:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
        A9BF7BFD          stp     fp, lr, [sp, #-0x10]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M39000_IG02:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
        4EA16C00          smin    v0.4s, v0.4s, v1.4s
						;; size=4 bbWeight=1    PerfScore 1.00
G_M39000_IG03:        ; , epilog, nogc, extend
        A8C17BFD          ldp     fp, lr, [sp], #0x10
        D65F03C0          ret     lr
						;; size=8 bbWeight=1    PerfScore 2.00

EgorBo avatar Sep 20 '22 14:09 EgorBo

Organizational question: do you prefere that for each such findings an issue is to be opened or is there a more holistically way for this?

I think they're trivial to fix in just one PR so a single issue would be enough I guess? depends on how many issues you found

EgorBo avatar Sep 20 '22 14:09 EgorBo

I fixed up some cases when porting the Vector<T> logic to be reimplemented using HWIntrinsics, but not nearly everything.

I'm sure there are a few more cases like this where we do something slightly suboptimal.

tannergooding avatar Sep 20 '22 14:09 tannergooding

depends on how many issues you found

At the moment no further (hopefully this remains so 😉). For the other findins, where no issues existed, I created one (e.g. last week a couple).

a few more cases like this where we do something slightly suboptimal.

Maybe then one issue where such findings can be dropped and collected? From that sub-issues could be created if needed.

On portings some code from plat-specific to xplat and checking the codegen some things fall out. I'd like to report these as I find them, so that you are aware of (or can confirm that it is known), so that this can be fixed.

gfoidl avatar Sep 20 '22 14:09 gfoidl