rav1e icon indicating copy to clipboard operation
rav1e copied to clipboard

Add AVX2 implementation of `first_max_element`

Open redzic opened this issue 3 years ago • 11 comments

This also now requires BMI1 and BMI2 for AVX2 in CpuFeatureLevel (for tzcnt).

redzic avatar Sep 15 '21 19:09 redzic

Resulting assembly:

example::first_max_element_avx2:
        vmovdqu xmm0, xmmword ptr [rdi]
        vpmaxsd xmm0, xmm0, xmmword ptr [rdi + 16]
        vpshufd xmm1, xmm0, 238
        vpmaxsd xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 85
        vpmaxsd xmm0, xmm0, xmm1
        vmovd   edx, xmm0
        vpbroadcastd    ymm0, xmm0
        vpcmpeqd        ymm0, ymm0, ymmword ptr [rdi]
        vmovmskps       eax, ymm0
        tzcnt   eax, eax
        vzeroupper
        ret

tdaede avatar Sep 15 '21 21:09 tdaede

It looks like the autovectorizer fails to use ymm registers for the max. Maybe an intrinsic is needed there too?

tdaede avatar Sep 15 '21 21:09 tdaede

Coverage Status

Coverage decreased (-0.1%) to 83.273% when pulling cff1af907ffecdcfc0a45fde17472084f4492645 on redzic:find-first-max-avx2 into 2ec4e675b0298c3513110cd67f8d229112feb468 on xiph:master.

coveralls avatar Sep 15 '21 21:09 coveralls

It looks like the autovectorizer fails to use ymm registers for the max. Maybe an intrinsic is needed there too?

I don't think it's possible to use ymm registers for getting the max of 8 32-bit integers, or if it is there probably wouldn't be any benefit.

As far as I understand, this is how the algorithm LLVM uses works:

max:
        ; rdi is a pointer to the first element, and it is statically known that there are 8 elements

        ; load the first 4 elements of the slice into xmm0
        vmovdqu xmm0, xmmword ptr [rdi]

        ; do a packed max the last 4 elements of the slice (rdi + 16)
        ; basically, xmm0 now looks like this (where x represents the original 8-element array):
        ; xmm0 = [max(x[0], x[4]), max(x[1], x[5]), max(x[2], x[6]), max(x[3], x[7])]
        ;
        ; at this point, we can forget about the entire array and just worry about finding the max of
        ; the 4 elements that are now in xmm0, since the max is definitely inside xmm0
        vpmaxsd xmm0, xmm0, xmmword ptr [rdi + 16]

        ; shuffle: xmm1 = xmm0[2,3,2,3]
        vpshufd xmm1, xmm0, 238

        ; packed max of xmm0 and xmm1, store the result in xmm0
        ; here is what the registers look like before this step:
        ; xmm0 = [x[0], x[1], x[2], x[3]]
        ; xmm1 = [x[2], x[3], x[2], x[3]]
        ; so the last 2 elements of xmm0 after this instruction are going to be irrelevant,
        ; and as you can see, the max will be either the first or second element of xmm0
        vpmaxsd xmm0, xmm0, xmm1

        ; shuffle : xmm1 = xmm0[1,1,1,1]
        ; xmm0 = [max(x[0], x[2]), max(x[1], x[3]), ...]
        ; xmm1 = [max(x[1], x[3]), max(x[1], x[3]), ...]
        ; (other elements omitted since we don't care about them any more)
        vpshufd xmm1, xmm0, 85

        ; packed max of xmm0 and xmm1, store the result in xmm0
        vpmaxsd xmm0, xmm0, xmm1

        ; now xmm0 = [max(max(x[0], x[2]), max(x[1], x[3])), ...]

        ; so move the first element of xmm0, which is the final max, into eax
        vmovd   eax, xmm0
        ret

redzic avatar Sep 15 '21 22:09 redzic

Ah yeah sorry, yeah the first max comparison can only be 4 vs 4.

tdaede avatar Sep 15 '21 22:09 tdaede

I'm confused. Isn't this already optimized from the assembly called from here? https://github.com/xiph/rav1e/blob/master/src/asm/x86/cdef.rs#L195

KyleSiefring avatar Sep 16 '21 02:09 KyleSiefring

@KyleSiefring I think you're right, I didn't notice this.

If I'm not mistaken, the actual assembly corresponding to this particular function seems to be in here: https://github.com/xiph/rav1e/blob/2ec4e675b0298c3513110cd67f8d229112feb468/src/x86/cdef_avx2.asm#L1773-L1785

However, I think there could still be at least some value for making first_max_element in the rust code take a &[i32; 8], as it allows the compiler to eliminate some bounds checks and generate less code.

redzic avatar Sep 16 '21 03:09 redzic

Could you please split the scalar &[i32; 8] part from the patch?

Your simd code would be good in the not-nasm situation but maybe it would need further discussion.

lu-zero avatar Sep 16 '21 09:09 lu-zero

Could we turn this into an implementation for a non-nasm platform, e.g. ARM/PPC/RISC-V, that doesn't implement CDEF yet?

tdaede avatar Sep 17 '21 20:09 tdaede

We probably can, I just have to figure out how to do that. I will mark the PR as WIP for now as I try to learn how to write the SIMD implementation for other platforms.

redzic avatar Sep 18 '21 00:09 redzic

@lu-zero @tdaede Now that std::simd has reached nightly, we could probably implement a portable SIMD version of this function with it, but gate it behind nightly, similar to how aarch64 intrinsics are currently being used.

redzic avatar Nov 16 '21 03:11 redzic