rav1e
rav1e copied to clipboard
Add AVX2 implementation of `first_max_element`
This also now requires BMI1 and BMI2 for AVX2 in CpuFeatureLevel
(for tzcnt).
Resulting assembly:
example::first_max_element_avx2:
vmovdqu xmm0, xmmword ptr [rdi]
vpmaxsd xmm0, xmm0, xmmword ptr [rdi + 16]
vpshufd xmm1, xmm0, 238
vpmaxsd xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 85
vpmaxsd xmm0, xmm0, xmm1
vmovd edx, xmm0
vpbroadcastd ymm0, xmm0
vpcmpeqd ymm0, ymm0, ymmword ptr [rdi]
vmovmskps eax, ymm0
tzcnt eax, eax
vzeroupper
ret
It looks like the autovectorizer fails to use ymm registers for the max. Maybe an intrinsic is needed there too?
Coverage decreased (-0.1%) to 83.273% when pulling cff1af907ffecdcfc0a45fde17472084f4492645 on redzic:find-first-max-avx2 into 2ec4e675b0298c3513110cd67f8d229112feb468 on xiph:master.
It looks like the autovectorizer fails to use ymm registers for the max. Maybe an intrinsic is needed there too?
I don't think it's possible to use ymm registers for getting the max of 8 32-bit integers, or if it is there probably wouldn't be any benefit.
As far as I understand, this is how the algorithm LLVM uses works:
max:
; rdi is a pointer to the first element, and it is statically known that there are 8 elements
; load the first 4 elements of the slice into xmm0
vmovdqu xmm0, xmmword ptr [rdi]
; do a packed max the last 4 elements of the slice (rdi + 16)
; basically, xmm0 now looks like this (where x represents the original 8-element array):
; xmm0 = [max(x[0], x[4]), max(x[1], x[5]), max(x[2], x[6]), max(x[3], x[7])]
;
; at this point, we can forget about the entire array and just worry about finding the max of
; the 4 elements that are now in xmm0, since the max is definitely inside xmm0
vpmaxsd xmm0, xmm0, xmmword ptr [rdi + 16]
; shuffle: xmm1 = xmm0[2,3,2,3]
vpshufd xmm1, xmm0, 238
; packed max of xmm0 and xmm1, store the result in xmm0
; here is what the registers look like before this step:
; xmm0 = [x[0], x[1], x[2], x[3]]
; xmm1 = [x[2], x[3], x[2], x[3]]
; so the last 2 elements of xmm0 after this instruction are going to be irrelevant,
; and as you can see, the max will be either the first or second element of xmm0
vpmaxsd xmm0, xmm0, xmm1
; shuffle : xmm1 = xmm0[1,1,1,1]
; xmm0 = [max(x[0], x[2]), max(x[1], x[3]), ...]
; xmm1 = [max(x[1], x[3]), max(x[1], x[3]), ...]
; (other elements omitted since we don't care about them any more)
vpshufd xmm1, xmm0, 85
; packed max of xmm0 and xmm1, store the result in xmm0
vpmaxsd xmm0, xmm0, xmm1
; now xmm0 = [max(max(x[0], x[2]), max(x[1], x[3])), ...]
; so move the first element of xmm0, which is the final max, into eax
vmovd eax, xmm0
ret
Ah yeah sorry, yeah the first max comparison can only be 4 vs 4.
I'm confused. Isn't this already optimized from the assembly called from here? https://github.com/xiph/rav1e/blob/master/src/asm/x86/cdef.rs#L195
@KyleSiefring I think you're right, I didn't notice this.
If I'm not mistaken, the actual assembly corresponding to this particular function seems to be in here: https://github.com/xiph/rav1e/blob/2ec4e675b0298c3513110cd67f8d229112feb468/src/x86/cdef_avx2.asm#L1773-L1785
However, I think there could still be at least some value for making first_max_element
in the rust code take a &[i32; 8]
, as it allows the compiler to eliminate some bounds checks and generate less code.
Could you please split the scalar &[i32; 8]
part from the patch?
Your simd code would be good in the not-nasm situation but maybe it would need further discussion.
Could we turn this into an implementation for a non-nasm platform, e.g. ARM/PPC/RISC-V, that doesn't implement CDEF yet?
We probably can, I just have to figure out how to do that. I will mark the PR as WIP for now as I try to learn how to write the SIMD implementation for other platforms.
@lu-zero @tdaede Now that std::simd
has reached nightly, we could probably implement a portable SIMD version of this function with it, but gate it behind nightly, similar to how aarch64 intrinsics are currently being used.