Add AVX2 assembly code for SAD
Based on "Add AVX2 assembly code for inter predict" (#51).
DMVR (decoder-side motion vector refinement) computes SAD on PUs with the following constraints:
- w >= 8, h >= 8, w*h >= 128
- only computed on even rows to reduce complexity
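For reference, the constraints above can be sketched in scalar C roughly as follows. This is only an illustration, not the actual ffvvc `vvc_sad` code: the function names, the signature, and the sample-based strides are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: 1 if a w x h PU satisfies the size constraints
 * listed above (w >= 8, h >= 8, w*h >= 128), 0 otherwise. */
static int dmvr_sad_size_ok(int w, int h)
{
    return w >= 8 && h >= 8 && w * h >= 128;
}

/* Hypothetical scalar SAD for 16-bit samples that only visits even rows.
 * Strides are given in samples (an assumption), not in bytes. */
static int sad_even_rows_16bpc(const int16_t *src0, ptrdiff_t stride0,
                               const int16_t *src1, ptrdiff_t stride1,
                               int w, int h)
{
    int sad = 0;

    for (int y = 0; y < h; y += 2) {          /* skip odd rows */
        const int16_t *r0 = src0 + y * stride0;
        const int16_t *r1 = src1 + y * stride1;

        for (int x = 0; x < w; x++)
            sad += abs(r0[x] - r1[x]);        /* operands promote to int */
    }
    return sad;
}
```

Skipping the odd rows halves the number of samples touched per block, which is where the complexity reduction comes from.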
VVC_HDR_UHDTV2_OpenGOP_7680x4320_50fps_HLG10_HighBitrate.bit
+ 11.94% 0.00% ffmpeg_g [unknown] [.] 0000000000000000
+ 7.02% 0.00% ffmpeg_g [unknown] [.] 0x7300747865746e6f
+ 7.02% 0.00% ffmpeg_g ffmpeg_g [.] context_to_name
+ 5.06% 0.76% ffmpeg_g ffmpeg_g [.] ff_vvc_reconstruct
4.69% 4.65% ffmpeg_g ffmpeg_g [.] vvc_sad
+ 4.54% 4.49% ffmpeg_g ffmpeg_g [.] vvc_deblock_bs
+ 4.11% 4.07% ffmpeg_g ffmpeg_g [.] dmvr_hv_10
VVC_HFR_UHDTV1_OpenGOP_3840x2160_100fps_SDR.bit
+ 10.07% 0.00% ffmpeg_g [unknown] [.] 0000000000000000
+ 7.71% 0.00% ffmpeg_g [unknown] [.] 0x7300747865746e6f
+ 7.71% 0.00% ffmpeg_g ffmpeg_g [.] context_to_name
+ 5.50% 5.48% ffmpeg_g libc.so.6 [.] __memmove_avx512_unaligned_erms
+ 4.93% 4.93% vf#0:0 ffmpeg_g [.] planarCopyWrapper
+ 4.89% 4.89% ffmpeg_g ffmpeg_g [.] vvc_deblock_bs
+ 4.10% 4.10% ffmpeg_g ffmpeg_g [.] dmvr_hv_10
4.00% 3.99% ffmpeg_g ffmpeg_g [.] vvc_sad
VVC_HFR_UHDTV2_OpenGOP_7680x4320_100fps_SDR.bit
+ 9.99% 0.00% ffmpeg_g [unknown] [.] 0000000000000000
+ 7.11% 0.00% ffmpeg_g [unknown] [.] 0x7300747865746e6f
+ 7.11% 0.00% ffmpeg_g ffmpeg_g [.] context_to_name
+ 6.61% 6.59% ffmpeg_g ffmpeg_g [.] vvc_sad
+ 6.32% 6.30% ffmpeg_g ffmpeg_g [.] derive_bdof_vx_vy_8.constprop.0
+ 5.01% 4.99% ffmpeg_g ffmpeg_g [.] vvc_deblock_bs
There are 8-bit versions of SAD in pixelutils that could be used but, as far as I can tell, there are no 16-bit versions (after a loose search, it seems all existing SAD implementations use psadbw, which operates on bytes).
As such, adding a 16bpc SAD could be beneficial for performance.
I've started some initial work here: https://github.com/stone-d-chen/ffvvc/pull/6
The 16bpc version seems mostly done; I'll run a few more benchmarks before moving on to 8bpc.
VVC_HFR_UHDTV2_OpenGOP_7680x4320_100fps_SDR.bit
0.89% 0.88% ffmpeg_g ffmpeg_g [.] ff_vvc_sad_16_16bpc_avx2
+ 0.87% 0.86% vf#0:0 ffmpeg_g [.] ff_sad16_sse2
0.30% 0.29% vf#0:0 ffmpeg_g [.] ff_sad8_mmxext
0.21% 0.20% vf#0:0 ffmpeg_g [.] ff_sad16_approx_xy2_sse2
0.20% 0.20% vf#0:0 ffmpeg_g [.] sad_hpel_motion_search
0.10% 0.10% vf#0:0 ffmpeg_g [.] ff_sad16_x2_sse2
0.08% 0.08% vf#0:0 ffmpeg_g [.] ff_sad16_y2_sse2
0.06% 0.06% enc0:0:mpeg4 ffmpeg_g [.] ff_sad16_sse2
0.02% 0.02% enc0:0:mpeg4 ffmpeg_g [.] ff_sad8_mmxext
0.01% 0.01% enc0:0:mpeg4 ffmpeg_g [.] ff_sad16_approx_xy2_sse2
0.01% 0.01% enc0:0:mpeg4 ffmpeg_g [.] sad_hpel_motion_search
0.01% 0.01% enc0:0:mpeg4 ffmpeg_g [.] ff_sad16_x2_sse2
0.01% 0.01% enc0:0:mpeg4 ffmpeg_g [.] ff_sad16_y2_sse2
AVX2:
- vvc_sad.check_vvc_sad_8_16bpc [OK]
- vvc_sad.check_vvc_sad_16_16bpc [OK]
- vvc_sad.check_vvc_sad_32_16bpc [OK]
- vvc_sad.check_vvc_sad_64_16bpc [OK]
- vvc_sad.check_vvc_sad_128_16bpc [OK]
checkasm: all 5 tests passed
vvc_sad_8_16bpc_c: 135.5
vvc_sad_8_16bpc_avx2: 15.5
vvc_sad_16_16bpc_c: 275.5
vvc_sad_16_16bpc_avx2: 25.5
vvc_sad_32_16bpc_c: 1085.5
vvc_sad_32_16bpc_avx2: 85.5
vvc_sad_64_16bpc_c: 4255.5
vvc_sad_64_16bpc_avx2: 375.5
vvc_sad_128_16bpc_c: 17505.5
vvc_sad_128_16bpc_avx2: 1945.5
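To make the checkasm numbers above easier to compare, here is a small C sketch that turns them into C-to-AVX2 speedup ratios. The struct and function names are made up for illustration, and reading the number in each test name as the block width is my assumption; the timings are copied verbatim from the output above and are in checkasm's own timer units.

```c
#include <stddef.h>

/* Hypothetical record for one checkasm result pair. */
struct sad_bench {
    int    width;   /* assumed: block width taken from the test name */
    double c;       /* timing of the C reference */
    double avx2;    /* timing of the AVX2 version */
};

/* Timings copied from the checkasm output above. */
static const struct sad_bench sad_bench_16bpc[] = {
    {   8,   135.5,   15.5 },
    {  16,   275.5,   25.5 },
    {  32,  1085.5,   85.5 },
    {  64,  4255.5,  375.5 },
    { 128, 17505.5, 1945.5 },
};

/* Speedup of the AVX2 version relative to the C reference. */
static double sad_speedup(const struct sad_bench *b)
{
    return b->c / b->avx2;
}
```

With these figures the ratios work out to roughly 9x to 13x, depending on the block size.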
Hi @nuomi2021 (cc @QSXW), I've made a pull request here: https://github.com/ffvvc/FFmpeg/pull/213
Though I've realized that maybe I should have based my code on ffvvc/main and not ffvvc/up?
Thank you for the patch. Please:
- use the https://github.com/ffmpeg/FFmpeg master branch
- cherry-pick this commit onto it
- send a PR to ffvvc/up
Hi, I've created a new pull request here: https://github.com/ffvvc/FFmpeg/pull/215
I believe I did this correctly, but I'm still relatively unfamiliar with git haha.