FFmpeg Add AVX2 assembly code for SAD

Based on "Add AVX2 assembly code for inter predict" (#51).

DMVR (decoder-side motion vector refinement) computes SAD on PUs with the following constraints:

w >= 8, h >= 8, w*h >= 128
only computed on even rows to reduce complexity
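
To make the constraints concrete, here is a minimal scalar sketch of a SAD over 16-bit samples that only visits even rows. The names `dmvr_sad_size_ok` and `sad_even_rows` and the stride-in-samples parameter are made up for illustration. This is not the actual `vvc_sad` from the decoder, just the idea as I understand it.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical check mirroring the size constraints listed above:
 * w >= 8, h >= 8, w*h >= 128. */
static int dmvr_sad_size_ok(int w, int h)
{
    return w >= 8 && h >= 8 && w * h >= 128;
}

/* Hypothetical scalar SAD over 16-bit samples.
 * Only even rows (y = 0, 2, 4, ...) are visited, which halves the work.
 * stride is given in samples, not bytes (an assumption of this sketch). */
static int sad_even_rows(const int16_t *src0, const int16_t *src1,
                         ptrdiff_t stride, int w, int h)
{
    int sad = 0;

    for (int y = 0; y < h; y += 2) {
        const int16_t *r0 = src0 + (ptrdiff_t)y * stride;
        const int16_t *r1 = src1 + (ptrdiff_t)y * stride;

        for (int x = 0; x < w; x++)
            sad += abs(r0[x] - r1[x]);
    }
    return sad;
}
```

For an 8x16 block this touches 8 of the 16 rows, so the inner loop runs 64 times instead of 128.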

VVC_HDR_UHDTV2_OpenGOP_7680x4320_50fps_HLG10_HighBitrate.bit
+   11.94%     0.00%  ffmpeg_g      [unknown]             [.] 0000000000000000                          
+    7.02%     0.00%  ffmpeg_g      [unknown]             [.] 0x7300747865746e6f                        
+    7.02%     0.00%  ffmpeg_g      ffmpeg_g              [.] context_to_name                           
+    5.06%     0.76%  ffmpeg_g      ffmpeg_g              [.] ff_vvc_reconstruct                        
     4.69%     4.65%  ffmpeg_g      ffmpeg_g              [.] vvc_sad                                   
+    4.54%     4.49%  ffmpeg_g      ffmpeg_g              [.] vvc_deblock_bs                            
+    4.11%     4.07%  ffmpeg_g      ffmpeg_g              [.] dmvr_hv_10

VVC_HFR_UHDTV1_OpenGOP_3840x2160_100fps_SDR.bit
+   10.07%     0.00%  ffmpeg_g      [unknown]             [.] 0000000000000000             
+    7.71%     0.00%  ffmpeg_g      [unknown]             [.] 0x7300747865746e6f       
+    7.71%     0.00%  ffmpeg_g      ffmpeg_g              [.] context_to_name           
+    5.50%     5.48%  ffmpeg_g      libc.so.6             [.] __memmove_avx512_unaligned_erms        
+    4.93%     4.93%  vf#0:0        ffmpeg_g              [.] planarCopyWrapper        
+    4.89%     4.89%  ffmpeg_g      ffmpeg_g              [.] vvc_deblock_bs      
+    4.10%     4.10%  ffmpeg_g      ffmpeg_g              [.] dmvr_hv_10          
     4.00%     3.99%  ffmpeg_g      ffmpeg_g              [.] vvc_sad

VVC_HFR_UHDTV2_OpenGOP_7680x4320_100fps_SDR.bit
+    9.99%     0.00%  ffmpeg_g      [unknown]             [.] 0000000000000000
+    7.11%     0.00%  ffmpeg_g      [unknown]             [.] 0x7300747865746e6f
+    7.11%     0.00%  ffmpeg_g      ffmpeg_g              [.] context_to_name
+    6.61%     6.59%  ffmpeg_g      ffmpeg_g              [.] vvc_sad
+    6.32%     6.30%  ffmpeg_g      ffmpeg_g              [.] derive_bdof_vx_vy_8.constprop.0
+    5.01%     4.99%  ffmpeg_g      ffmpeg_g              [.] vvc_deblock_bs

There are 8-bit versions of SAD in pixelutils that could be used but, as far as I can tell, no 16-bit versions (after a loose search, it seems all existing SAD implementations use psadbw, which operates on bytes).

As such, adding a 16bpc SAD could be beneficial for performance.
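
Since psadbw only sums byte differences, a 16-bit SAD needs a different instruction sequence. Below is a rough sketch with AVX2 intrinsics of one way to do it for a single row: subtract, take the absolute value, then widen to 32 bits with pmaddwd against a vector of ones. The function name `sad_row_16bpc` is made up, and this is not taken from the pull request. It assumes `w` is a multiple of 16 and that sample differences fit in a signed 16-bit value. When the file is not compiled with AVX2 enabled it falls back to a plain scalar loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper: SAD over one row of 16-bit samples.
 * Assumptions: w is a multiple of 16, and a[x] - b[x] fits in int16_t. */
static int sad_row_16bpc(const int16_t *a, const int16_t *b, int w)
{
    int sad = 0;
#if defined(__AVX2__)
    const __m256i ones = _mm256_set1_epi16(1);
    __m256i acc = _mm256_setzero_si256();

    for (int x = 0; x < w; x += 16) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + x));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + x));
        /* |a - b| per 16-bit lane (psubw + pabsw) */
        __m256i d  = _mm256_abs_epi16(_mm256_sub_epi16(va, vb));
        /* pmaddwd with ones adds adjacent 16-bit lanes into 32-bit lanes */
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(d, ones));
    }

    /* Horizontal sum of the eight 32-bit lanes. */
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0x4E)); /* swap 64-bit halves */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0xB1)); /* swap 32-bit pairs */
    sad = _mm_cvtsi128_si32(s);
#else
    /* Scalar fallback when AVX2 is not enabled at compile time. */
    for (int x = 0; x < w; x++)
        sad += abs(a[x] - b[x]);
#endif
    return sad;
}
```

A hand-written assembly version would presumably handle each block width separately and keep the accumulator in registers across rows; the sketch only shows the per-row arithmetic.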

Apr 13 '24 14:04 stone-d-chen

I've started some initial work here: https://github.com/stone-d-chen/ffvvc/pull/6

Apr 13 '24 14:04 stone-d-chen

16bpc seems mostly done. I'll run a few more benchmarks before moving on to 8bpc.

VVC_HFR_UHDTV2_OpenGOP_7680x4320_100fps_SDR.bit

     0.89%     0.88%  ffmpeg_g      ffmpeg_g  [.] ff_vvc_sad_16_16bpc_avx2
+    0.87%     0.86%  vf#0:0        ffmpeg_g  [.] ff_sad16_sse2
     0.30%     0.29%  vf#0:0        ffmpeg_g  [.] ff_sad8_mmxext
     0.21%     0.20%  vf#0:0        ffmpeg_g  [.] ff_sad16_approx_xy2_sse2
     0.20%     0.20%  vf#0:0        ffmpeg_g  [.] sad_hpel_motion_search
     0.10%     0.10%  vf#0:0        ffmpeg_g  [.] ff_sad16_x2_sse2
     0.08%     0.08%  vf#0:0        ffmpeg_g  [.] ff_sad16_y2_sse2
     0.06%     0.06%  enc0:0:mpeg4  ffmpeg_g  [.] ff_sad16_sse2
     0.02%     0.02%  enc0:0:mpeg4  ffmpeg_g  [.] ff_sad8_mmxext
     0.01%     0.01%  enc0:0:mpeg4  ffmpeg_g  [.] ff_sad16_approx_xy2_sse2
     0.01%     0.01%  enc0:0:mpeg4  ffmpeg_g  [.] sad_hpel_motion_search
     0.01%     0.01%  enc0:0:mpeg4  ffmpeg_g  [.] ff_sad16_x2_sse2
     0.01%     0.01%  enc0:0:mpeg4  ffmpeg_g  [.] ff_sad16_y2_sse2

AVX2:
 - vvc_sad.check_vvc_sad_8_16bpc   [OK]
 - vvc_sad.check_vvc_sad_16_16bpc  [OK]
 - vvc_sad.check_vvc_sad_32_16bpc  [OK]
 - vvc_sad.check_vvc_sad_64_16bpc  [OK]
 - vvc_sad.check_vvc_sad_128_16bpc [OK]
checkasm: all 5 tests passed
vvc_sad_8_16bpc_c: 135.5
vvc_sad_8_16bpc_avx2: 15.5

vvc_sad_16_16bpc_c: 275.5
vvc_sad_16_16bpc_avx2: 25.5

vvc_sad_32_16bpc_c: 1085.5
vvc_sad_32_16bpc_avx2: 85.5

vvc_sad_64_16bpc_c: 4255.5
vvc_sad_64_16bpc_avx2: 375.5

vvc_sad_128_16bpc_c: 17505.5
vvc_sad_128_16bpc_avx2: 1945.5

Apr 15 '24 23:04 stone-d-chen

Hi @nuomi2021 (cc @QSXW), I've made a pull request here: https://github.com/ffvvc/FFmpeg/pull/213

Though I realized that maybe I should've been basing my code off of ffvvc/main and not ffvvc/up?

Apr 17 '24 11:04 stone-d-chen

Thank you for the patch. Please use the

https://github.com/ffmpeg/FFmpeg master branch,
cherry-pick this commit onto it,
and send a PR to ffvvc/up.

Apr 17 '24 12:04 nuomi2021

Hi, I've created a new pull request here https://github.com/ffvvc/FFmpeg/pull/215

I believe I did this correctly but I'm still relatively unfamiliar with git haha.

Apr 18 '24 02:04 stone-d-chen