FFmpeg optimize put_uni_pixels_N_128x128 AVX2/SSE4 code

See https://github.com/ffvvc/FFmpeg/pull/146#issuecomment-1749907342. We have a similar issue for put_pixels too; see https://github.com/ffvvc/FFmpeg/pull/145#issuecomment-1749894316

Oct 06 '23 02:10 nuomi2021

How to reproduce it: make checkasm -j && ./tests/checkasm/checkasm --test=vvc_mc --bench

Oct 06 '23 02:10 nuomi2021

Hi, I have been investigating the performance issue. It looks like the memcpy in the C code moves 128 bytes in a single iteration, while the SSE4 code moves 16 bytes in a single iteration. Could this be the reason for the slowness?

This was the code I saw while debugging.

Memcopy Code

ff_vvc_put_uni_pixels16_8_sse4

[Screenshot, 2024-02-04 4:56 PM]
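To make the observation above concrete, here is a minimal sketch in portable C. It is not the FFmpeg code: `copy_rows_chunked`, `copy_rows_memcpy`, and `iterations_per_row` are hypothetical helpers, and the chunked version only imitates a SIMD loop that moves a fixed number of bytes per iteration (16 for one SSE register). It assumes 8-bit pixels, so a 128-pixel row is 128 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from FFmpeg): copy each row in fixed-size chunks,
 * imitating a SIMD loop that moves `chunk` bytes per inner iteration. */
static void copy_rows_chunked(uint8_t *dst, ptrdiff_t dst_stride,
                              const uint8_t *src, ptrdiff_t src_stride,
                              int width, int height, int chunk)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x += chunk) {
            /* Last chunk of a row may be shorter than `chunk`. */
            int n = (width - x < chunk) ? (width - x) : chunk;
            memcpy(dst + x, src + x, (size_t)n);
        }
        dst += dst_stride;
        src += src_stride;
    }
}

/* Hypothetical helper (not from FFmpeg): copy one whole row per iteration,
 * which is what a plain memcpy-based C path does. */
static void copy_rows_memcpy(uint8_t *dst, ptrdiff_t dst_stride,
                             const uint8_t *src, ptrdiff_t src_stride,
                             int width, int height)
{
    for (int y = 0; y < height; y++) {
        memcpy(dst, src, (size_t)width);
        dst += dst_stride;
        src += src_stride;
    }
}

/* Number of inner-loop iterations needed to cover one row. */
static int iterations_per_row(int width, int chunk)
{
    return (width + chunk - 1) / chunk;
}
```

With a 128-byte row, `iterations_per_row(128, 16)` is 8 while `iterations_per_row(128, 128)` is 1, so the chunked loop pays loop overhead eight times per row. Whether that overhead fully explains the measured gap is still an assumption; the checkasm benchmark is the way to confirm it.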

Feb 04 '24 17:02 rohanjulka19

@rohanjulka19, sorry for missing your post. Yes, this may be the reason. Could you help send 3 patches to the mailing list for this? One for HEVC, one for VVC, and then another patch to remove the SSE 128 code.

Also, some 64xX sizes have similar issues. Could you help check those as well? Thank you.

put_luma_uni_pixels_8_64x4_c: 10.1
put_luma_uni_pixels_8_64x4_sse4: 24.6
put_luma_uni_pixels_8_64x4_avx2: 15.1

Jul 20 '24 02:07 nuomi2021

Comments and the commit log are important too. A patch is easier to merge if it's clear to reviewers.

Jul 20 '24 02:07 nuomi2021