optimize put_uni_pixels_N_128x128 AVX2/SSE4 code
See https://github.com/ffvvc/FFmpeg/pull/146#issuecomment-1749907342. We have a similar issue for put_pixels too; see https://github.com/ffvvc/FFmpeg/pull/145#issuecomment-1749894316
How to reproduce it: make checkasm -j && ./tests/checkasm/checkasm --test=vvc_mc --bench
Hi, I have been investigating the performance issue. It seems the memcpy in the C code moves 128 bytes in a single iteration, while the SSE4 code moves 16 bytes in a single iteration. Could this be the reason for the slowness?
This was the code I saw while debugging.
memcpy code
ff_vvc_put_uni_pixels16_8_sse4
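To make the comparison above concrete, here is a minimal sketch in portable C, not the actual FFmpeg code. It assumes the 8-bit uni pixel copy boils down to one memcpy per row, and it models the 16-byte SSE4 kernel as a loop over 16-byte chunks. The function names `copy_rows_memcpy`, `copy_rows_16` and `iterations_per_row` are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed shape of the C reference: one memcpy per row, so a 128-byte
 * row is handed to the C library in a single call. */
static void copy_rows_memcpy(uint8_t *dst, ptrdiff_t dst_stride,
                             const uint8_t *src, ptrdiff_t src_stride,
                             int width, int height)
{
    for (int y = 0; y < height; y++) {
        memcpy(dst, src, (size_t)width);
        dst += dst_stride;
        src += src_stride;
    }
}

/* Hypothetical model of a 16-byte-wide kernel: each inner iteration
 * moves exactly 16 bytes, standing in for one load/store pair on an
 * XMM register. Width must be a multiple of 16. */
static void copy_rows_16(uint8_t *dst, ptrdiff_t dst_stride,
                         const uint8_t *src, ptrdiff_t src_stride,
                         int width, int height)
{
    assert(width % 16 == 0);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x += 16)
            memcpy(dst + x, src + x, 16);
        dst += dst_stride;
        src += src_stride;
    }
}

/* Number of loop iterations needed to cover one row when each
 * iteration moves bytes_per_iter bytes (rounded up). */
static int iterations_per_row(int width, int bytes_per_iter)
{
    return (width + bytes_per_iter - 1) / bytes_per_iter;
}
```

Under these assumptions a 128-byte row takes 8 iterations at 16 bytes each, 4 at 32 bytes (one YMM register), and a single memcpy call in the C path. Whether that iteration count alone explains the measured gap is not established here; the C library's memcpy may also use wide vector moves internally.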
@rohanjulka19, sorry for missing your post. Yes, this may be the reason. Could you help send 3 patches to the mailing list for this? One for hevc, one for vvc, and then you can remove sse 128 using another patch.
Also, some 64xX sizes have similar issues. Could you help check those too? Thank you.
put_luma_uni_pixels_8_64x4_c: 10.1
put_luma_uni_pixels_8_64x4_sse4: 24.6
put_luma_uni_pixels_8_64x4_avx2: 15.1
Comments and the commit log are important too. A patch is easier to merge when it is clear to reviewers.