FFmpeg HEVC IDCT port

This PR ports FFmpeg's HEVC IDCT optimisations.

To-do:

[x] Port 4x4 transform
[x] Port 8x8 transform
[x] Port 16x16 transform
[x] Port 32x32 transform
[ ] Add 2x2 transform
[ ] Add 64x64 transform
[ ] Add rectangular transforms
[ ] Add 1D transforms
[ ] Change residual type to int32_t
[ ] Refactor template code for 1-D functions to reduce duplication

Checkasm benchmark results:

inv_dct2_dct2_4x4_8_c: 226.2
inv_dct2_dct2_4x4_8_avx2: 26.2
inv_dct2_dct2_4x4_10_c: 188.2
inv_dct2_dct2_4x4_10_avx2: 24.7
inv_dct2_dct2_8x8_8_c: 704.7
inv_dct2_dct2_8x8_8_avx2: 124.7
inv_dct2_dct2_8x8_10_c: 751.2
inv_dct2_dct2_8x8_10_avx2: 124.7
inv_dct2_dct2_16x16_8_c: 4289.7
inv_dct2_dct2_16x16_8_avx2: 621.2
inv_dct2_dct2_16x16_10_c: 4335.2
inv_dct2_dct2_16x16_10_avx2: 625.2
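
For reference, the speed-up implied by each pair of figures above is the C timing divided by the AVX2 timing. The snippet below is only a sketch of that arithmetic: the `bench` struct and the `speedup` and `print_speedups` helpers are hypothetical names introduced here and are not part of checkasm or this PR. The numbers are copied from the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: ratio of the C reference timing to the
 * optimised timing, both in the same (checkasm) unit. */
double speedup(double c_time, double simd_time)
{
    assert(c_time > 0.0 && simd_time > 0.0);
    return c_time / simd_time;
}

/* One row per transform size and bit depth, copied from the results above. */
struct bench {
    const char *name;
    double c;
    double avx2;
};

const struct bench results[] = {
    { "4x4_8",     226.2,  26.2 },
    { "4x4_10",    188.2,  24.7 },
    { "8x8_8",     704.7, 124.7 },
    { "8x8_10",    751.2, 124.7 },
    { "16x16_8",  4289.7, 621.2 },
    { "16x16_10", 4335.2, 625.2 },
};

/* Print the speed-up for every benchmarked size. */
void print_speedups(void)
{
    for (size_t i = 0; i < sizeof(results) / sizeof(results[0]); i++)
        printf("inv_dct2_dct2_%s: %.2fx\n", results[i].name,
               speedup(results[i].c, results[i].avx2));
}
```

Calling `print_speedups()` from a `main` gives ratios between roughly 5.6x (8x8, 8-bit) and 8.6x (4x4, 8-bit) for the sizes benchmarked so far.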

perf.py results:

Bitstream	Before	After	Delta
RitualDance_1920x1080_60_10_420_32_LD	99.7	99.3	-0.4%
RitualDance_1920x1080_60_10_420_37_RA	88.3	87.7	-0.5%
Tango2_3840x2160_60_10_420_27_LD	23.0	23.0	0.0%

perf.py currently shows no gain because the DCT's effect on overall decoding performance is dominated by the larger sizes, which have not yet been implemented. The slight decrease is explained by the additional overhead of optimising at the 2D level, the benefits of which are not yet being reaped. As the larger sizes are implemented, performance is expected to increase substantially, in line with the checkasm benchmark results.

Aug 27 '23 14:08 frankplow

Hi @frankplow, it seems int32_t is only needed by the range extension. If the range extension is not enabled, we can keep the transform coeffs as int16_t. I will try to make some changes for this. Hopefully this will reduce the porting effort.

Dec 05 '23 09:12 nuomi2021

> Hi @frankplow, it seems int32_t is only needed by the range extension. If the range extension is not enabled, we can keep the transform coeffs as int16_t. I will try to make some changes for this. Hopefully this will reduce the porting effort.

Yeah I think if we take this approach, it shouldn't be too hard to get transforms implemented for the square sizes. Unfortunately, I think it will be hard to extend the HEVC optimisations to rectangular sizes and MTS as the way it's written doesn't facilitate much code reuse/modularity. I have a branch where I've worked on a more modular optimisation, based on some of the custom ABI ideas dav1d uses, but I'm having to write this from the ground up and don't have much time alongside my Master's at the moment. I think then, the best way to get optimisations in for these most common square sizes is to, as you say, allow varying the coeff type based on whether the range extension is active and then port the HEVC transforms.
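
As a rough illustration of varying the coeff type, the sketch below decides whether int16_t is wide enough for the transform coefficients. It assumes the VVC-style rule in which the coefficient dynamic range is 15 bits plus sign unless extended precision is enabled, in which case it is max(15, min(20, bit_depth + 6)). That rule is stated here as an assumption, and the function names are hypothetical, not taken from this PR or from the decoder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

/* Assumed rule: number of magnitude bits in a transform coefficient.
 * Without extended precision this is always 15, i.e. the int16_t range. */
int log2_transform_range(bool extended_precision, int bit_depth)
{
    return extended_precision ? max_int(15, min_int(20, bit_depth + 6)) : 15;
}

/* True when coefficients can be stored as int16_t, so that transforms
 * written for 16-bit coefficients can be reused unchanged. */
bool coeffs_fit_int16(bool extended_precision, int bit_depth)
{
    return log2_transform_range(extended_precision, bit_depth) <= 15;
}

/* Clamp a value to the signed range [-(1 << log2_range), (1 << log2_range) - 1]. */
int32_t clip_coeff(int64_t v, int log2_range)
{
    assert(log2_range >= 1 && log2_range <= 31);
    const int64_t lo = -((int64_t)1 << log2_range);
    const int64_t hi = ((int64_t)1 << log2_range) - 1;
    return (int32_t)(v < lo ? lo : (v > hi ? hi : v));
}
```

A decoder could evaluate `coeffs_fit_int16()` once per sequence and pick the 16-bit or 32-bit code path accordingly. Under the assumed rule, only streams with extended precision and a bit depth above 9 would need the wider type.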

Dec 11 '23 13:12 frankplow

> but I'm having to write this from the ground up and don't have much time alongside my Master's at the moment

No worries. I will continue your work after I have done the thread optimizations.

Dec 16 '23 08:12 nuomi2021