FFmpeg HEVC IDCT port

Open frankplow opened this issue 2 years ago • 3 comments

This PR ports FFmpeg's HEVC IDCT optimisations.

To-do:

  • [x] Port 4x4 transform
  • [x] Port 8x8 transform
  • [x] Port 16x16 transform
  • [x] Port 32x32 transform
  • [ ] Add 2x2 transform
  • [ ] Add 64x64 transform
  • [ ] Add rectangular transforms
  • [ ] Add 1D transforms
  • [ ] Change residual type to int32_t
  • [ ] Refactor template code for 1-D functions to reduce duplication

Checkasm benchmark results:

inv_dct2_dct2_4x4_8_c: 226.2
inv_dct2_dct2_4x4_8_avx2: 26.2
inv_dct2_dct2_4x4_10_c: 188.2
inv_dct2_dct2_4x4_10_avx2: 24.7
inv_dct2_dct2_8x8_8_c: 704.7
inv_dct2_dct2_8x8_8_avx2: 124.7
inv_dct2_dct2_8x8_10_c: 751.2
inv_dct2_dct2_8x8_10_avx2: 124.7
inv_dct2_dct2_16x16_8_c: 4289.7
inv_dct2_dct2_16x16_8_avx2: 621.2
inv_dct2_dct2_16x16_10_c: 4335.2
inv_dct2_dct2_16x16_10_avx2: 625.2
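
To read these timings as speedups, divide each C figure by the matching AVX2 figure. A minimal sketch; the helper name `speedup` is hypothetical and is not part of checkasm:

```c
/* Ratio of the C reference timing to the SIMD timing.
 * Both arguments are checkasm figures for the same function and bit depth. */
double speedup(double c_time, double simd_time)
{
    return c_time / simd_time;
}
```

For example, `speedup(226.2, 26.2)` for the 8-bit 4x4 transform gives roughly 8.6x, and `speedup(4289.7, 621.2)` for the 8-bit 16x16 transform gives roughly 6.9x.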

perf.py results:

| Bitstream | Before | After | Delta |
|---|---|---|---|
| RitualDance_1920x1080_60_10_420_32_LD | 99.7 | 99.3 | -0.4% |
| RitualDance_1920x1080_60_10_420_37_RA | 88.3 | 87.7 | -0.5% |
| Tango2_3840x2160_60_10_420_27_LD | 23.0 | 23.0 | 0.0% |

The perf.py results show no improvement yet because the DCT's effect on overall decoding performance is dominated by the larger sizes, which have not yet been implemented. The slight decrease is explained by the additional overhead of optimising at the 2D level, whose benefits are not yet being reaped here. As the larger sizes are implemented, performance is expected to increase substantially, in line with the checkasm benchmark results.
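
For context on what "optimising at the 2D level" replaces: a scalar inverse DCT is usually written as two passes of a 1-D transform, first over columns and then over rows. The sketch below is not the code in this PR. It assumes the HEVC 4x4 integer DCT-2 basis (64, 83, 36) and the HEVC shift convention (7 for the first pass, 20 minus the bit depth for the second), and the function names are hypothetical.

```c
#include <stdint.h>

/* Saturate an intermediate value to the int16_t range. */
static int16_t clip16(int v)
{
    return v < -32768 ? -32768 : v > 32767 ? 32767 : (int16_t)v;
}

/* One 4-point inverse DCT-2 over src[0], src[stride], src[2*stride],
 * src[3*stride]. All inputs are read before any output is written, so
 * dst may alias src. Right-shifting a negative int is assumed to be an
 * arithmetic shift, as on all common compilers. */
static void idct4_1d(int16_t *dst, const int16_t *src, int stride, int shift)
{
    const int add = 1 << (shift - 1);               /* rounding offset */
    const int e0  = 64 * src[0] + 64 * src[2 * stride];
    const int e1  = 64 * src[0] - 64 * src[2 * stride];
    const int o0  = 83 * src[stride] + 36 * src[3 * stride];
    const int o1  = 36 * src[stride] - 83 * src[3 * stride];

    dst[0 * stride] = clip16((e0 + o0 + add) >> shift);
    dst[1 * stride] = clip16((e1 + o1 + add) >> shift);
    dst[2 * stride] = clip16((e1 - o1 + add) >> shift);
    dst[3 * stride] = clip16((e0 - o0 + add) >> shift);
}

/* Separable 4x4 inverse transform, in place on a row-major block. */
void idct4x4_sketch(int16_t coeffs[16], int bit_depth)
{
    for (int x = 0; x < 4; x++)                     /* pass 1: columns */
        idct4_1d(coeffs + x, coeffs + x, 4, 7);
    for (int y = 0; y < 4; y++)                     /* pass 2: rows */
        idct4_1d(coeffs + 4 * y, coeffs + 4 * y, 1, 20 - bit_depth);
}
```

A 2D-level SIMD implementation keeps the whole block in registers across both passes, which pays off more as the block grows; for small blocks the setup cost can outweigh the gain.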

frankplow, Aug 27 '23 14:08

Hi @frankplow, it seems int32_t is only needed by the range extension. If the range extension is not enabled, we can keep the transform coefficients as int16_t. I will try to make some changes for this. Hopefully this will reduce the porting effort.

nuomi2021, Dec 05 '23 09:12

> Hi @frankplow, it seems int32_t is only needed by the range extension. If the range extension is not enabled, we can keep the transform coefficients as int16_t. I will try to make some changes for this. Hopefully this will reduce the porting effort.

Yeah, I think if we take this approach it shouldn't be too hard to get transforms implemented for the square sizes. Unfortunately, I think it will be hard to extend the HEVC optimisations to rectangular sizes and MTS, as the way they are written doesn't facilitate much code reuse or modularity. I have a branch where I've worked on a more modular optimisation, based on some of the custom ABI ideas dav1d uses, but I'm having to write this from the ground up and don't have much time alongside my Master's at the moment. I think, then, that the best way to get optimisations in for the most common square sizes is, as you say, to allow the coeff type to vary based on whether the range extension is active and then port the HEVC transforms.
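
One way the coeff type could vary is to instantiate each kernel once per type and pick the set at setup time. This is only a sketch under assumptions: every name below (`DEFINE_BUTTERFLY2`, `CoeffOps`, `coeff_ops_init`) is hypothetical and none of it is taken from FFmpeg, and a 2-point butterfly stands in for a real transform.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical template: emit the same 2-point butterfly for a given
 * coefficient type. The sums are formed in int64_t so neither
 * instantiation overflows before the final narrowing cast. */
#define DEFINE_BUTTERFLY2(name, type)          \
    static void name(void *buf)                \
    {                                          \
        type *c = buf;                         \
        const int64_t a = c[0], b = c[1];      \
        c[0] = (type)(a + b);                  \
        c[1] = (type)(a - b);                  \
    }

DEFINE_BUTTERFLY2(butterfly2_16, int16_t)
DEFINE_BUTTERFLY2(butterfly2_32, int32_t)

typedef void (*butterfly2_fn)(void *buf);

/* Hypothetical per-stream choice of coefficient width and kernels. */
typedef struct CoeffOps {
    size_t        elem_size;   /* bytes per coefficient */
    butterfly2_fn butterfly2;  /* kernel matching that width */
} CoeffOps;

/* Use 32-bit coefficients only when the range extension is active. */
CoeffOps coeff_ops_init(int range_extension_enabled)
{
    CoeffOps ops;
    if (range_extension_enabled) {
        ops.elem_size  = sizeof(int32_t);
        ops.butterfly2 = butterfly2_32;
    } else {
        ops.elem_size  = sizeof(int16_t);
        ops.butterfly2 = butterfly2_16;
    }
    return ops;
}
```

With this shape, the int16_t path could point at the ported HEVC assembly while the int32_t path stays on the C fallback until wider kernels exist.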

frankplow, Dec 11 '23 13:12

> but I'm having to write this from the ground up and don't have much time alongside my Master's at the moment

No worries. I will continue your work after I have done the thread optimizations.

nuomi2021, Dec 16 '23 08:12