vvc_deblock.c: fix RANDCLIP
Previously RANDCLIP(x, diff) was computing x - diff and then clipping it between (0, max_pixel_val + rnd() % (2 * diff)). This means we were not really generating a random value in the intended range around x.
Instead, compute (x - diff) + rnd() % (2 * diff). This returns a value such that abs(value - x) <= diff.
This greatly improves the generation of strong deblocking data.
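As a rough illustration of the corrected formula, here is a minimal standalone sketch. It is not the checkasm macro itself: `rand_near` is a hypothetical name, and `rnd()` is replaced by a small xorshift generator so the example compiles on its own.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for checkasm's rnd(): a 32-bit xorshift generator. */
static uint32_t rng_state = 0x12345678u;

static uint32_t rnd(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 17;
    rng_state ^= rng_state << 5;
    return rng_state;
}

/*
 * Sketch of the fixed computation: start at x - diff and add a random
 * offset in [0, 2 * diff - 1], so the result lies in
 * [x - diff, x + diff - 1] and abs(result - x) <= diff.
 * Assumes diff > 0; any final clipping to the pixel range is left out.
 */
static int rand_near(int x, int diff)
{
    assert(diff > 0);
    return (x - diff) + (int)(rnd() % (2u * (unsigned)diff));
}
```

For example, `rand_near(100, 8)` always lands in the range 92 to 107 inclusive.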
@nuomi2021 I've been looking into further improvements to the luma generation; it seems fairly non-trivial.
One of the main issues is that the (d0 << 1) < beta_2 condition occasionally fails in the filter template,
where d0 = abs(P2 - 2 * P1 + P0) + abs(Q2 - 2 * Q1 + Q0).
The current code does actually try to compensate for it: d0 < (beta_2 >> 1), where beta_2 >> 1 is beta_3, implies (d0 << 1) < beta_2 (the two conditions are equivalent when beta_2 is even).
It becomes difficult to solve both constraints while also satisfying (d0 + d1 < beta).
I attempted to put it into a computer algebra solver (wxMaxima) but it's quite messy.
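To make the relation between the two thresholds concrete, here is a small sketch. The function names are hypothetical and not taken from the filter template or the test generator; it only encodes the definition of d0 and the two comparisons discussed above.

```c
#include <stdlib.h>

/*
 * Second-difference activity on one line across the edge:
 * p[0..2] are P0..P2 and q[0..2] are Q0..Q2.
 */
static int activity_d0(const int p[3], const int q[3])
{
    return abs(p[2] - 2 * p[1] + p[0]) + abs(q[2] - 2 * q[1] + q[0]);
}

/* The condition as written in the filter template. */
static int passes_template(int d0, int beta_2)
{
    return (d0 << 1) < beta_2;
}

/* The condition the generator aims for, with beta_3 = beta_2 >> 1. */
static int passes_generator(int d0, int beta_2)
{
    return d0 < (beta_2 >> 1);
}
```

With integer shifts, the generator condition is the stricter of the two: d0 = 2 and beta_2 = 5 passes the template check (4 < 5) but not the generator check (2 < 2). Targeting beta_3 is therefore safe, though it can reject a few inputs that the filter would accept.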
If it's difficult, perhaps we should leave it as it is for now. I'll rebase the code to the latest version to fix the fuzz issue. Then, we can work together on two tasks:
- Enabling a larger filter for luma—this is the last missing part.
- Enabling AVX2—this will further improve performance.
Which one do you prefer?
AVX2 sounds good. We need to modify the C side to expose multiple blocks, right? I'm trying to learn more about video decoding overall, so some more exposure to the C side would be good.
> AVX2 sounds good,
👍
> we need to modify the C side to expose multiple blocks, right?
Yes, we need to set up parameters for a single line within a CTU. SSE can process 16 bytes at a time, AVX2 can handle 32 bytes, and AVX-512 can manage 64 bytes per operation.
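As a back-of-the-envelope sketch of what those register widths mean per operation, the helper below converts bytes to samples. The name `samples_per_vector` is hypothetical, and it assumes that bit depths above 8 are stored as 16-bit words.

```c
#include <stddef.h>

/* Register widths in bytes, as listed above. */
enum { SSE_BYTES = 16, AVX2_BYTES = 32, AVX512_BYTES = 64 };

/*
 * Number of samples that fit in one vector register.
 * Assumption: 8-bit content uses 1 byte per sample and
 * higher bit depths use 2 bytes per sample.
 */
static size_t samples_per_vector(size_t vector_bytes, int bit_depth)
{
    size_t bytes_per_sample = bit_depth > 8 ? 2 : 1;
    return vector_bytes / bytes_per_sample;
}
```

Under that assumption, one AVX2 register holds 32 samples of 8-bit content or 16 samples of 10-bit content.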
> I'm trying to learn more about video decoding overall, so some more exposure to the C side would be good.
You can start from https://www.amazon.com/Coding-Video-Practical-Guide-Beyond/dp/1118711785 :)