bc7enc_rdo
bc7enc: Optimize "find approximate selector" branch chains
Description
Several BC7 code paths have branch chains which sequentially compare a value against an array of thresholds. These chains are long enough that compilers have trouble converting them to branchless operations.
In all of these code paths, the value produced by the branch chain is a direct dependency of the subsequent code. This often results in a pipeline stall, because the branches can't be easily predicted.
To improve this, convert each branch chain to a branchless loop. Compiler optimizations will inline and unroll the loop, significantly improving codegen and making room for further compiler optimizations (such as auto-vectorization).
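As a rough illustration of the transformation (not the actual bc7enc code; the threshold table `kThresholds` and both function names are hypothetical), the sketch below shows a branch chain and an equivalent branchless loop. It assumes the thresholds are sorted in ascending order and the input is not NaN.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical ascending threshold table: midpoints between 8 evenly
// spaced levels on [0, 1]. The real tables in bc7enc differ.
static const float kThresholds[7] = {
    1.0f / 14, 3.0f / 14, 5.0f / 14, 7.0f / 14,
    9.0f / 14, 11.0f / 14, 13.0f / 14
};

// Branch chain: each comparison is a hard-to-predict branch, and the
// result is only known once the chain exits.
inline uint32_t select_branchy(float v) {
    if (v < kThresholds[0]) return 0;
    else if (v < kThresholds[1]) return 1;
    else if (v < kThresholds[2]) return 2;
    else if (v < kThresholds[3]) return 3;
    else if (v < kThresholds[4]) return 4;
    else if (v < kThresholds[5]) return 5;
    else if (v < kThresholds[6]) return 6;
    return 7;
}

// Branchless loop: count how many thresholds the value has reached.
// With ascending thresholds this count equals the index the chain
// would return. The fixed trip count lets the compiler unroll it.
inline uint32_t select_branchless(float v) {
    uint32_t sel = 0;
    for (size_t i = 0; i < 7; ++i)
        sel += static_cast<uint32_t>(v >= kThresholds[i]);
    return sel;
}

// Sanity check: both forms agree on a sweep of non-NaN inputs.
inline void check_equivalence() {
    for (int k = -10; k <= 110; ++k) {
        const float v = static_cast<float>(k) / 100.0f;
        assert(select_branchy(v) == select_branchless(v));
    }
}
```

Each comparison in the loop yields 0 or 1 and is added to the result, so there is no data-dependent control flow for the branch predictor to miss.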
Results
I compiled bc7enc.exe using Clang-CL 17 on Windows and ran it on an AMD Ryzen 9 7950X system for this data. I also spot-checked MSVC, and it appears to receive similar performance benefits.
Before changes:
Command: ./bc7enc.exe tv_albedo_1024x1024.png
Total encoding time: 0.197000 secs
Total processing time: 0.206000 secs
Command: ./bc7enc.exe camera-mountain-3024x4032.png
Total encoding time: 4.757000 secs
Total processing time: 4.772000 secs
After changes:
Command: ./bc7enc.exe tv_albedo_1024x1024.png
Total encoding time: 0.186000 secs
Total processing time: 0.195000 secs
Command: ./bc7enc.exe camera-mountain-3024x4032.png
Total encoding time: 4.429000 secs
Total processing time: 4.445000 secs
That is a reduction in total encoding time of roughly 5.6% for the 1024x1024 image and 6.9% for the 3024x4032 image.
If needed, I can provide some images that show the difference in x86 codegen before and after the changes.