bc7enc_rdo bc7enc: Optimize "find approximate selector" branch chains

Open abbriggs opened this issue 4 months ago • 0 comments

Description

Several BC7 code paths have branch chains which sequentially compare a value against an array of thresholds. These chains are long enough that compilers have trouble converting them to branchless operations.

In all of these code paths, the value produced by the branch chain is a direct dependency of the subsequent code. Because the branches depend on the input data and are hard to predict, this often results in pipeline stalls from mispredictions.

To improve this, convert each branch chain to a branchless loop that counts how many thresholds the value exceeds. With optimizations enabled, compilers can inline and fully unroll such a loop, which improves codegen and makes room for further optimizations (such as auto-vectorization).
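For illustration, here is a minimal sketch of the transformation, not the actual bc7enc code. The threshold table `kThresholds` and both function names are hypothetical, and the sketch assumes the thresholds are sorted in ascending order and the input is not NaN.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical ascending thresholds for an 8-entry selector (illustrative values only).
static const float kThresholds[7] = {0.5f, 1.5f, 2.5f, 3.5f, 4.5f, 5.5f, 6.5f};

// Branch-chain form: returns the index of the first threshold that v does not exceed.
inline uint32_t find_selector_chain(float v) {
    if (v <= kThresholds[0]) return 0;
    else if (v <= kThresholds[1]) return 1;
    else if (v <= kThresholds[2]) return 2;
    else if (v <= kThresholds[3]) return 3;
    else if (v <= kThresholds[4]) return 4;
    else if (v <= kThresholds[5]) return 5;
    else if (v <= kThresholds[6]) return 6;
    return 7;
}

// Branchless form: counts the thresholds that v exceeds. Because the table is
// ascending, that count equals the index the chain above would return.
// The fixed trip count lets the compiler unroll the loop into compares and adds.
inline uint32_t find_selector_branchless(float v) {
    uint32_t sel = 0;
    for (size_t i = 0; i < 7; ++i)
        sel += static_cast<uint32_t>(v > kThresholds[i]);
    return sel;
}
```

The two forms agree for every non-NaN input. They differ for NaN (the chain falls through to 7, the loop yields 0), so a real conversion would need to confirm that NaN cannot reach this code.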

Results

I compiled bc7enc.exe using Clang-CL 17 on Windows and ran it on an AMD Ryzen 9 7950X system for this data. I also spot-checked MSVC, and it appears to see similar performance benefits.

Before changes:

Command: ./bc7enc.exe tv_albedo_1024x1024.png
Total encoding time: 0.197000 secs
Total processing time: 0.206000 secs

Command: ./bc7enc.exe camera-mountain-3024x4032.png
Total encoding time: 4.757000 secs
Total processing time: 4.772000 secs

After changes:

Command: ./bc7enc.exe tv_albedo_1024x1024.png
Total encoding time: 0.186000 secs
Total processing time: 0.195000 secs

Command: ./bc7enc.exe camera-mountain-3024x4032.png
Total encoding time: 4.429000 secs
Total processing time: 4.445000 secs

If needed, I can provide screenshots showing the difference in x86 codegen before and after the changes.

Sep 24 '24 21:09 abbriggs
