cmake: use -flto=auto compiler flag when supported, rework fast-math disablement
cmake: use -flto=auto compiler flag when supported
Use the -flto=auto compiler flag when supported. This silences the following GCC warning:
lto-wrapper: warning: using serial compilation of # LTRANS jobs
This also greatly speeds up link time, as it enables LTO multithreading in GCC (either through the Make jobserver if one is detected, or by detecting the number of CPU cores).
Also always set the LTO flag when LTO is enabled.
Similar to:
- https://github.com/DaemonEngine/Daemon/pull/1571
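As a rough command-line illustration of the "when supported" check, the sketch below probes whether a compiler accepts -flto=auto by building a trivial program. This is not the CMake code from this PR: `compiler_accepts_flag` is a hypothetical helper name, the compiler is taken from `$CC` (falling back to `cc`), and the probe only looks at the exit status, so a compiler that merely warns about the flag would still count as accepting it.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): succeeds if the compiler named
# by $CC (default: cc) compiles and links a trivial C program with the flag.
compiler_accepts_flag() {
    tmpdir=$(mktemp -d) || return 1
    # "-x c -" makes the compiler read C source from standard input.
    printf 'int main(void) { return 0; }\n' \
        | "${CC:-cc}" "$1" -x c - -o "$tmpdir/probe" >/dev/null 2>&1
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}

# Pick the parallel LTO flag when available, plain -flto otherwise.
if compiler_accepts_flag -flto=auto; then
    echo "LTO flag: -flto=auto"
else
    echo "LTO flag: -flto"
fi
```

The output file goes to a temporary directory rather than /dev/null so that the linker never touches a device node.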
cmake: rework the fast-math enablement and force the disablement
Some compilers may enable fast-math by default (for example, ICC). Some floating-point contractions are still safe and can be enabled.
cmake: fix typo
Any numbers on performance impact of the strict math flags? Reproducible floating-point operations are of somewhat niche interest and have limited chance to be reliably achieved between differing compilers and architectures, so I think that would be a bad default if it costs more than a few percent of speed.
Caring about fast math in an offline conversion tool looks very niche to me; the option can still be enabled by those who want to use it as a library. In fact, it is even safer for those building against crnlib not to have fast math enabled for their whole program behind their back.
Speed difference between fast math or not is negligible (see below).
While reproducibility is fragile when fast math is disabled, it becomes impossible when fast math is enabled, so I don't see the benefit of fast math.
The time difference may be exaggerated by slower image decoding (like slower PNG decoding).
Benchmarks
Environment
Test machine: 16-core, 32-thread AMD Zen 2 CPU, DDR4 RAM with enough space to fit the whole benchmark corpus in cache.
Protocol
Images are loaded into the memory cache before the runs; during a run, images are read from the memory cache and the output is written to /dev/null. Crunch is a release build with LTO enabled.
32 crunch executions run in parallel (one crunch per CPU thread, one image per crunch execution), and each crunch execution uses 15 threads (because it uses min((nproc - 1), 15) by default).
About 10 runs are done sequentially for every configuration; the 5 best ones are kept (to filter out runs that may be disturbed by unrelated tasks spawning in the background).
# Print number of images to convert.
find *.???dir \( -name '*.jpg' -o -name '*.png' -o -name '*.tga' \) | wc -l
# Preload image files into memory.
find *.???dir \( -name '*.jpg' -o -name '*.png' -o -name '*.tga' \) -print0 \
| parallel --will-cite -0 -I{} -P"$(nproc)" dd if={} of=/dev/null bs=1M status=none
# Convert image files.
time find *.???dir \( -name '*.jpg' -o -name '*.png' -o -name '*.tga' \) -print0 \
| parallel --will-cite --halt 0 -0 -I{} -P"$(nproc)" crunch -quiet -file {} -fileformat crn -out /dev/null
I only added --halt 0 to skip a Xonotic jpg that crunch does not accept (I will report it); no image is skipped in the other corpora.
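The "run about 10 times, keep the 5 best" step of the protocol could be scripted roughly as follows. This is only a sketch, not the script actually used for the numbers below: `run_and_keep_best` is a hypothetical helper, and it assumes GNU `date` for the `%N` (nanoseconds) format.

```shell
#!/bin/sh
# Hypothetical helper: run a command RUNS times, print the wall-clock time
# of each run in seconds, and keep only the KEEP fastest runs.
# Usage: run_and_keep_best RUNS KEEP command [args...]
# Example: run_and_keep_best 10 5 ./convert-corpus.sh   (script name is made up)
run_and_keep_best() {
    runs=$1
    keep=$2
    shift 2
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s.%N)   # assumes GNU date
        "$@" >/dev/null 2>&1
        end=$(date +%s.%N)
        LC_ALL=C awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }'
        i=$((i + 1))
    done | LC_ALL=C sort -n | head -n "$keep"
}
```

Sorting numerically and taking the head discards the slowest runs, which are the ones most likely to have been disturbed by background activity.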
Results
Corpus: UnvanquishedAssets
Files: 2656 images; 153 jpg, 2486 png, 17 tga.
Fast math times:
real 2m40,024s
real 2m40,504s
real 2m40,575s
real 2m40,610s
real 2m40,692s
Strict math times:
real 2m46,021s
real 2m46,026s
real 2m46,105s
real 2m46,517s
real 2m46,680s
Average time difference per image: 0.002 s
Corpus: InterstellarOasis
Files: 5160 images; 3097 jpg, 2063 png, 0 tga.
Fast math times:
real 1m20,441s
real 1m20,481s
real 1m20,557s
real 1m20,617s
real 1m20,531s
Strict math times:
real 1m23,275s
real 1m23,369s
real 1m23,084s
real 1m23,050s
real 1m22,908s
Average time difference per image: 0.0005 s
Corpus: Xonotic
Files: 10041 images; 2351 jpg, 101 png, 7589 tga.
Fast math times:
real 5m48,586s
real 5m51,403s
real 5m51,649s
real 5m51,784s
real 5m51,985s
Strict math times:
real 5m57,966s
real 5m58,812s
real 5m59,601s
real 6m0,113s
real 6m0,327s
Average time difference per image: 0.0008 s
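For anyone who wants to recompute the per-image figures, here is a small sketch of the arithmetic, using the UnvanquishedAssets times listed above. `mean_seconds` is a hypothetical helper name; it accepts the comma-decimal `time` output shown in this thread. The relative slowdown is derived here from the same numbers and is not a figure quoted elsewhere in the thread.

```shell
#!/bin/sh
# Hypothetical helper: read times such as "2m40,024s" (one per line) on
# standard input and print their mean in seconds.
mean_seconds() {
    tr ',' '.' | LC_ALL=C awk -F'[ms]' \
        '{ sum += $1 * 60 + $2; n++ } END { if (n) printf "%.3f\n", sum / n }'
}

# UnvanquishedAssets corpus: 2656 images.
fast=$(printf '%s\n' 2m40,024s 2m40,504s 2m40,575s 2m40,610s 2m40,692s | mean_seconds)
strict=$(printf '%s\n' 2m46,021s 2m46,026s 2m46,105s 2m46,517s 2m46,680s | mean_seconds)
images=2656

# Wall-clock difference spread over the number of images, plus the relative slowdown.
LC_ALL=C awk -v f="$fast" -v s="$strict" -v n="$images" \
    'BEGIN { printf "%.4f s per image, %.1f%% slower\n", (s - f) / n, (s - f) / f * 100 }'
# prints "0.0022 s per image, 3.6% slower"
```

Since 32 conversions run in parallel, the per-image value is the total wall-clock difference divided by the image count, not the extra time a single conversion takes.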