cmake: use -flto=auto compiler flag when supported, rework fast-math disablement

Open illwieckz opened this issue 1 year ago • 1 comments

cmake: use -flto=auto compiler flag when supported

Use the -flto=auto compiler flag when supported. This silences the following GCC warning:

lto-wrapper: warning: using serial compilation of # LTRANS jobs

This also greatly speeds up link time, as it enables LTO multithreading in GCC (either by using the Make jobserver if detected, or by detecting the number of CPU cores).

Also always set LTO if enabled.

Similar to:

https://github.com/DaemonEngine/Daemon/pull/1571
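For illustration, the kind of support check described above can be sketched outside CMake as a shell probe. This is not the CMake code from this change: `supports_flag` is a hypothetical helper that only reports whether a compiler driver accepts a flag when compiling and linking a trivial program.

```shell
# Hypothetical helper, not part of this change.
# Usage: supports_flag COMPILER FLAG
# Returns 0 if COMPILER builds a trivial program with FLAG, non-zero otherwise.
supports_flag() {
    cc="$1"
    flag="$2"
    tmpdir="$(mktemp -d)" || return 2
    printf 'int main(void) { return 0; }\n' > "$tmpdir/probe.c"
    # -Werror turns "unknown or unused argument" warnings into failures.
    "$cc" -Werror "$flag" "$tmpdir/probe.c" -o "$tmpdir/probe" >/dev/null 2>&1
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}
```

Example use: `supports_flag "${CC:-cc}" -flto=auto && echo "-flto=auto is usable"`.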

cmake: rework the fast-math enablement and force the disablement

Some compilers may enable fast-math by default (example: ICC). Some contractions are still safe and can be enabled.
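As a rough way to see whether a compiler turns fast-math on without being asked, its predefined macros can be inspected. A minimal sketch, assuming a GCC or Clang style driver that defines `__FAST_MATH__` when fast-math is active. Other compilers may not define that macro, so a negative result proves nothing for them. `has_fast_math_macro` is a hypothetical helper name.

```shell
# Hypothetical helper: reads a preprocessor macro dump on stdin and
# succeeds if __FAST_MATH__ is defined in it.
has_fast_math_macro() {
    grep -q '^#define __FAST_MATH__ '
}

# Example use with a GCC or Clang style driver (dumps the predefined macros):
#   ${CC:-cc} -dM -E -x c /dev/null | has_fast_math_macro && echo "fast-math is on"
```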

cmake: fix typo

Mar 05 '25 15:03 illwieckz

Any numbers on performance impact of the strict math flags? Reproducible floating-point operations are of somewhat niche interest and have limited chance to be reliably achieved between differing compilers and architectures, so I think that would be a bad default if it costs more than a few percent of speed.

Caring about fast math in an offline conversion tool looks very niche to me, and the option can still be enabled by those who want to use crunch as a library. In fact, it is even safer for those building against crnlib not to have fast math enabled for their whole program behind their back.

The speed difference with and without fast math is negligible (see below).

Reproducibility is fragile when fast math is disabled, but it is impossible when fast math is enabled, so I don't see the benefit of enabling it.

The time difference may be exaggerated by slower image decoding (like slower PNG decoding).

Benchmarks

Environment

Test machine: 16 cores, 32 threads AMD Zen 2 CPU, DDR4 RAM with enough space to fit the whole benchmark corpus in cache.

Protocol

Images are loaded into the memory cache before the runs. During a run, images are read from the memory cache and the output is written to /dev/null. Crunch is a release build built with LTO enabled.

32 crunch executions are done in parallel (one crunch per CPU thread, one image per crunch execution), and each crunch execution uses 15 threads (because it uses min((nproc - 1), 15) by default).
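The thread count above follows from simple arithmetic. A minimal sketch, assuming the default is nproc - 1 capped at 15, which is what gives 15 on this 32-thread machine. `helper_threads` is a hypothetical helper, not part of crunch.

```shell
# Hypothetical helper, assuming the default is (nproc - 1) capped at 15.
# Usage: helper_threads NPROC
helper_threads() {
    n=$(( $1 - 1 ))
    if [ "$n" -gt 15 ]; then n=15; fi   # upper cap
    if [ "$n" -lt 0 ]; then n=0; fi     # never negative
    echo "$n"
}
```

For example, `helper_threads 32` prints 15 and `helper_threads 4` prints 3.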

About 10 runs are done sequentially for every configuration, and the 5 best ones are kept (to filter out runs that may be disturbed by unrelated tasks spawning in the background).

# Print number of images to convert.
find *.???dir \( -name '*.jpg' -o -name '*.png' -o -name '*.tga' \) | wc -l

# Preload image files into memory.
find *.???dir \( -name '*.jpg' -o -name '*.png' -o -name '*.tga' \) -print0 \
| parallel --will-cite -0 -I{} -P"$(nproc)" dd if={} of=/dev/null bs=1M status=none

# Convert image files.
time find *.???dir \( -name '*.jpg' -o -name '*.png' -o -name '*.tga' \) -print0 \
| parallel --will-cite --halt 0 -0 -I{} -P"$(nproc)" crunch -quiet -file {} -fileformat crn -out /dev/null

I only added --halt 0 to skip a Xonotic jpg that is not accepted by crunch (I will report it); no image is skipped in the other corpora.

Results

Corpus: UnvanquishedAssets

Files: 2656 images; 153 jpg, 2486 png, 17 tga.

Fast math times:

real	2m40,024s
real	2m40,504s
real	2m40,575s
real	2m40,610s
real	2m40,692s

Strict math times:

real	2m46,021s
real	2m46,026s
real	2m46,105s
real	2m46,517s
real	2m46,680s

Average time difference per image: 0.002 s

Corpus: InterstellarOasis

Files: 5160 images; 3097 jpg, 2063 png, 0 tga.

Fast math times:

real	1m20,441s
real	1m20,481s
real	1m20,557s
real	1m20,617s
real	1m20,531s

Strict math times:

real	1m23,275s
real	1m23,369s
real	1m23,084s
real	1m23,050s
real	1m22,908s

Average time difference per image: 0.0005 s

Corpus: Xonotic

Files: 10041 images; 2351 jpg, 101 png, 7589 tga.

Fast math times:

real	5m48,586s
real	5m51,403s
real	5m51,649s
real	5m51,784s
real	5m51,985s

Strict math times:

real	5m57,966s
real	5m58,812s
real	5m59,601s
real	6m0,113s
real	6m0,327s

Average time difference per image: 0.0008 s
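The per-image figures above can be recomputed from the `time` output. A minimal sketch, assuming the `real` lines of one configuration are piped in as printed above. `mean_seconds` and `per_image_diff` are hypothetical helper names, and both `,` and `.` are accepted as decimal separators.

```shell
# Read `time` lines such as "real 2m40,024s" on stdin and print their mean in seconds.
mean_seconds() {
    LC_ALL=C awk '
        NF == 0 { next }            # skip blank lines
        {
            t = $NF                 # last field, e.g. "2m40,024s"
            gsub(",", ".", t)       # normalise the decimal separator
            sub("s$", "", t)        # drop the trailing unit
            split(t, part, "m")     # part[1] = minutes, part[2] = seconds
            sum += part[1] * 60 + part[2]
            runs++
        }
        END { if (runs) printf "%.3f\n", sum / runs }
    '
}

# Usage: per_image_diff FAST_MEAN STRICT_MEAN IMAGE_COUNT
# Prints the mean wall-clock difference per image, in seconds.
per_image_diff() {
    LC_ALL=C awk -v fast="$1" -v strict="$2" -v images="$3" \
        'BEGIN { printf "%.4f\n", (strict - fast) / images }'
}
```

With the five UnvanquishedAssets runs of each configuration, the means come out as 160.481 s and 166.270 s, and `per_image_diff 160.481 166.270 2656` prints 0.0022, in line with the 0.002 s reported above.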

Jul 17 '25 00:07 illwieckz