astcenc_internal.h and astcenc_mathlib.h are at the top of compile time in ClangBuildAnalyzer
Just letting you know that compile times when building with the astc encoder are slower because of repeated includes of these two headers. This is purely parse time, since I don't have them in a precompiled header. A pch could cut these times, but not everyone uses pch files (though they should). I've been finding this tool from Aras pretty insightful.
If these headers could be isolated and included fewer times, the overall timings would drop.
*** Expensive headers:
2331 ms: astc-encoder/astcenc_internal.h (included 20 times, avg 116 ms), included via:
  astcenc_symbolic_physical.o  (143 ms)
  astcenc_percentile_tables.o  (141 ms)
  astcenc_quantization.o  (138 ms)
  astcenc_weight_quant_xfer_tables.o  (137 ms)
  astcenc_color_unquantize.o  (135 ms)
  astcenc_platform_isa_detection.o  (133 ms)

361 ms: astc-encoder/astcenc_mathlib.h (included 22 times, avg 16 ms), included via:
  astcenc_weight_quant_xfer_tables.o astcenc_internal.h  (29 ms)
  astcenc_percentile_tables.o astcenc_internal.h  (23 ms)
  astcenc_mathlib_softfloat.o  (21 ms)
  astcenc_symbolic_physical.o astcenc_internal.h  (19 ms)
  astcenc_block_sizes.o astcenc_internal.h  (19 ms)
  astcenc_find_best_partitioning.o astcenc_internal.h  (19 ms)
I can't actually move that into my astc_internal.h because of this
// these pull in string from system_error which is slow to instantiate on macOS
#include <condition_variable>
#include <mutex>
On macOS at least, mutex, thread, condition_variable, and random cause every file that includes them to instantiate 5 copies of basic_string, even if they are in the pch. Building these unused template types then slows the overall build. mathlib might be possible to precompile.
Cutting out the use of mutex, condition_variable, and atomic with a define in ParallelManager dropped the timings to the numbers below. I'm not using threads within astcenc; I have 1 thread or process per texture. So that cut 1.5s out. This also means these headers could be precompiled to pre-gen the sse/neon calls.
*** Expensive headers:
573 ms: /Users/Alec/devref/kram/libkram/astc-encoder/astcenc_internal.h (included 20 times, avg 28 ms), included via:
  astcenc_ideal_endpoints_and_weights.o  (37 ms)
  astcenc_image.o  (37 ms)
  astcenc_color_unquantize.o  (36 ms)
  astcenc_compress_symbolic.o  (36 ms)
  astcenc_compute_variance.o  (35 ms)
  astcenc_entry.o  (33 ms)
  ...

441 ms: /Users/Alec/devref/kram/libkram/astc-encoder/astcenc_mathlib.h (included 22 times, avg 20 ms), included via:
  astcenc_compute_variance.o astcenc_internal.h  (28 ms)
  astcenc_mathlib.o  (26 ms)
  astcenc_image.o astcenc_internal.h  (26 ms)
  astcenc_entry.o astcenc_internal.h  (25 ms)
  astcenc_compress_symbolic.o astcenc_internal.h  (24 ms)
  astcenc_ideal_endpoints_and_weights.o astcenc_internal.h  (24 ms)
  ...
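For anyone wanting to try the same thing, here is a minimal sketch of what a define-based cut like that could look like. It is not the actual change: SKETCH_SINGLE_THREADED and SketchParallelManager are hypothetical names, and the class is a simplified stand-in rather than the real ParallelManager. The point is only that the threading header and the lock member both disappear from the translation unit when the define is set.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical switch (not an astcenc build option): 1 = single-threaded build.
#ifndef SKETCH_SINGLE_THREADED
#define SKETCH_SINGLE_THREADED 1
#endif

// The expensive header is only parsed in the multi-threaded variant.
#if !SKETCH_SINGLE_THREADED
#include <mutex>
#endif

// Simplified stand-in for a task manager that hands out ranges of work.
class SketchParallelManager {
public:
    // Reset the manager for a new batch of task_count tasks.
    void init(std::size_t task_count) {
        m_next = 0;
        m_count = task_count;
    }

    // Hand out up to `granule` tasks. Returns the first task index and
    // writes the number of tasks assigned to `count` (0 when finished).
    std::size_t get_task_assignment(std::size_t granule, std::size_t& count) {
        assert(granule > 0);
#if !SKETCH_SINGLE_THREADED
        std::lock_guard<std::mutex> guard(m_lock);
#endif
        std::size_t base = m_next;
        std::size_t remaining = m_count - base;
        count = remaining < granule ? remaining : granule;
        m_next += count;
        return base;
    }

private:
#if !SKETCH_SINGLE_THREADED
    std::mutex m_lock;  // only present when threads are in use
#endif
    std::size_t m_next = 0;
    std::size_t m_count = 0;
};
```

With one thread or process per texture, the single-threaded variant behaves the same as the locked one, since only one caller ever asks for work.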
The mathlib really is used everywhere, so I think that one is unavoidable.
I'd really like to avoid another build variant to maintain, so I think the "proper fix" for the parallel manager is to split the context into two: an inner context and an outer context. The outer context includes the parallel manager and related headers, and is only needed in the library entry layer. The rest of the codec uses only the inner context, which doesn't include them.
Still needs a bit of clean up, but thinking something like this PR: #368.
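As a rough illustration of the inner/outer split described above (not the code from the PR), the layering could look something like this. All names here are hypothetical, and the two "headers" are shown in one file for brevity; the comments mark where each part would live.

```cpp
#include <cassert>
#include <mutex>  // in a real split, only the entry-layer header would need this

// --- Part 1: would live in the internal header every codec file includes. ---
// Nothing here depends on threading headers.
struct sketch_inner_context {
    unsigned int block_x = 4;
    unsigned int block_y = 4;
};

// Codec-side code sees only the inner context.
inline unsigned int sketch_texels_per_block(const sketch_inner_context& ctx) {
    assert(ctx.block_x > 0 && ctx.block_y > 0);
    return ctx.block_x * ctx.block_y;
}

// --- Part 2: would live in a header included only by the entry layer. ---
// Simplified stand-in for the parallel manager and its threading state.
struct sketch_parallel_manager {
    std::mutex lock;
    unsigned int next_task = 0;
};

// The outer context wraps the inner one and adds the threading machinery.
struct sketch_outer_context {
    sketch_inner_context inner;
    sketch_parallel_manager manager;
};

// Entry-layer code takes the outer context, does the synchronization,
// and passes only the inner context down into the codec.
inline unsigned int sketch_entry_compress(sketch_outer_context& ctx) {
    std::lock_guard<std::mutex> guard(ctx.manager.lock);
    ctx.manager.next_task++;
    return sketch_texels_per_block(ctx.inner);
}
```

The design choice is that codec source files never name the outer type, so they never need to parse the mutex and condition_variable headers, and no extra build variant is required.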
On WSL compile times improve by ~20% in Debug builds, and ~10% for Release builds.
Merged.
Nice, you build the code a lot more than any of the rest of us. So glad it sped things up.