Potential 1.15 performance regression

Open Simn opened this issue 11 months ago • 2 comments

Our benchmarks show a performance regression starting on March 23rd, which coincides with the 1.15 release. This applies to both Haxe nightlies and Haxe 4, and affects both HL/JIT and HL/C.

It's particularly bad for the allocation-bound benchmarks like mandelbrot, but some normal ones like the formatter also took a hit. The only one that got better is the Dox benchmark, which suggests some improvement related to dynamic field handling or something along those lines.

@Apprentice-Alchemist mentioned that the benchmarks went from HL 1.13 to 1.15, so this might be an older regression introduced in 1.14.

Simn avatar Mar 28 '25 16:03 Simn

Bisected the mandelbrot performance regression to https://github.com/HaxeFoundation/hashlink/commit/9289265cbb6418e6ae4aae6e80063679d62ab032 (added gc parallel marking)

When testing on latest master, setting GC_MAX_MARK_THREADS to 1 makes it perform well again, whereas setting GC_MAX_MARK_THREADS to anything higher regresses performance. Constraining the number of threads to 1 at runtime does not improve performance.

When constrained to one thread at runtime but with GC_MAX_MARK_THREADS > 1, the only thing that changes is that atomic_bit_set uses an atomic intrinsic. When there is only one mark thread this function is only used on this line in gc_flush_mark: https://github.com/HaxeFoundation/hashlink/blob/6794cdbe4407d26f405e5978890de67d4d42a96d/src/gc.c#L765

So my current theory is that the overhead of atomics outweighs the advantage of parallel mark threads, at least for allocation-heavy code like mandelbrot.

Apprentice-Alchemist avatar Mar 28 '25 19:03 Apprentice-Alchemist

That's quite an interesting result. I wonder if we get the same performance differences on Windows, or if it comes down to the implementation of atomic_bit_set depending on the CPU instructions used (in which case we might need a few additional gcc flags to see if it can be improved).

The benefit of using multiple marking threads shows mainly when the working set is large enough that it doesn't fit into the CPU cache; in that case using every core's cache and parallelizing the DRAM waits is optimal. If, on the other hand, the app is CPU-bound rather than DRAM-bound, the atomic operations are pure overhead.

Maybe one solution would be to dynamically adjust the number of actually used threads based on the total amount of current GC memory.

Ping @yuxiaomao

ncannasse avatar Mar 31 '25 08:03 ncannasse