
Performance in recursive fork-join is very bad

Open tzcnt opened this issue 2 months ago • 7 comments

I maintain a set of benchmarks to compare the performance of executors. https://github.com/tzcnt/runtime-benchmarks currently it mostly tracks fork-join performance. I have recently added HPX to this benchmark suite in PR https://github.com/tzcnt/runtime-benchmarks/pull/4.

I was shocked to discover that HPX on average is 900x slower than the fastest runtime. See the summary table on the repo README, or the full dataset chart here: https://fleetcode.com/runtime-benchmarks/ Note that I have only run this on one machine so far (with 128GB ram) - because on a different machine, it would take literally hours to complete the benchmark run, or run out of memory... whereas the competing runtimes can complete a benchmark run in a few minutes.

You can read the PR description above for a complete description of all the things that I tried, but the summary is that I wasn't able to achieve reasonable performance despite spending many hours trying different approaches.

I also discovered that the abp-priority-lifo scheduler is broken and doesn't actually use the correct scheduling policy. I had to manually edit this line to reference hpx::threads::policies::lockfree_abp_lifo instead, after which I was able to verify that the correct scheduler was being used.

I expected this to solve the problem, as the memory blowup I observed is what usually happens when a runtime uses strictly FIFO scheduling: for Skynet, FIFO can leave up to 10^8 = 100,000,000 tasks active at once, rather than the 10*8 = 80 active tasks you get when the local task queue is scheduled LIFO. I thought the lockfree_abp_lifo policy would prevent this problem, but it made no measurable impact.
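
To make the scheduling-order argument concrete, here is a minimal sketch of the Skynet recursion as I would write it against hpx::async (the authoritative implementations are in my repo and in HPX's own skynet.cpp linked below):

// Sketch of the Skynet recursion: each task spawns 10 children until
// size reaches 1. With FIFO scheduling every level is fully expanded
// before any leaf runs, so up to 10^8 tasks can be alive at once; with
// LIFO the newest child runs first, keeping only ~10 tasks per level
// (~80 total) alive.
#include <hpx/hpx_main.hpp>
#include <hpx/future.hpp>
#include <cstdint>
#include <cstdio>
#include <vector>

std::uint64_t skynet(std::uint64_t num, std::uint64_t size)
{
    if (size == 1)
        return num;

    std::vector<hpx::future<std::uint64_t>> children;
    children.reserve(10);
    for (std::uint64_t i = 0; i != 10; ++i)
        children.push_back(
            hpx::async(skynet, num + i * (size / 10), size / 10));

    std::uint64_t sum = 0;
    for (auto& f : children)
        sum += f.get();
    return sum;
}

int main()
{
    // 10^6 leaves is enough to show the scheduling effect.
    std::printf("%llu\n",
        static_cast<unsigned long long>(skynet(0, 1000000)));
    return 0;
}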

I see issue https://github.com/STEllAR-GROUP/hpx/issues/3348 and PR https://github.com/STEllAR-GROUP/hpx/pull/4744, but I'm not sure whether either would resolve this problem.

I'd also like to note that 3 of the benchmarks I'm running have nearly equivalent implementations in your examples, so I'd expect that it would be possible to make these run in a performant manner. https://github.com/STEllAR-GROUP/hpx/blob/master/examples/quickstart/fibonacci_one.cpp https://github.com/STEllAR-GROUP/hpx/blob/master/tests/performance/local/skynet.cpp https://github.com/STEllAR-GROUP/hpx/tree/master/examples/nqueen

I'd like to give you an opportunity to present your library in the best light, so I'm open to any feedback on how I can optimize my implementation / build flags to make HPX perform better.

tzcnt avatar Oct 20 '25 01:10 tzcnt

Hello @tzcnt, thanks so much for putting together this benchmark.

One thing I noticed is that you manually enable HPX_WITH_CXX11_ATOMIC_128BIT. This is supposed to be an internal flag, set automatically at configuration time: it's ON if your platform supports lock-free atomics on 128-bit types.

Forcing it ON on the wrong platform will cause performance to deteriorate, as HPX will mistakenly use 128-bit atomics as if they were lock-free. This might explain why so much time is spent in kernel space.

Pansysk75 avatar Oct 20 '25 04:10 Pansysk75

This machine definitely has lock-free 128-bit atomics, but HPX did not detect them automatically. I am using Clang, which has known issues with this: https://github.com/llvm/llvm-project/issues/75081. I assumed that if I forced this flag to ON and then built both HPX and my application with -mcx16, the appropriate 128-bit instructions would be generated.
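
As a sanity check, a standalone probe (plain C++, nothing HPX-specific; built with -mcx16) shows what the toolchain reports:

// Probe: does this toolchain treat a 16-byte type as lock-free?
// Note: with libstdc++, is_lock_free() may still return false even when
// cmpxchg16b is available, because the call is routed through libatomic -
// which is exactly the detection problem described in the Clang issue above.
#include <atomic>
#include <cstdint>
#include <cstdio>

struct alignas(16) u128
{
    std::uint64_t lo;
    std::uint64_t hi;
};

int main()
{
    std::atomic<u128> a(u128{});
    std::printf("128-bit atomic is_lock_free: %s\n",
        a.is_lock_free() ? "true" : "false");
    return 0;
}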

If my assumption was incorrect, then what's the solution?

  • use libc++ instead of libstdc++?
  • use a different scheduler (which do you recommend?)

However, I don't think that's the issue, based on my empirical testing:

I commented out these three lines in the CMakeLists.txt and reconfigured:

# set(HPX_WITH_CXX11_ATOMIC_128BIT ON CACHE BOOL "")
# add_definitions("-DHPX_HAVE_CXX11_STD_ATOMIC_128BIT")
# add_compile_options("-mcx16")

I verified that building and running the program gives the error "Command line option --hpx:queuing=abp-priority-lifo is not configured in this build. Please make sure 128bit atomics are available."

I removed the --hpx:queuing command line parameter and rebuilt/reran both the fib and nqueens benchmarks (with reduced problem size). The performance was identical (<1% deviation) to the original.

I did a test run with perf on the original code and found a cmpxchg16b in the generated release assembly. This indicates that the forced override to enable this feature is working.

Samples: 684K of event 'cycles:P', 4000 Hz, Event count (approx.): 429407081042
hpx::lockfree::deque<hpx::threads::policies::thread_queue<std::mutex, hpx::threads::policies::lockfree_abp_lifo, hpx::threads::policies::lockfree_fifo, hpx::th
   0.05 │       mov     %rcx,%rsi
   0.07 │       add     %rax,%rcx
   0.04 │       mov     0x10(%rsp),%rax
   0.05 │       or      %rdi,%rcx
   3.63 │       lock    cmpxchg16b (%r14)
   0.51 │     ↑ jne     9e

I also looked in the perf output for references to any function named atomic, where I would expect the polyfill implementation to appear if 128-bit atomics were not enabled, and I found no such function names.

tzcnt avatar Oct 20 '25 05:10 tzcnt

Okay, it makes sense why you had to override that then. And that snippet from perf is exactly what I had in mind. So that is probably not the issue.

@hkaiser

Pansysk75 avatar Oct 20 '25 05:10 Pansysk75

I was shocked to discover that HPX on average is 900x slower than the fastest runtime. See the summary table on the repo README, or the full dataset chart here: https://fleetcode.com/runtime-benchmarks/ Note that I have only run this on one machine so far (with 128GB ram) - because on a different machine, it would take literally hours to complete the benchmark run, or run out of memory... whereas the competing runtimes can complete a benchmark run in a few minutes.

If you had asked beforehand you may have been able to save yourself from being 'shocked' ;-)

Using the fork-join executor in recursive ways will lead to significant performance deterioration. This executor creates one thread per core (actually N-1 threads), binds them to the cores, and lets them spin-wait for work to be submitted. It's meant to be used for running several parallel algorithms in a row.
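
For illustration, the intended usage pattern looks roughly like this (a sketch, not lifted from the docs):

// Create the fork_join_executor once - its worker threads spin-wait
// between calls - and reuse it across a sequence of parallel algorithms,
// rather than constructing executors recursively inside tasks.
#include <hpx/hpx_main.hpp>
#include <hpx/algorithm.hpp>
#include <hpx/execution.hpp>
#include <vector>

int main()
{
    std::vector<double> v(1'000'000, 1.0);

    hpx::execution::experimental::fork_join_executor exec;

    // Several algorithms in a row amortize the executor's setup cost.
    hpx::for_each(hpx::execution::par.on(exec), v.begin(), v.end(),
        [](double& x) { x *= 2.0; });
    hpx::for_each(hpx::execution::par.on(exec), v.begin(), v.end(),
        [](double& x) { x += 1.0; });

    return 0;
}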

This machine definitely has lock-free 128bit atomics. However HPX did not detect them automatically. I am using Clang and it has known issues with this llvm/llvm-project#75081. I assumed that if I forced this to ON and then built both HPX and my application with -mcx16, the appropriate 128bit instructions would be generated.

Even if you force HPX to use 128-bit atomics on platforms that provide them, it will still rely on the standard library's implementation of std::atomic<uint128_t>, which is not lock-free most of the time (from what I have seen, all compilers rely on a conservative implementation in glibc in order to support very old hardware). There is some work being done (very slowly) to work around this issue here: https://github.com/STEllAR-GROUP/hpx/pull/6521.

As for the actual problem you're describing: please give me some time to investigate and understand the issue. For now, I'd suggest sticking to the default executor.

hkaiser avatar Oct 20 '25 11:10 hkaiser

After further profiling, I believe the issue is madvise(MADV_DONTNEED) causing IPIs and TLB flushes on the other processors. The issue seems similar to what's discussed at https://github.com/bytecodealliance/wasmtime issue 4637 (written out this way to avoid generating an unnecessary crosslink). Quote from that issue: "Allocation of fiber stacks for asynchronously executing WebAssembly primarily goes through mmap right now and unmapping a stack requires broadcasting to other threads with an IPI which slows down concurrent execution."

If I understand correctly that 1 fork = 1 fiber = 1 stack = 1 unmap = 64 IPIs = 64 TLB flushes on a 64-core processor, and these benchmarks do billions of forks... then that would explain what I'm seeing in the perf trace - and why it doesn't seem to matter which scheduler I use.
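
To illustrate the mechanism I suspect (the function name here is mine, not an HPX internal):

// Releasing a fiber stack with madvise(MADV_DONTNEED) makes the kernel
// zap the page range; invalidating stale TLB entries then requires IPIs
// to every other core - the on_each_cpu_cond_mask / flush_tlb_mm_range
// time in the trace below.
#include <sys/mman.h>
#include <cstddef>

void release_fiber_stack(void* stack_base, std::size_t stack_size)
{
    // Physical pages go back to the kernel; the virtual range stays
    // reserved for reuse. Doing this once per fork is catastrophic
    // at billions of forks.
    ::madvise(stack_base, stack_size, MADV_DONTNEED);
}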

Here's the top entry in the perf trace for fib 30 expanded with the callgraph:

Samples: 320K of event 'cycles:P', Event count (approx.): 195313137990
  Children      Self  Command          Shared Object                  Symbol
-   71.70%     0.00%  main-thread#0    [kernel.kallsyms]              [k] entry_SYSCALL_64_after_hwframe
   - 71.70% entry_SYSCALL_64_after_hwframe
      - do_syscall_64
         - 70.85% __x64_sys_madvise
            - do_madvise
               - 69.66% zap_page_range_single
                  - 67.72% tlb_finish_mmu
                     - 66.87% flush_tlb_mm_range
                        - 65.98% on_each_cpu_cond_mask
                           - smp_call_function_many_cond
                              - 27.46% llist_add_batch
                                 - 6.38% asm_sysvec_call_function
                                    - 4.43% sysvec_call_function
                                       - 4.42% __sysvec_call_function
                                          - 4.41% __flush_smp_call_function_queue
                                               2.02% llist_reverse_order
                                               1.35% flush_tlb_func
                              - 3.73% asm_sysvec_call_function
                                 - 2.90% sysvec_call_function
                                    - 2.88% __sysvec_call_function
                                       - 2.88% __flush_smp_call_function_queue
                                            1.21% llist_reverse_order
                                            1.04% flush_tlb_func
                              + 2.26% default_send_IPI_mask_sequence_phys
                     + 0.55% __tlb_batch_free_encoded_pages
                    0.64% unmap_page_range
                  + 0.59% lru_add_drain
                    0.55% tlb_gather_mmu
                 0.77% down_read
+   71.70%     0.02%  main-thread#0    [kernel.kallsyms]              [k] do_syscall_64
+   71.22%     0.06%  main-thread#0    libc.so.6                      [.] __madvise
+   70.87%     0.06%  main-thread#0    [kernel.kallsyms]              [k] do_madvise

tzcnt avatar Oct 20 '25 14:10 tzcnt

I got a 3-10x speedup by setting HPX_WITH_THREAD_STACK_MMAP=OFF. This removes all the syscall time from the perf trace. I will now start benchmarking the comparative implementations once again. To aid me in this process, perhaps you can answer a few questions?

  • Which scheduler would you recommend for this type of workload? Am I correct that a LIFO scheduler would be best? What about work-requesting?
  • Is there a true stackless coroutine type in HPX, or only hpx::future / hpx::async, which are stackful?
  • Is there a more efficient way to fork multiple tasks in parallel than hpx::async()?
  • Which do you recommend for performance - hpx::wait_all() or co_await hpx::when_all()? (The two forms I mean are sketched below.)
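
For clarity, the two forms I'm comparing in the last question look like this (a sketch assuming HPX's C++20 coroutine support, where hpx::future is awaitable and usable as a coroutine return type):

// Two ways to join a pair of child tasks.
#include <hpx/hpx_main.hpp>
#include <hpx/future.hpp>
#include <hpx/tuple.hpp>
#include <utility>

int work(int x) { return x + 1; }

// Blocking form: suspends the current HPX thread until both are ready.
int join_blocking()
{
    auto a = hpx::async(work, 1);
    auto b = hpx::async(work, 2);
    hpx::wait_all(a, b);
    return a.get() + b.get();
}

// Coroutine form: co_await the combined future instead of blocking.
hpx::future<int> join_coro()
{
    auto a = hpx::async(work, 1);
    auto b = hpx::async(work, 2);
    auto both = co_await hpx::when_all(std::move(a), std::move(b));
    co_return hpx::get<0>(both).get() + hpx::get<1>(both).get();
}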

tzcnt avatar Oct 20 '25 14:10 tzcnt

After setting HPX_WITH_THREAD_STACK_MMAP=OFF and experimenting with every combination of build-time parameters and schedulers, I've uploaded new benchmark implementations that improve HPX performance to some degree. The various schedulers didn't seem to have a meaningful impact on performance, so I have disabled the 128-bit atomic hack and removed the manual scheduler setting to simplify things.

What did make a difference was setting the undocumented launch parameter hpx::launch::fork. This increases memory consumption but also improves benchmark performance, so I enabled it on the benchmarks where that tradeoff could be made (I cannot allow memory consumption to exceed 30GB, or I won't be able to run the same code on all of my machines).

From my profiling, the benchmarks all appear to be allocation bound, and they regularly allocate far more memory than I would expect for lazy / LIFO forking (which is what the highest-performance libraries use). If I set hpx::launch::deferred, memory usage stays low and flat, as I would expect, and I still see 64 threads running, but the performance is much slower: fib runs in 63 seconds instead of 15, nqueens in 28 rather than 3, and skynet in 57 rather than 15. Whereas, as I said earlier, hpx::launch::fork uses more memory and is also faster - which is quite counterintuitive.
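
For reference, the three launch policies I compared (comments paraphrase the behavior I observed, so treat them as approximate rather than authoritative):

#include <hpx/hpx_main.hpp>
#include <hpx/future.hpp>

int work() { return 42; }

int main()
{
    // Default: enqueue a new HPX thread; the parent keeps running.
    auto a = hpx::async(hpx::launch::async, work);

    // fork: prefer running the child immediately, suspending the parent.
    // Faster in my benchmarks, but memory use grew.
    auto b = hpx::async(hpx::launch::fork, work);

    // deferred: no new thread; work() runs inline on .get()/.wait().
    // Flat memory, but much slower in these benchmarks.
    auto c = hpx::async(hpx::launch::deferred, work);

    return (a.get() + b.get() + c.get() == 3 * 42) ? 0 : 1;
}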

At this point I am out of ideas to try, so I await your feedback and/or welcome a PR against my repo. Thanks for your time.

tzcnt avatar Oct 22 '25 03:10 tzcnt