
Performance in recursive fork/join is very poor

Open tzcnt opened this issue 2 months ago • 7 comments

I maintain a set of benchmarks to compare the performance of executors: https://github.com/tzcnt/runtime-benchmarks. Currently it mostly tracks fork/join performance. libcoro was added recently and has been underperforming.

I want to give you the opportunity to present libcoro in the best light. Perhaps you can take a look at my benchmark implementation and let me know if there is a way to optimize? The libcoro implementation is here: https://github.com/tzcnt/runtime-benchmarks/tree/main/cpp/libcoro

I'm looking to add some I/O benchmarks soon to compare libraries like boost::cobalt and my own tmc-asio, and I'm excited to benchmark against libcoro, as it's one of the only libraries to provide its own networking implementation.


Some relevant notes on the implementation / docs of libcoro:

thread_pool uses a mutex to guard a single std::deque that is shared among all threads. This appears to be the major performance killer: roughly 96% of profiler samples land in the kernel's native_queued_spin_lock_slowpath. This could be progressively optimized like so (a minimal sketch of the first two steps follows the list):

  • Use a deque + mutex per thread. Have threads push to their own local deque and consume from their own local deque first, before checking the other threads to steal work.
  • Have threads push and pop (LIFO) from their own local deque, but pull (FIFO) from other threads' deques. This maximizes the efficiency of work stealing. It also bounds the maximum number of active tasks, by effectively changing the traversal of the task graph from a BFS to a DFS.
  • Replace the mutex locking with a lock-free MPMC queue.
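
Purely as an illustration of the per-thread-queue idea, here is a minimal sketch (not libcoro's API; work_stealing_pool and all of its members are hypothetical names), assuming the executor only needs to pass raw coroutine handles between workers:

```cpp
// Minimal sketch: one mutex-guarded deque per worker, local LIFO, steal FIFO.
// work_stealing_pool and its members are hypothetical, not libcoro's API.
#include <coroutine>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

class work_stealing_pool
{
public:
    explicit work_stealing_pool(std::size_t thread_count) : m_queues(thread_count) {}

    // Called by worker `id`: push new work onto its own queue (LIFO end).
    void push(std::size_t id, std::coroutine_handle<> h)
    {
        auto& q = m_queues[id];
        std::lock_guard lock{q.mutex};
        q.items.push_back(h);
    }

    // Called by worker `id`: take from its own queue first (LIFO), otherwise
    // steal from another worker's queue (FIFO).
    std::optional<std::coroutine_handle<>> pop(std::size_t id)
    {
        {
            auto& q = m_queues[id];
            std::lock_guard lock{q.mutex};
            if (!q.items.empty())
            {
                auto h = q.items.back(); // LIFO: newest task, keeps traversal DFS-like
                q.items.pop_back();
                return h;
            }
        }
        for (std::size_t i = 0; i < m_queues.size(); ++i)
        {
            if (i == id) continue;
            auto& q = m_queues[i];
            std::lock_guard lock{q.mutex};
            if (!q.items.empty())
            {
                auto h = q.items.front(); // FIFO steal: oldest task, likely the largest remaining subtree
                q.items.pop_front();
                return h;
            }
        }
        return std::nullopt; // no work anywhere; caller can park or spin
    }

private:
    struct per_thread_queue
    {
        std::mutex                          mutex;
        std::deque<std::coroutine_handle<>> items;
    };
    std::vector<per_thread_queue> m_queues;
};
```

The third step would then swap each per_thread_queue for a lock-free work-stealing deque (e.g. a Chase-Lev-style deque) while keeping the same local-LIFO / steal-FIFO policy.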

The requirement to call tp.schedule() to fork introduces additional overhead, as the task must be resumed at least once, only to be immediately suspended and then posted to the executor. It would be much more efficient to introduce a fork() function which directly enqueues the task to the executor without commencing execution. For when_all(), all but one task should be forked, and the last task should be resumed by symmetric transfer.
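
As a rough sketch of that shape (this is not libcoro code; executor_t is assumed to expose only an enqueue(std::coroutine_handle<>) hook, and the bookkeeping that resumes the parent once all children finish is omitted):

```cpp
#include <coroutine>
#include <cstddef>
#include <vector>

// executor_t: any type with enqueue(std::coroutine_handle<>) (hypothetical hook).
template <typename executor_t>
struct when_all_awaitable_sketch
{
    executor_t&                          executor;
    std::vector<std::coroutine_handle<>> children;

    bool await_ready() const noexcept { return children.empty(); }

    std::coroutine_handle<> await_suspend(std::coroutine_handle<>) noexcept
    {
        // Fork: hand every child except the last straight to the work queue,
        // without resuming (and immediately re-suspending) them on this thread.
        for (std::size_t i = 0; i + 1 < children.size(); ++i)
            executor.enqueue(children[i]);

        // Symmetric transfer: this thread dives straight into the last child,
        // skipping a queue round trip for it entirely.
        // NOTE: a real implementation must also record the awaiting handle and a
        // completion counter so the last child to finish resumes the parent.
        return children.back();
    }

    void await_resume() noexcept {}
};
```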

I noticed that when_all_task_promise::final_suspend::completion_notifier::await_suspend returns void and synchronously calls coroutine.promise().m_latch->notify_awaitable_completed(), which in turn synchronously calls m_awaiting_coroutine.resume(). This can lead to stack overflow under deeper recursion. If you use symmetric transfer here instead, this can be avoided. I saw this pattern in a few other places as well... it is really missing out on one of the benefits of C++20 coroutines.
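
For reference, the symmetric-transfer shape of such a final awaiter looks roughly like the sketch below; m_continuation is a hypothetical promise member, not libcoro's exact field:

```cpp
#include <coroutine>

struct final_awaiter_symmetric
{
    std::coroutine_handle<> m_continuation;

    bool await_ready() const noexcept { return false; }

    // Returning a handle (rather than void) resumes the continuation by
    // symmetric transfer: no nested .resume() call, no extra native stack
    // frame per level of recursion, so deep task trees cannot overflow the stack.
    std::coroutine_handle<> await_suspend(std::coroutine_handle<>) noexcept
    {
        return m_continuation ? m_continuation : std::noop_coroutine();
    }

    void await_resume() noexcept {}
};
```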

What is the coro::task_container type? It's mentioned in the docs but I don't see an actual implementation (just a template parameter that references a task_container_type).

tzcnt avatar Oct 20 '25 15:10 tzcnt

Hey, thanks for opening an issue and for your thoughts on the problem. I'm not entirely shocked it's the bottleneck; I think an MPMC work-stealing lock-free queue implementation is probably going to be the best choice to replace it for maximum efficiency.

I'm currently out for the week on a bike trip, so I likely won't be able to give a good answer/resolution for a bit, but this is certainly something I think is worth improving; the std lib constructs are just not that fast.

jbaldwin avatar Oct 22 '25 02:10 jbaldwin

https://github.com/cameron314/concurrentqueue

jbaldwin avatar Oct 27 '25 21:10 jbaldwin

coro::task_container is a way to track task lifetimes outside of a scheduler or thread_pool; it's sort of like a parking lot that holds a set of tasks until they are completed.

The major use case I've seen so far is spinning off a set of tasks on a thread pool or scheduler and tracking when they have all completed. IMO it's effectively just a coro::latch or coro::when_all, so at this point it's likely a redundant data structure.

jbaldwin avatar Oct 28 '25 13:10 jbaldwin

IIUC, task_container is like tbb::task_group? I've actually found this type of construction to be very useful; sometimes it's nice to build the group imperatively rather than passing all subtasks at once.
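
For concreteness, the imperative construction I mean is the tbb::task_group pattern, roughly like this (a sketch, assuming oneTBB is installed):

```cpp
#include <tbb/task_group.h>
#include <atomic>
#include <cstdio>

int main()
{
    tbb::task_group group;
    std::atomic<int> sum{0};

    // Subtasks are added one at a time, whenever they become known,
    // rather than all being passed to a single when_all()-style call up front.
    for (int i = 1; i <= 4; ++i)
    {
        group.run([i, &sum] { sum += i; });
    }

    group.wait(); // join: blocks until every spawned subtask has finished
    std::printf("sum = %d\n", sum.load());
}
```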

tzcnt avatar Oct 28 '25 14:10 tzcnt

IIUC, task_container is like tbb::task_group? I've actually found this type of construction to be very useful; sometimes it's nice to build the group imperatively rather than passing all subtasks at once.

group is a nice name; I might steal that 😉. But yes, I think so.

I think to run your benchmarks you also need the Boost dev packages installed; concurrentcpp failed to build without them.

jbaldwin avatar Oct 28 '25 14:10 jbaldwin

Noted, I think the boost dependency was introduced recently when I added HPX. I'll update the README, sorry about that.

I've been meaning to also update the overall runner script to allow passing specific runtimes as a command line parameter list, but for now you can just edit line 15 of build_and_bench_all.py and remove any that you don't want to run.

tzcnt avatar Oct 28 '25 15:10 tzcnt

Well, I've had an interesting time trying moodycamel::ConcurrentQueue, and I've gotten everything but skynet to within 2x to 4x slower than cppcoro, but I think it's hit its limits. It likely needs a custom solution to get closer to the other library implementations. It also has weird ordering semantics, and issues with thread_pool::yield() on a single-threaded thread pool that can cause infinite loops on certain tests.

The requirement to call tp.schedule() to fork introduces additional overhead, as the task must be resumed at least once, only to be immediately suspended and then posted to the executor. It would be much more efficient to introduce a fork() function which directly enqueues the task to the executor without commencing execution. For when_all(), all but one task should be forked, and the last task should be resumed by symmetric transfer.

Yeah this is a good point. I'll open a ticket to upgrade when_all().

jbaldwin avatar Oct 28 '25 23:10 jbaldwin