fiber icon indicating copy to clipboard operation
fiber copied to clipboard

A few notes on work_stealing

Open lenerd opened this issue 5 years ago • 2 comments

Hi all,

I wanted to implement a thread pool for executing fibers using the work_stealing algorithm from the library. While doing this I ran into a couple of problems. Maybe these are intended limitations of the library but I have not anticipated them by just reading the documentation. So I thought I just post here what I noted while using Boost.Fiber.

  • You cannot use multiple instances of work_stealing scheduler concurrently in an application:

    work_stealing maintains the list and number of its instances in two static class members: https://github.com/boostorg/fiber/blob/1bc393b3ba8357787d045f6a13a5f7c0de1769d2/include/boost/fiber/algo/work_stealing.hpp#L39-L40 Therefore, we cannot have two disjoint groups of threads using work_stealing to distribute the fibers among themselves.

  • You cannot use multiple instances of work_stealing scheduler sequentially in an application:

    The static member counter_ is initialized with 0 https://github.com/boostorg/fiber/blob/1bc393b3ba8357787d045f6a13a5f7c0de1769d2/src/algo/work_stealing.cpp#L26 and incremented as soon as a thread arrives https://github.com/boostorg/fiber/blob/1bc393b3ba8357787d045f6a13a5f7c0de1769d2/src/algo/work_stealing.cpp#L36-L37 The value (saved in id_) is then used as vector index of the current thread. However, there is no way to reset counter_ to 0. Hence, we cannot use work_stealing a second time.

  • You need to have at least two threads to execute work_stealing algorithm:

    Otherwise, the single thread may try to steal from a random thread that is different from itself and cannot succeed.

    It may not make much sense to perform the algorithm with a single thread but this way you cannot just change one parameter in order get a single worker thread performing the tasks.

The first points I have circumvented by adapting the code of work_stealing as pooled_work_stealing and moving the static variable into a pool_ctx class (see lenerd/fiber_tp).

  • You need to keep the "main" fibers running until all fibers managed by the scheduler are executed: There is something written in the documentation: Keep workers running

    In order to keep the worker thread running, the fiber associated with the thread stack (which is called “main” fiber) is blocked. For instance the “main” fiber might wait on a condition_variable. For a gracefully shutdown condition_variable is signalled and the “main” fiber returns. The scheduler gets destructed if all fibers of the worker thread have been terminated.

    Initially I understood this paragraph (and especially the last sentence) in the way that I need to prevent the "pool" to get empty (similarly to using io_context::work in Boost.Asio). However, if I understood the behavior correctly, the "main" fiber associated with the thread stack needs to stay alive until all other fibers are completed.

This leads to the next point:

  • Assume we have such a pool of worker threads and we have created all fibers that we want to run in this pool. Moreover, let there be a fiber which is currently blocked, e.g., waiting for a future whose value is set by some thread outside of the pool. Before I am allowed to close the pool and let the "main" fibers terminate, I have to be sure that this fiber is also completed.

    Is there a way to find out if there are still (possibly blocked) fibers waiting to get completed?

That's it so far. Thanks for the nice library.

lenerd avatar Oct 30 '19 17:10 lenerd

While benchmarking this library and diving in a lot of segmentation violations, I think I just hit this static member problem...

There is also a similar problem in the NUMA version https://github.com/boostorg/fiber/blob/develop/include/boost/fiber/numa/algo/work_stealing.hpp#L41 and the shared-work scheduler https://github.com/boostorg/fiber/blob/develop/include/boost/fiber/algo/shared_work.hpp#L40

keryell avatar Mar 11 '20 18:03 keryell

Ran into this today:

You need to have at least two threads to execute work_stealing algorithm

I tried to create work_stealing with 1 thread, but it hung in this_fiber::sleep_for(). This limitation should be documented.

yotann avatar Mar 10 '22 21:03 yotann