
Reduce scheduling overheads to support fork-join style work (was: N-ary tasks)

Open biddisco opened this issue 6 years ago • 10 comments

New feature

During recent discussions with the Kokkos people, and as a result of our failed attempt to make executors variadic, we speculated that task management in HPX could be simplified by making some large changes to the way tasks are created and managed.

  1. Currently hpx::async forwards arguments to executors; tasks are created and eventually passed to the scheduler. The thread_data struct holds our low-level task (https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/runtime/threads/thread_data.hpp) and is created somewhere between the executor and the scheduler using a function like register_thread_nullary (or similar, I don't recall exactly).
  2. When a parallel::for_loop is executed, we might create 32/64/128 or more individual tasks that go onto queues and require a considerable amount of management. In Kokkos they instead create a single task with an atomic counter initialized to 32/64/128/more, which each thread decrements until it reaches zero. Each thread takes the 'chunk' of work corresponding to the counter's index and the original ranges of the loop (see the first sketch after this list). We would need to be careful about iterator requirements (random access/forward etc.), but if a loop over iterable data required 128 tasks, we could use the same strategy.
  3. Variadic executors would let us introspect arguments and take action based on their values. We could promote the creation of thread_data to the async/executor interface and move the arguments into the callable in such a way that we can still access them. Our lowest-level deferred callable type would become a function object with extra information. This would help us pass information to the scheduler, and we could create N-ary tasks from parallel::for_loop and other parallel algorithms that are simple iterations over ranges.
  4. @sithhell hinted that we might move the shared state of the future into this struct to save extra mallocs between task create/destroy.
  5. Schedulers could use lists instead of queues; by using intrusive pointers we could store prev/next in the thread_data struct and elide more mallocs for list nodes (see the second sketch after this list).
  6. Having non-stealable tasks that use affinity to bind to cores would be simplified, because we could add non-steal flags to the thread_data and the scheduler could steal or take work from the middle of a queue, not just the front/back.
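To make point 2 concrete, here is a minimal, HPX-free sketch of the single-task-with-atomic-counter scheme. Everything in it (chunked_task, run_chunks) is hypothetical, and plain std::thread is used only as a stand-in for HPX worker threads:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical "N-ary task": a single shared object instead of N queued tasks.
// Every participating worker decrements an atomic counter to claim the next
// unprocessed chunk until the counter goes negative.
struct chunked_task
{
    std::size_t begin, end;        // overall iteration range [begin, end)
    std::size_t chunk_size;        // elements per chunk
    std::atomic<std::ptrdiff_t> chunks_left;             // chunks not yet claimed
    std::function<void(std::size_t, std::size_t)> body;  // work on [lo, hi)

    void run_chunks()
    {
        for (;;)
        {
            // fetch_sub returns the previous value; the claimed chunk index is one less.
            std::ptrdiff_t chunk =
                chunks_left.fetch_sub(1, std::memory_order_acq_rel) - 1;
            if (chunk < 0)
                return;            // all chunks claimed, nothing left to do

            std::size_t lo = begin + static_cast<std::size_t>(chunk) * chunk_size;
            std::size_t hi = std::min(lo + chunk_size, end);
            body(lo, hi);
        }
    }
};

int main()
{
    std::vector<double> data(1024, 1.0);
    std::size_t const chunk = 128;
    chunked_task task{0, data.size(), chunk,
        static_cast<std::ptrdiff_t>((data.size() + chunk - 1) / chunk),
        [&](std::size_t lo, std::size_t hi) {
            for (std::size_t i = lo; i != hi; ++i)
                data[i] *= 2.0;
        }};

    // Stand-in for HPX worker threads all picking up the single N-ary task.
    std::vector<std::thread> workers;
    for (int t = 0; t != 4; ++t)
        workers.emplace_back([&] { task.run_chunks(); });
    for (auto& w : workers)
        w.join();
}
```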
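And for points 5 and 6, a rough illustration of what an intrusive scheduler list could look like: the prev/next hooks live in a stand-in thread_data (not the real HPX struct), so pushing and popping never allocates list nodes, and work can be unlinked from the middle of a queue in O(1):

```cpp
// Stand-in for HPX's thread_data: the list hooks live inside the task object
// itself, so no separate list-node allocation is ever needed.
struct thread_data
{
    thread_data* prev = nullptr;
    thread_data* next = nullptr;
    // ... task state (function, stack, priority, non-steal flag, ...) would follow
};

// Doubly linked intrusive list: O(1) removal from the middle, which a scheduler
// could use to take non-stealable or affinity-bound work without touching the ends.
struct intrusive_task_list
{
    thread_data* head = nullptr;
    thread_data* tail = nullptr;

    void push_back(thread_data* t)
    {
        t->prev = tail;
        t->next = nullptr;
        if (tail) tail->next = t; else head = t;
        tail = t;
    }

    void remove(thread_data* t)    // t must currently be linked into this list
    {
        if (t->prev) t->prev->next = t->next; else head = t->next;
        if (t->next) t->next->prev = t->prev; else tail = t->prev;
        t->prev = t->next = nullptr;
    }
};

int main()
{
    thread_data a, b, c;
    intrusive_task_list list;
    list.push_back(&a);
    list.push_back(&b);
    list.push_back(&c);
    list.remove(&b);    // e.g. the scheduler claiming work from the middle
}
```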

This idea needs more thought, but I believe that reworking the task creation process in HPX could produce significant performance gains. Please review this list and add more thoughts or comment on the above.

biddisco avatar Jun 14 '18 12:06 biddisco

Discussions on IRC (@sithhell, @msimberg) and elsewhere have raised interesting points that need to be itemized and acted on. Our principal performance-related problem is fork-join parallelism and being competitive with other libraries such as OpenMP. Strategies that could help are:

  • No stealing of tasks once they have moved to the Pending state. This avoids atomic operations in the work queues. (We would need to be aware of progress guarantees for threads that are suspended and never stolen, but are blocked by the current thread being busy.)

  • Stacks should be reused. There is no need to allocate or swap into a new stack if the previous task has completed; just reuse the current stack frame. When a task completes, the current stack should always be reused unless we are resuming a suspended task that already has a stack. Note that this is already done with the current staged/pending queues. We should make sure that a rewrite of the schedulers keeps this, and potentially improves on the way we do it now.

  • Tasks could be reused. Not only the stack can be reused: a task that has finished could be reused for a task that hasn't been started yet, avoiding two context switches.

  • We could also allow tasks to come in two flavours: the user decorates a task in some way to signal that it cannot suspend or have nested parallelism, and we detect violations at run time, since no calls to executors or futures would be allowed in these tasks.

    • Tasks that do suspend (allowed to use futures etc.) vs. tasks that do not suspend (not allowed). It becomes possible at compile time to do different things for each. Non-suspending tasks could even run on the stack frame of the scheduling loop and have no context switch at all; they are just a function invocation.
  • Parallel fork-join regions. Use the existing task blocks, or some other mechanism, to switch from normal task operation to fork-join regions. This would incur the cost of draining thread queues and adding pointless waits, but if the regions where users want to do OpenMP-like operations are big/expensive enough, then these waits can be considered small overheads overall. Overlapping of communication and computation loses out in this approach if we stop HPX doing the kind of background progress we need for network activity (separate IO threads are a possibility). Do we explore OpenMP executors to switch mode from HPX-like to OpenMP-like regions?

  • Low-cost N-ary tasks (only one task for N chunks) that are always executed first on any core. When a parallel::for_loop creates N chunks, if we know that the tasks do not suspend (no recursive parallelism), then they can be assigned to the current worker threads using the correct chunking strategy and executed immediately whenever a worker thread becomes available. They would bypass the normal queues and run directly on the worker threads. There would be no need for a fork-join region (as above), since waiting for the currently executing tasks to complete would occur anyway. OpenMP does not have the overhead of creating and managing these tasks, so we should aim to do the same.

  • @Heller, please add your description of the changes to the scheduling loop that you put on IRC.

  • Thread pools. One could imagine having normal HPX tasks running on one thread pool and an OpenMP-type thread pool, both running on the same cores, with one sleeping whilst the other works. OpenMP leaf-node threads would always have priority; a stripped-down scheduler for OpenMP-type operation would run on that pool, and a conventional thread pool would take over when it was out of work. (This would make integration with OpenMP simple if we actually used OpenMP to do this.)

biddisco avatar Nov 14 '18 12:11 biddisco

Will expand with other points as I remember them, but here's one that we discussed earlier:

  • Distributing the tasks created by parallel for loops etc. over threads in groups of chunks instead of straightforward round-robin. In other words, with t threads and n chunks, the first thread would get the first n/t chunks, the next thread the next n/t chunks, and so on. This, together with stealing tasks from the end (ideally one would only steal tasks from the same parallel for invocation), should help improve cache use (a small sketch follows). This is related to N-ary tasks, of course.
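A small sketch of that block-wise assignment (the helper name is made up): thread i gets the contiguous half-open range of chunks [i·n/t, (i+1)·n/t), rather than every t-th chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Block-wise distribution: thread i gets a contiguous run of chunks rather than
// every t-th chunk (round-robin), keeping neighbouring chunks -- and the cache
// lines they touch -- on the same core.
std::pair<std::size_t, std::size_t> chunks_for_thread(
    std::size_t thread, std::size_t num_threads, std::size_t num_chunks)
{
    std::size_t first = thread * num_chunks / num_threads;
    std::size_t last = (thread + 1) * num_chunks / num_threads;
    return {first, last};    // half-open range [first, last)
}

int main()
{
    // Example: 4 threads, 10 chunks -> [0,2), [2,5), [5,7), [7,10).
    assert((chunks_for_thread(1, 4, 10) == std::pair<std::size_t, std::size_t>{2, 5}));
}
```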

msimberg avatar Nov 14 '18 13:11 msimberg

The idea of implementing task_regions or fork-join regions to allow us to switch between normal HPX task stealing and fork-join type work for simpler loops is nice, but ultimately, if we have to wait for queues to drain, adding fork-join regions will be painful if they are frequently used within regions of 'normal' HPX task stealing.

It would therefore make sense to allow fork-join regions when requested by the user; inside these regions, a different scheduler could operate. It would be a bare-minimum OpenMP-type scheduler that would hopefully perform as well as OpenMP, by using N-ary tasks, running tasks on the scheduler thread directly without any context switch, and not allowing any suspension of tasks.

Coroutines require a keyword that the compiler can use to treat them specially. If we had tasks that were suspendable and others that were not, then we would know when we needed to worry about stack swapping. A 'non-suspending task' type keyword in HPX would allow us to implement tight parallel_for style tasks without stack swapping and provide for a more optimized OpenMP-style workload.

An execution policy with a 'non_suspend' attribute might be an approach we could use to enforce tasks that are not allowed to call any suspend/resume functionality. This would allow us to create N-ary tasks that are executed directly on the scheduling-loop thread, always before any other tasks: a beyond-high-priority class of tasks. N-ary tasks would allow no suspension and no nested parallelism (an assert mechanism could ensure this if we can't enforce it at compile time; a rough sketch follows). Using this approach would permit us to spawn parallel_for loops from within existing HPX-style code that would run as soon as the next task slot on each thread became available, thus avoiding waiting for queues to drain and interoperating better with the existing system. By limiting the scope of these tasks to leaf-node OpenMP/Kokkos-style work, we would hopefully combine the best of both worlds.
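Very roughly, what such an annotation plus run-time check could look like; non_suspend, the policy type and the assert hook are all hypothetical, nothing like them exists in HPX today:

```cpp
#include <cassert>

// Hypothetical marker policy: tasks launched with it promise never to suspend.
struct non_suspend_policy {};
inline constexpr non_suspend_policy non_suspend{};

// Flag the scheduling loop would set while running such a task. Any entry into
// suspending machinery (future::get, yield, nested executors) would assert.
inline thread_local bool in_non_suspending_task = false;

inline void assert_suspension_allowed()
{
    assert(!in_non_suspending_task &&
        "suspension attempted inside a non_suspend task");
}

int main()
{
    assert_suspension_allowed();       // fine: not inside a non-suspending task
    in_non_suspending_task = true;     // the scheduler would set this, not the user
    // Any call into future::get()/yield() would now hit assert_suspension_allowed()
    // and abort, flagging the violation at run time.
    in_non_suspending_task = false;
}
```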

biddisco avatar Nov 27 '18 09:11 biddisco

@sithhell - Have you had any thoughts about this, and does the work you are doing with the scheduler go in this kind of direction? Is there another/better way of achieving performance parity with OpenMP for parallel-for style tight loops?

biddisco avatar Dec 04 '18 17:12 biddisco

I think our problems are twofold. We need solutions developed from the top down (the algorithm point of view) and from the bottom up (optimization in the scheduling). They can progress more or less independently. The fork/join pattern is currently not very well implemented (in terms of performance); that means we are leaving a lot of performance on the table due to dynamic memory allocation of the shared states etc. From the top-down view, an algorithm designed with fork/join in mind should be developed. I haven't put much thought into what this could look like, but I'm opposed to adding special scheduling support (apart from affinity of tasks). This is where the bottom-up aspect comes into play, where we need to optimize our scheduling algorithms (which is what I'm working on). With that, those N-ary tasks can be implemented as a local broadcast kind of thing, for example.

sithhell avatar Dec 05 '18 11:12 sithhell

We currently have two different higher-level constructs encapsulating local fork/join semantics: task_block and spmd_block. While the first currently uses a list of futures for synchronization, the second one uses a barrier. I have not compared their performance.

I think that from the algorithmic standpoint we should start by improving the implementation (performance-wise) of either one of those, and resort to something new only if we can't get that working properly.

Another possible point for implementing optimal fork/join semantics would be the bulk_sync_execute/bulk_async_execute executor functionalities (spmd_block defers to these, for instance). Improving those would probably automatically improve parallel algorithm performance as well.
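For reference, a stripped-down sketch of the bulk execution hook being referred to. The member signature is simplified from memory and may not match current HPX exactly, and the std::async body is only a placeholder for whatever optimized fork/join implementation an executor would actually provide (e.g. a single N-ary task handed to persistent workers):

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Sketch of an executor providing the bulk asynchronous execution hook.
// The naive std::async body stands in for an optimized implementation.
struct fork_join_executor
{
    // f is expected to be callable as void(std::size_t).
    template <typename F>
    std::vector<std::future<void>> bulk_async_execute(F&& f, std::size_t n)
    {
        std::vector<std::future<void>> results;
        results.reserve(n);
        for (std::size_t i = 0; i != n; ++i)
            results.push_back(std::async(std::launch::async, f, i));
        return results;
    }
};

int main()
{
    fork_join_executor exec;
    auto futures = exec.bulk_async_execute(
        [](std::size_t /*chunk*/) { /* process this chunk */ }, 8);
    for (auto& f : futures)
        f.get();    // join the fork/join region
}
```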

hkaiser avatar Dec 05 '18 13:12 hkaiser

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 04 '19 06:07 stale[bot]

Issues #3348 and #3553 should not be closed as they are under consideration at the time of writing.

biddisco avatar Jul 10 '19 07:07 biddisco

I did some experimenting with different ways of launching tasks in a fork-join style and ended up with the conclusion that anything that has to spawn new tasks for each parallel region pays a significant overhead (i.e. the minimum feasible grain size is significantly limited). Even stackless tasks end up going through a lot of code before they are actually executed.

I'm sure there are many ways to address this, but one way of getting around it in "user code" (i.e. executors) is reusing threads and busy waiting (what Kokkos and OpenMP do). This means that a parallel region can start in nanoseconds, or at most a couple of microseconds (after the first launch). The downside is that it blocks a full worker thread even while waiting, but this may be an acceptable tradeoff. I think such an executor would be interesting to have as an intermediate step in transforming OpenMP applications to HPX (and I'd volunteer to write it; I already have some pieces in place). Having it as an executor doesn't force blocking semantics on anyone, but gives them the option to use it if needed. There's also no reason why an implementation like this wouldn't get very close to the performance of OpenMP or Kokkos parallel regions.
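A bare-bones sketch of the reuse-threads-and-busy-wait approach (all names invented; this is not the executor mentioned above, only an illustration of why region start-up can be so cheap): workers spin on a generation counter, so launching a region costs the calling thread only a couple of atomic operations.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

class busy_wait_pool
{
    std::vector<std::thread> workers_;
    std::function<void(std::size_t)> work_;       // body of the current region
    std::atomic<std::uint64_t> generation_{0};    // bumped once per region
    std::atomic<std::size_t> done_count_{0};      // workers finished this region
    std::atomic<bool> stop_{false};
    std::size_t num_workers_;

public:
    explicit busy_wait_pool(std::size_t n) : num_workers_(n)
    {
        for (std::size_t rank = 0; rank != n; ++rank)
        {
            workers_.emplace_back([this, rank] {
                std::uint64_t seen = 0;
                for (;;)
                {
                    // Busy-wait for the next region (or shutdown).
                    while (generation_.load(std::memory_order_acquire) == seen)
                    {
                        if (stop_.load(std::memory_order_relaxed))
                            return;
                    }
                    ++seen;
                    work_(rank);                                  // run our part
                    done_count_.fetch_add(1, std::memory_order_release);
                }
            });
        }
    }

    // Run f(rank) on every worker and wait for all of them to finish.
    void parallel_region(std::function<void(std::size_t)> f)
    {
        work_ = std::move(f);
        done_count_.store(0, std::memory_order_relaxed);
        generation_.fetch_add(1, std::memory_order_release);      // release the workers
        while (done_count_.load(std::memory_order_acquire) != num_workers_)
            ;                                                      // busy-wait for completion
    }

    ~busy_wait_pool()
    {
        stop_.store(true, std::memory_order_relaxed);
        for (auto& w : workers_)
            w.join();
    }
};

int main()
{
    std::vector<double> data(1 << 20, 1.0);
    busy_wait_pool pool(4);

    // One "parallel region": each rank scales its block of the vector. Starting
    // the region costs only a couple of atomic operations on the calling thread.
    pool.parallel_region([&](std::size_t rank) {
        std::size_t const block = data.size() / 4;
        for (std::size_t i = rank * block; i != (rank + 1) * block; ++i)
            data[i] *= 2.0;
    });
}
```

The obvious tradeoff, as noted above, is that the workers burn their cores while spinning; yielding occasionally is the "something in between" discussed next.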

The interesting extension would then be whether we can have something in between, where the busy loops can yield occasionally, and perhaps even return futures for the completion of a particular bulk execution, without significantly affecting the startup time of a parallel region.

msimberg avatar May 06 '20 08:05 msimberg

Agreed. I even floated the idea some time back of switching out the entire scheduling loop so that we could do away with the whole task creation and queuing used for 'normal' HPX tasks and instead have a minimal OpenMP-like one.

biddisco avatar May 06 '20 14:05 biddisco