Call a function on each hardware thread

eschnett opened this issue on Mar 26, 2017 · 16 comments

For certain low-level tasks, it is necessary to call a function exactly once on each hardware thread, sometimes even concurrently. For example, I might want to check that the hardware threads' CPU bindings are correctly set up by calling the respective hwloc function, or I might want to initialize PAPI on each thread.

(Why do I suspect problems with CPU bindings? Because I used both OpenMP and Qthreads in an application, and I didn't realize that both set the CPU binding for the main thread, but they do it differently, leading to conflicts and 50% performance loss even if OpenMP is not used, and the OpenMP threads are all sleeping on the OS. These kinds of issues are more easily debugged if one has access to certain low-level primitives in Qthreads.)

I currently work around this by starting many threads that each busy-loop for a certain amount of time, and this often succeeds. However, a direct implementation behind an official API would be convenient.
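
For reference, here is a minimal sketch of that kind of rendezvous, using the documented qthreads fork/FEB primitives but treating qthread_incr() as a fetch-and-add that returns the prior value (the overall pattern illustrates the work-around, it is not an official idiom): spawn one task per worker and have each spin until all of them are running at once, so that no two can share a worker.

#include <stdint.h>
#include <stdlib.h>
#include <qthread/qthread.h>

static aligned_t running = 0;

static aligned_t per_worker_task(void *arg) {
    aligned_t nworkers = (aligned_t)(uintptr_t)arg;
    qthread_incr(&running, 1);
    /* Spin without yielding until all tasks are running at once,
       so each task must occupy a distinct worker. */
    while (qthread_incr(&running, 0) < nworkers) ;
    /* ... per-hardware-thread work goes here (hwloc, PAPI, ...) ... */
    return 0;
}

int main(void) {
    qthread_initialize();
    size_t n = qthread_num_workers();
    aligned_t *rets = calloc(n, sizeof(aligned_t));
    for (size_t i = 0; i < n; i++)
        qthread_fork(per_worker_task, (void *)(uintptr_t)n, &rets[i]);
    for (size_t i = 0; i < n; i++)
        qthread_readFF(NULL, &rets[i]); /* block until task i returns */
    free(rets);
    return 0;
}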

eschnett avatar Mar 26 '17 20:03 eschnett

While it's not obvious or documented, you can use qt_loop() for this purpose. qt_loop() guarantees that iterations with the same index will occur on the same processing element AND that the iterations will spread over all processing elements. Thus, qt_loop(0, qthread_num_workers()-1, func, NULL) will effectively call func once on every (non-disabled) hardware thread.
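
For example, a minimal sketch of that call, assuming the qt_loop_f convention from qloop.h (each invocation receives a subrange plus the user argument) and treating the stop bound as exclusive:

#include <stdio.h>
#include <qthread/qthread.h>
#include <qthread/qloop.h>

static void probe(size_t startat, size_t stopat, void *arg) {
    /* report which worker each iteration landed on */
    for (size_t i = startat; i < stopat; i++)
        printf("iteration %zu on worker %u\n",
               i, (unsigned)qthread_worker(NULL));
}

int main(void) {
    qthread_initialize();
    qt_loop(0, qthread_num_workers(), probe, NULL); /* one iteration per worker */
    return 0;
}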

Is that good enough for your purposes?

m3m0ryh0l3 avatar Mar 27 '17 18:03 m3m0ryh0l3

Thank you, qt_loop seems to be doing exactly what I need.

eschnett avatar Mar 28 '17 16:03 eschnett

Are you sure that qt_loop spreads out the work across all workers? I obtained this output:

$ env FUNHPC_NUM_NODES=1 FUNHPC_NUM_PROCS=1 FUNHPC_NUM_THREADS=8 ./hello
FunHPC: Using 1 nodes, 1 processes per node, 8 threads per process
FunHPC[0]: N0 L0 P0 (S0) T5 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T6 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T4 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T7 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T4 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T0 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T4 [cannot set CPU bindings] [cannot determine CPU bindings]
FunHPC[0]: N0 L0 P0 (S0) T4 [cannot set CPU bindings] [cannot determine CPU bindings]

The number after T is the hardware thread, as reported by qthread_worker(0). As you can see, several iterations ran on the same hardware thread 4, while none ran on, e.g., hardware thread 1.

eschnett avatar Mar 28 '17 16:03 eschnett

Ahhh, I see what you mean. I'm guessing you're using the Sherwood scheduler and aren't forcing a shepherd per core. Qthreads only restricts thread (task) mobility to within a shepherd, so if your shepherds aren't limited to single hardware threads (Sherwood defaults them to an L2 cache domain), then you're absolutely right. Hmmm. I guess having a tool to explicitly make a per-hw-thread callback would be handy in some cases. Until one exists, use the environment variables to limit the shepherd boundaries. I forget the exact variable name, but it's something like QT_SHEPHERD_BOUNDARY=pu.
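
Something like this, echoing the invocation above (variable names from memory, as I said, so check your build; QT_NUM_SHEPHERDS plus QT_NUM_WORKERS_PER_SHEPHERD is an alternative way to pin the shepherd/worker split explicitly):

$ env QT_SHEPHERD_BOUNDARY=pu ./hello
$ env QT_NUM_SHEPHERDS=8 QT_NUM_WORKERS_PER_SHEPHERD=1 ./hello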

m3m0ryh0l3 avatar Mar 28 '17 21:03 m3m0ryh0l3

What performance implications does it have to use fewer shepherds? If tasks don't move, how do the shepherds pick up work?

eschnett avatar Mar 29 '17 01:03 eschnett

The performance implications of fiddling with the shepherd/worker balance are somewhat app-specific. Generally, reducing the shepherd boundary to a PU (and thus increasing the shepherd count) turns the scheduler into a pure work-stealing model; how that impacts your performance depends on things like cache affinity between adjacent tasks. On the other hand, increasing the shepherd boundary (e.g., to a socket, thus decreasing the shepherd count) lets inter-task cache affinity get closer to approximately serial behavior. (This is a kinda deep question, and I can point you to an academic paper if you really want to dig into it.)

m3m0ryh0l3 avatar Mar 29 '17 04:03 m3m0ryh0l3

@eschnett was this answer sufficient?

npe9 avatar Jun 13 '17 16:06 npe9

qt_loop did not work for me. I am still using my original work-around, which is to start a set of threads, each blocking until all threads are running.

eschnett avatar Jun 13 '17 18:06 eschnett

@eschnett This actually dovetails into some work I'm doing here. I'll see if I can fix the problem. Can you give me sample code, along with examples of the expected and actual behavior?

npe9 avatar Jun 13 '17 18:06 npe9

The issue with qt_loop seems to be that it doesn't start one thread per core -- it possibly starts the same number of threads for each shepherd, but that isn't sufficient for me. I really need to start one thread per core, e.g. to set up thread affinity via hwloc. (There is some related discussion above regarding schedulers, shepherds, workers, and cores.)

As example code, I would call hwloc and output the hardware core id for each thread.
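
Roughly like this, assuming hwloc >= 1.2 for hwloc_get_last_cpu_location(); the report_binding() body is what would need to run exactly once per worker, by whatever mechanism ultimately provides that guarantee:

#include <stdio.h>
#include <hwloc.h>
#include <qthread/qthread.h>

/* Report which PU the calling worker last executed on. */
static void report_binding(hwloc_topology_t topo) {
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    if (hwloc_get_last_cpu_location(topo, set, HWLOC_CPUBIND_THREAD) == 0)
        printf("worker %u last ran on PU %d\n",
               (unsigned)qthread_worker(NULL), hwloc_bitmap_first(set));
    hwloc_bitmap_free(set);
}

int main(void) {
    qthread_initialize();
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    report_binding(topo); /* would be invoked once on each worker */
    hwloc_topology_destroy(topo);
    return 0;
}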

eschnett avatar Jun 13 '17 18:06 eschnett

Have you looked at the binders options at all?

npe9 avatar Jun 13 '17 21:06 npe9

Yes, I've looked at Qthreads' CPU binding support. The issue is that I might run multiple MPI processes per node, which means that different processes need to use different sets of cores. Setting environment variables to different values for different MPI processes is difficult.

An ideal solution would be if Qthreads had a way to pass in the node-local MPI rank and size.
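
For what it's worth, MPI-3 can already compute the node-local rank and size, so there would be something concrete to pass in. A sketch (the qthreads entry point it would feed is exactly what doesn't exist yet):

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    /* Split COMM_WORLD into per-node communicators. */
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                        0, MPI_INFO_NULL, &node);
    int local_rank, local_size;
    MPI_Comm_rank(node, &local_rank);
    MPI_Comm_size(node, &local_size);
    /* A hypothetical qthreads API could take (local_rank, local_size)
       and carve the node's cores into disjoint binding sets, instead
       of per-process environment variables. */
    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}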

eschnett avatar Jun 14 '17 02:06 eschnett

What if MPI used Qthreads?

npe9 avatar Jun 14 '17 16:06 npe9

@npe9 In what sense would/could MPI use Qthreads?

eschnett avatar Jun 14 '17 17:06 eschnett

Imagine if MPI's underlying threading runtime (for progress and computation threads) were actually Qthreads. So if you're using MPI and Qthreads together, they just "work". This space has been mined before (cf. http://dl.acm.org/citation.cfm?id=2712388). I can help you get Mpiq up if you want to play with it.

npe9 avatar Jun 14 '17 17:06 npe9